2008-10-13 55 views
3

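Given a string of keywords, such as "Python best practices", I would like to get the top 10 Stack Overflow questions that contain those keywords from a Python script, sorted by relevance (?). My goal is to end up with a list of tuples (title, URL). How do I search Stack Overflow questions from a script?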

How should I go about this? Would you consider querying Google? (How would you do that from Python?)

Answers

5
>>> from urllib import urlencode 
>>> params = urlencode({'q': 'python best practices', 'sort': 'relevance'}) 
>>> params 
'q=python+best+practices&sort=relevance' 
>>> from urllib2 import urlopen 
>>> html = urlopen("http://stackoverflow.com/search?%s" % params).read() 
>>> import re 
>>> links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html) 
>>> links 
[('/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines &#8220;pythonian&#8221; or &#8220;pythonic&#8221;?'), ('/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')] 
>>> from urlparse import urljoin 
>>> links = [(urljoin('http://stackoverflow.com/', url), title) for url,title in links] 
>>> links 
[('http://stackoverflow.com/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('http://stackoverflow.com/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('http://stackoverflow.com/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('http://stackoverflow.com/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines &#8220;pythonian&#8221; or &#8220;pythonic&#8221;?'), ('http://stackoverflow.com/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('http://stackoverflow.com/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('http://stackoverflow.com/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('http://stackoverflow.com/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('http://stackoverflow.com/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')] 

Turning this into a function should be trivial.

EDIT: Heck, I'll do it...

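def get_stackoverflow(query):
    # Python 2 only: urllib2 and urlparse were reorganized in Python 3.
    import urllib, urllib2, re, urlparse
    # Build the query string, e.g. 'q=python+best+practices&sort=relevance'.
    params = urllib.urlencode({'q': query, 'sort': 'relevance'})
    # Download the search results page.
    html = urllib2.urlopen("http://stackoverflow.com/search?%s" % params).read()
    # Capture (relative URL, title) from each result heading.
    links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html)
    # Make the URLs absolute.
    links = [(urlparse.urljoin('http://stackoverflow.com/', url), title) for url, title in links]

    return links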
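
For readers on Python 3, here is a rough port of the function above. It is only a sketch: the module names (`urllib.parse`, `urllib.request`, `html`) are the standard Python 3 locations, while the helper names `build_search_url` and `parse_links` are made up for illustration. It assumes the results page still uses the `answer-title` markup from this answer, which may no longer be true.

```python
# Python 3 sketch of get_stackoverflow(); helper names are hypothetical.
import html
import re
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "https://stackoverflow.com/"

# Same pattern as the answer above; it only matches that era's page markup.
LINK_RE = re.compile(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>')


def build_search_url(query):
    """Return the search URL for `query`, sorted by relevance."""
    params = urlencode({"q": query, "sort": "relevance"})
    return urljoin(BASE_URL, "search?" + params)


def parse_links(page):
    """Extract (absolute_url, title) pairs from a results page string."""
    return [
        (urljoin(BASE_URL, url), html.unescape(title))  # decode &#8220; etc.
        for url, title in LINK_RE.findall(page)
    ]


def get_stackoverflow(query):
    """Fetch the results page and return its (url, title) pairs."""
    with urlopen(build_search_url(query)) as response:
        page = response.read().decode("utf-8", errors="replace")
    return parse_links(page)
```

Splitting the parsing from the download keeps `parse_links` usable on saved HTML, so the pattern can be checked without a network request.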
1

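You could screen-scrape the HTML returned by a valid HTTP request. But that would lead to bad karma, and to losing the ability to enjoy a good night's sleep.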

4

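Since Stack Overflow already has this feature, you just need to fetch the contents of the search results page and scrape the information you need. Here is the URL for a search sorted by relevance: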

https://stackoverflow.com/search?q=python+best+practices&sort=relevance

If you view the source, you'll see that the information you need for each question is on a line like this:

<h3><a href="https://stackoverflow.com/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150" class="answer-title">What are the best RSS feeds for programmers/developers?</a></h3> 

So you should be able to get the top ten by doing a regular-expression search for strings of that form.
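
As a sketch of that regex step, the snippet below pulls up to ten matches out of a page source string and returns them as (title, URL) tuples, which is the shape the question asks for. The function name `top_results` and its parameters are invented for this example, and the pattern is simply copied from the sample line above, so it will only match pages that still use that markup.

```python
import html
import re
from itertools import islice
from urllib.parse import urljoin

# Pattern taken from the sample line above; an assumption for any other markup.
RESULT_RE = re.compile(
    r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>'
)


def top_results(page_source, limit=10, base="https://stackoverflow.com/"):
    """Return up to `limit` (title, url) tuples in page order."""
    matches = islice(RESULT_RE.finditer(page_source), limit)
    return [
        # Decode HTML entities in the title and make relative links absolute.
        (html.unescape(m.group(2)), urljoin(base, m.group(1)))
        for m in matches
    ]


sample = (
    '<h3><a href="https://stackoverflow.com/questions/5119/'
    'what-are-the-best-rss-feeds-for-programmersdevelopers#5150" '
    'class="answer-title">What are the best RSS feeds for '
    'programmers/developers?</a></h3>'
)
print(top_results(sample))
```

Page order is used as a stand-in for relevance here, on the assumption that the search URL was requested with `sort=relevance`.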

0

I just use Pycurl, concatenating the search terms onto the query URI.
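
The answer gives no code, so what follows is only a guess at what that might look like. It assumes the third-party `pycurl` package is installed and that the search URI is the one quoted elsewhere in this thread. The function names `build_query_uri` and `fetch_with_pycurl` are hypothetical.

```python
from io import BytesIO
from urllib.parse import urlencode

# Assumed search endpoint, taken from the URL quoted in this thread.
SEARCH_URI = "https://stackoverflow.com/search"


def build_query_uri(terms):
    """Append the URL-encoded search terms to the search URI."""
    return SEARCH_URI + "?" + urlencode({"q": terms})


def fetch_with_pycurl(terms):
    """Fetch the results page for `terms` and return it as text."""
    import pycurl  # third-party; imported here so the URI helper works without it

    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, build_query_uri(terms))
    curl.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects, if any
    curl.setopt(pycurl.WRITEDATA, buffer)     # collect the response body
    try:
        curl.perform()
    finally:
        curl.close()
    return buffer.getvalue().decode("utf-8", errors="replace")
```

The returned text would still need to be scraped for titles and links, as the other answers describe.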