BeautifulSoup.select method

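This script is supposed to take a search string from the command line, run it through Google, and, if there are results, open the first five in separate browser tabs. I'm having trouble getting it to work. I think the problem is near the bottom, at links = soup.select(".r a"). I've been changing the selector there and checking the length printed on the next line, but run as shown, the length is still 0. I'm trying to scrape the anchor tags inside the .r class, because that seems to be where the search results start in the source of Google's results page.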

import requests 
import bs4 
import sys 
import webbrowser 

print("Googling...") 
response = requests.get("https://www.google.com/#q=" + " ".join(sys.argv[1:])) 
response.raise_for_status() 

'''Function to return the top search result links''' 
soup = bs4.BeautifulSoup(response.text, "html.parser") 

'''Open a browser tab for each result''' 
links = soup.select(".r a") 
print(len(links)) 
numOpen = min(5, len(links)) 

for i in range(numOpen): 
    webbrowser.open("https://google.com/#q=" + links[i].get("href"))


2016-12-24 Tarrell13

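That's because Google is a JavaScript-heavy site. The HTML you get in the `response` object contains little more than links to JavaScript sources. The JavaScript has to be fetched and executed before any search results appear.

I'd suggest using a full web scraping tool such as PhantomJS (http://phantomjs.org/).

@Abdelhakim Akodadi Google is a JavaScript-heavy site, but I noticed that `response.text` does contain the full HTML, links included. The logic is right; only the URL is wrong. – Eddie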

Your logic is correct, except that the Google search URL is wrong.

It should be:

response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:])) 
... 
for i in range(numOpen): 
    webbrowser.open("https://www.google.com" + links[i].get("href"))
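As a rough illustration of why the original URL returned no results: the part of a URL after `#` is a fragment, which HTTP clients keep to themselves and do not send to the server, whereas `?q=...` is a query string that is sent. The sketch below uses only the standard library; the helper `request_target` and the search term are made up for the example and are not part of the original script.

```python
from urllib.parse import urlsplit

# URL form from the question: everything after '#' is a fragment.
broken = urlsplit("https://www.google.com/#q=beautiful+soup")
# URL form from the answer: '?q=...' is a query string on the /search path.
fixed = urlsplit("https://www.google.com/search?q=beautiful+soup")


def request_target(parts):
    """Return path plus query, i.e. what the server is actually asked for.

    Hypothetical helper for illustration; the fragment is deliberately
    left out because it is never transmitted in the HTTP request.
    """
    return parts.path + ("?" + parts.query if parts.query else "")


print(request_target(broken))  # the search term is missing from the request
print(request_target(fixed))   # the search term is part of the request
```

With the `#q=` form the server only ever sees a request for the home page, so there are no result links for `soup.select` to find.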

Here is the complete code:

import requests 
import bs4 
import sys 
import webbrowser 

print("Googling...") 
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:])) 
response.raise_for_status() 

# Parse the search results page
soup = bs4.BeautifulSoup(response.text, "html.parser")

# Open a browser tab for each result
links = soup.select(".r a") 
print(len(links)) 
numOpen = min(5, len(links)) 

for i in range(numOpen): 
    webbrowser.open("https://www.google.com" + links[i].get("href"))
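To check the selector without hitting the network, the same `select(".r a")` call can be tried on a small inline page. This is only a sketch: `SAMPLE_HTML` is invented markup that imitates the layout described in the question (real Google HTML differs and changes over time), and `result_urls` is a hypothetical helper. It uses `urljoin` in place of string concatenation to turn the relative `href` values into absolute URLs.

```python
from urllib.parse import urljoin

import bs4

# Invented markup for illustration only; not real Google output.
SAMPLE_HTML = """
<div class="g"><h3 class="r"><a href="/url?q=https://example.com/a">A</a></h3></div>
<div class="g"><h3 class="r"><a href="/url?q=https://example.com/b">B</a></h3></div>
<div class="footer"><a href="/preferences">Settings</a></div>
"""


def result_urls(html, base="https://www.google.com", limit=5):
    """Return up to `limit` absolute URLs for <a> tags nested inside class "r"."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # ".r a" is a descendant selector: any <a> inside an element with class "r".
    links = soup.select(".r a")
    # Skip anchors without an href, then resolve each one against the base URL.
    return [urljoin(base, a.get("href")) for a in links[:limit] if a.get("href")]


print(result_urls(SAMPLE_HTML))
```

The footer link is not returned because it is not inside an element with class `r`. If `select` returns an empty list on a real page, either the markup no longer uses that class or the request never reached the results page.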


2016-12-24 02:40:27 Eddie

Wow... one small change on the response line fixed the whole thing :). – Tarrell13
