
Getting a list of links from Google News

I'm using Python and BeautifulSoup to collect the Google News links as a list. This is how far I've gotten:

import requests 
from bs4 import BeautifulSoup 
import re 
#url is just some google link, not to worried about being able to search from Python code 
url = "https://www.google.com.mx/search?biw=1526&bih=778&tbm=nws&q=amazon&oq=amazon&gs_l=serp.3..0l10.1377.2289.0.2359.7.7.0.0.0.0.116.508.5j1.6.0....0...1.1.64.serp..1.5.430.0.19SoRsczxCA" 
#this part of the code avoids error 403, we need to identify ourselves 
browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7' 
headers={'User-Agent':browser,} 
#getting our html 
page = requests.get(url) 
soup = BeautifulSoup(page.content, "lxml") 
#looking for links and adding them up as a list 
links = soup.findAll("a") 
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")): 
list=(re.split(":(?=http)",link["href"].replace("/url?q=",""))) 
print(list) 

My question is: why don't some of the links work? For example:

Forbes, El Financiero, El Mundo, Cnet

Answers


This code should work:

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com.mx/search?biw=1526&bih=778&tbm=nws&q=amazon&oq=amazon&gs_l=serp.3..0l10.1377.2289.0.2359.7.7.0.0.0.0.116.508.5j1.6.0....0...1.1.64.serp..1.5.430.0.19SoRsczxCA"
browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': browser}
# pass the headers so the request identifies itself and avoids a 403
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

l = []

# keep only anchors whose href is a "/url?q=" redirect to an external page
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    # strip the "/url?q=" prefix and take the first element of the split
    l.append(re.split(r":(?=http)", link["href"].replace("/url?q=", ""))[0])

print(l)

A few notes:

  • Never use list as a variable name! It shadows the built-in list type.
  • If you want the links, append them to your list instead of overwriting the variable on every iteration; use the list's append method for that.
  • re.split returns a list, so you need to take its first element (that is why I use [0]); there is a short sketch of this below.
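
To make the last two notes concrete, here is a minimal sketch of what the replace/split step does to one href. The href value is a made-up example of Google's /url?q= redirect format, not a result taken from the search above:

import re

# hypothetical href in the form Google's result page uses: the real target
# follows "/url?q=" and Google's tracking parameters are appended after it
href = "/url?q=https://www.forbes.com/some-article/&sa=U&ved=0ahUKEwi"

# strip the "/url?q=" prefix, then split on any ":" immediately followed by "http"
parts = re.split(r":(?=http)", href.replace("/url?q=", ""))
print(parts)     # ['https://www.forbes.com/some-article/&sa=U&ved=0ahUKEwi']
print(parts[0])  # the single element that gets appended to the list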

All of the links you mention show a "404 Page Not Found" error when opened in a browser, so those links are broken or dead. You can refer to this wiki.

You need to check the URL's response status code and then parse the page content with BeautifulSoup.

...
page = requests.get(url)
# parse the page only when the request succeeded (HTTP 200)
if page.status_code == requests.codes.ok:
    soup = BeautifulSoup(page.content, "lxml")
    ....
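
Putting the two answers together, a minimal sketch of filtering out the dead links could look like the following; extracting the result links mirrors the first answer's code, while checking each extracted link with a HEAD request and requests.codes.ok follows this answer's suggestion. The shortened search URL and the variable names are illustrative assumptions:

import requests
from bs4 import BeautifulSoup
import re

# shortened Google News search URL, used here only for illustration
url = "https://www.google.com.mx/search?tbm=nws&q=amazon"
browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': browser}

page = requests.get(url, headers=headers)
live_links = []

if page.status_code == requests.codes.ok:
    soup = BeautifulSoup(page.content, "lxml")
    for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
        target = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))[0]
        try:
            # keep the link only if the target site itself answers with HTTP 200
            resp = requests.head(target, headers=headers, allow_redirects=True, timeout=5)
            if resp.status_code == requests.codes.ok:
                live_links.append(target)
        except requests.RequestException:
            # unreachable host, timeout, etc. -- treat it as a dead link and skip it
            pass

print(live_links)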