Scraping with BeautifulSoup in Python

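I want to scrape some data from this link: http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100 For example, I am using BeautifulSoup to extract each reviewer's name, but it does not work. I have used BeautifulSoup with other websites and it worked perfectly! I do not know what is going on. Could you help me? The code is as follows: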

from bs4 import BeautifulSoup 
import os 
import urllib.request 


file1 = open(os.path.expanduser(r"~/Desktop/Skytrax Reviews1.csv"), "wb") 

file1.write(b"Reviewer" + b"\n") 

WebSites = ["http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"] 


# looping through each site until it hits a break. I will create a loop. It is not ready yet 
for theurl in WebSites: 
    thepage = urllib.request.urlopen(theurl) 
    print(thepage) 
    soup = BeautifulSoup(thepage,'lxml') 
    print(soup) #<-------This is the main problem 

#Maybe it is not correct too but the main problem is at the above lines 
    for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}).text: 
     print(Reviewer) 

     Record1 = Reviewer 
     file1.write(bytes(Record1, encoding="ascii", errors='ignore') + b"\n") 


file1.close()


2017-03-31 T.Athanas

The site does not return the same page that you see in your browser. Try:

wget -qO- 'http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100'

Or try changing the User-Agent of the request.
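As a rough sketch of the User-Agent suggestion, the request can be built with urllib.request.Request and a headers dict instead of passing the bare URL to urlopen. The helper names build_request and fetch_html and the User-Agent string are assumptions made for illustration. Whether a browser-like User-Agent alone is enough for this particular site is not confirmed here.

```python
import urllib.request

# Assumed value: any common browser User-Agent string could be used here.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/57.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page body as bytes using the custom User-Agent."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

The bytes returned by fetch_html(theurl) can then be passed to BeautifulSoup in place of thepage.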


2017-03-31 14:54:06

@Rusa_x Thank you for your answer. I am new to Python. I used the same link as you.

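If you open the site with Chrome Network Tools or Firebug, you will notice that it uses cookies to validate requests.

You can simulate the cookies by creating a dict in Python and sending it along with your request.

In my example I am using requests. Also, you should not put .text on the findAll call in your loop: findAll returns a list of results, which has no .text attribute, so it raises an error.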

from bs4 import BeautifulSoup 
import requests 

cookies = { 
'PHPSESSID':'1gd0sknluds2uvumsglth523g5', 
'visid_incap_965359':'UGNtvJR1TAmP1y+/M85QuJ1s3lgAAAAAQUIPAAAAAAB5IOYuRCw/9mMOpTnRDCJ6', 
'incap_ses_315_965359':'PRZ8WIgqnhyeicz5PxxfBLFs3lgAAAAAYWoblc6exwqhEeGRPqgA5Q==' 
} 

response = requests.get('http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100', cookies=cookies)
soup = BeautifulSoup(response.content, "html.parser") 
for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}): 
    print(Reviewer.get_text(strip=True))
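
To see why .text cannot be applied to the whole findAll result, the behaviour can be checked offline on a small piece of markup. This is only a sketch: SAMPLE_HTML is made-up markup that imitates the class names used above, not the real page, and the function names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the reviewer header class on the page.
SAMPLE_HTML = """
<div>
  <h3 class="text_sub_header userStatusWrapper">Alice (Vietnam)</h3>
  <h3 class="text_sub_header userStatusWrapper">Bob (Australia)</h3>
</div>
"""

REVIEWER_ATTRS = {"class": "text_sub_header userStatusWrapper"}


def reviewer_names(html):
    """Call get_text on each matched element, not on the result list."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(attrs=REVIEWER_ATTRS)]


def result_list_has_text(html):
    """Return False if reading .text on the find_all result raises AttributeError."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(attrs=REVIEWER_ATTRS)
    try:
        matches.text
    except AttributeError:
        return False
    return True
```

Here find_all is the current name for findAll. The same per-element pattern applies when the names are written to the CSV file.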


2017-03-31 15:04:16 Zroq

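Thank you for your answer! But from soup = BeautifulSoup(response.content, "html.parser") I still get the same response: <meta content="telephone=no" name="format-detection">

Use different cookie values: go to the network tools and find your own to test with. – Zroq

@Zroq I changed the cookies, but print(Reviewer.get_text(strip=True)) does not print anything.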
