2017-03-31 81 views
0

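Scraping with BeautifulSoup (Python)

I want to scrape some data from this link: http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100 For example, I am using BeautifulSoup to extract each reviewer's name, but it does not work. I have used BeautifulSoup on other websites and it worked perfectly! I don't know what is going on. Could you help me? The code is below: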

from bs4 import BeautifulSoup 
import os 
import urllib.request 


file1 = open(os.path.expanduser(r"~/Desktop/Skytrax Reviews1.csv"), "wb") 

file1.write(b"Reviewer" + b"\n") 

WebSites = ["http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"] 


# looping through each site until it hits a break. I will create a loop. It is not ready yet 
for theurl in WebSites: 
    thepage = urllib.request.urlopen(theurl) 
    print(thepage) 
    soup = BeautifulSoup(thepage,'lxml') 
    print(soup) #<-------This is the main problem 

#Maybe it is not correct too but the main problem is at the above lines 
    for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}).text: 
     print(Reviewer) 

     Record1 = Reviewer 
     file1.write(bytes(Record1, encoding="ascii", errors='ignore') + b"\n") 


file1.close() 

Answers

0

The site does not return the same page that you see in your browser. Try:

wget -qO- 'http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100'

Or try changing the user agent of the request.
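For the user-agent suggestion, here is a minimal sketch that stays with urllib.request from the question. The helper names build_request and fetch_html and the example User-Agent string are my own assumptions, not something this site is known to require, and a different header may still not be enough to get the real page.

```python
import urllib.request

# Assumption: an arbitrary browser-like User-Agent string, used only as an example.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/57.0 Safari/537.36"
)


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=EXAMPLE_USER_AGENT, timeout=30):
    """Download the page body as bytes (this performs a network request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

You would then pass the result of fetch_html(theurl) to BeautifulSoup instead of the object returned by urlopen(theurl).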

+0

@Rusa_x Thank you for your answer. I am new to Python. I am using the same link as you. –

0

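If you open the site with Chrome's Network tools or Firebug, you will notice that it uses cookies to validate requests.

You can emulate the cookies by creating a dict in Python and sending it along with your request.

In my example I am using requests. Also, you should not put .text on the findAll call in your loop; it will give you an error, because findAll returns a list of elements rather than a single element.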

from bs4 import BeautifulSoup 
import requests 

cookies = { 
'PHPSESSID':'1gd0sknluds2uvumsglth523g5', 
'visid_incap_965359':'UGNtvJR1TAmP1y+/M85QuJ1s3lgAAAAAQUIPAAAAAAB5IOYuRCw/9mMOpTnRDCJ6', 
'incap_ses_315_965359':'PRZ8WIgqnhyeicz5PxxfBLFs3lgAAAAAYWoblc6exwqhEeGRPqgA5Q==' 
} 

response = requests.get('http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100', cookies=cookies)
soup = BeautifulSoup(response.content, "html.parser") 
for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}): 
    print(Reviewer.get_text(strip=True)) 
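To see the .text point without touching the network, here is a small sketch that runs on an inline HTML snippet. The snippet and the reviewer names in it are made up for illustration; only the class string comes from the question, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class string from the question.
SAMPLE_HTML = """
<div>
  <h3 class="text_sub_header userStatusWrapper">Reviewer A</h3>
  <h3 class="text_sub_header userStatusWrapper">Reviewer B</h3>
</div>
"""


def reviewer_names(html):
    """Return the stripped text of every element with the reviewer class."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(attrs={"class": "text_sub_header userStatusWrapper"})
    # find_all returns a list-like ResultSet, so read the text per element.
    return [tag.get_text(strip=True) for tag in tags]


def result_set_has_text(html):
    """Check whether the ResultSet itself exposes a .text attribute."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(attrs={"class": "text_sub_header userStatusWrapper"})
    return hasattr(tags, "text")
```

If reviewer_names comes back empty for the downloaded page, the HTML you received most likely does not contain the review markup at all, which points back to the request rather than the parsing.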

[Screenshot: Cookies]

+0

Thank you for your answer! But from soup = BeautifulSoup(response.content, "html.parser") I get the same response: <meta content="telephone=no" name="format-detection">

+0

Use different cookie values: go to the network tools and find your own, then test. – Zroq
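One way to bring your own values over, as a rough sketch: copy the Cookie request header from the network tools and split it into a dict. The helper name cookies_from_header is hypothetical and the values in the usage line are placeholders. Whether the site accepts cookies copied this way is an assumption.

```python
def cookies_from_header(cookie_header):
    """Turn 'name1=value1; name2=value2' into a dict for requests' cookies= argument."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, since cookie values may contain '=' themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies
```

For example, cookies = cookies_from_header("PHPSESSID=your-value; visid_incap_965359=your-value") gives a dict you can pass as requests.get(url, cookies=cookies).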

+0

@Zrop I changed the cookies, but the line print(Reviewer.get_text(strip=True)) does not print anything. –