2017-05-23 42 views
0

BeautifulSoup cannot read the page

I am trying the following:

from urllib2 import urlopen 
from BeautifulSoup import BeautifulSoup 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
soup = BeautifulSoup(urlopen(url).read()) 
print soup 

The print statement above displays the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
     "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
<head> 
<meta http-equiv="Content-type" content="text/html;charset=utf-8" /> 
<title>Travis Property Search</title> 
<style type="text/css"> 
     body { text-align: center; padding: 150px; } 
     h1 { font-size: 50px; } 
     body { font: 20px Helvetica, sans-serif; color: #333; } 
     #article { display: block; text-align: left; width: 650px; margin: 0 auto; } 
     a { color: #dc8100; text-decoration: none; } 
     a:hover { color: #333; text-decoration: none; } 
    </style> 
</head> 
<body> 
<div id="article"> 
<h1>Please try again</h1> 
<div> 
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br /> 
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p> 
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p> 
</div> 
</div> 
</body> 
</html> 

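I can access the URL through a browser on the same computer, so the server is definitely not blocking my IP. I can't figure out what is wrong with my code.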

Answers

2

You need to obtain some cookies first before you can access that URL.
Although this can be done with urllib2 and CookieJar, I recommend requests:

import requests 
from BeautifulSoup import BeautifulSoup 

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1' 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
ses = requests.Session() 
ses.get(url1) 
soup = BeautifulSoup(ses.get(url).content) 
print soup.prettify() 
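
To confirm that the cookie step actually worked, one option is to check whether the response is still the "Please try again" page shown in the question. The following is a minimal sketch, assuming Python 3 and only the standard library; the names `is_retry_page` and `_HeadingCollector` are hypothetical helpers, not part of requests or BeautifulSoup, and the check relies only on the `<h1>` text visible in the output above.

```python
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings[-1] += data


def is_retry_page(html):
    """Return True if the HTML looks like the 'Please try again' error page."""
    collector = _HeadingCollector()
    collector.feed(html)
    return any(h.strip().lower() == "please try again" for h in collector.headings)
```

If `is_retry_page` returns True for the decoded response body, the server most likely did not see a valid session cookie and the landing page should be requested again first.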

Note that requests is not part of the standard library, so you will have to install it. If you would rather use urllib2:

import urllib2 
from cookielib import CookieJar 
from BeautifulSoup import BeautifulSoup 

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1' 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
cj = CookieJar() 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 
opener.open(url1) 
soup = BeautifulSoup(opener.open(url).read()) 
print soup.prettify() 
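
The urllib2 and cookielib modules exist only in Python 2. As a rough sketch of the same idea on Python 3, assuming the standard-library modules `urllib.request` and `http.cookiejar`, the cookie-aware opener could be built as shown here. The function names `build_cookie_opener` and `fetch_with_session` and the `timeout` default are illustrative assumptions, not part of the original answer.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_session(landing_url, target_url, timeout=30):
    """Visit landing_url first so cookies get set, then return target_url's body as bytes."""
    opener, _jar = build_cookie_opener()
    # First request: the body is not needed, only the Set-Cookie headers,
    # which HTTPCookieProcessor records in the jar.
    with opener.open(landing_url, timeout=timeout):
        pass
    # Second request: the same opener sends the stored cookies back.
    with opener.open(target_url, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_with_session(url1, url)` with the two URLs from the answer would mirror the urllib2 version; the returned bytes can then be passed to BeautifulSoup. Whether the site still responds this way is not something this sketch can guarantee.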
+0

Shouldn't 'from BeautifulSoup import BeautifulSoup' be 'from bs4 import BeautifulSoup'? –

+1

@MD. Khairul Basar Yes, that is how I usually import it, but this works too. –

+0

Why are you importing cookies? The other examples I have tried never needed cookies. – Zanam