2012-09-06 60 views
8

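I cannot open one particular URL with urllib2. The same approach works for other sites such as "http://www.google.com", but not for this one (which displays fine in a browser). In short, urllib2 returns 404 for a website that displays fine in browsers.

My simple code: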

from BeautifulSoup import BeautifulSoup 
import urllib2 

url="http://www.experts.scival.com/einstein/" 
response=urllib2.urlopen(url) 
html=response.read() 
soup=BeautifulSoup(html) 
print soup 

Can anyone help me make this work?

This is the error I get:

Traceback (most recent call last): 
    File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module> 
    response=urllib2.urlopen(url); 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen 
    return _opener.open(url, data, timeout) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open 
    response = meth(req, response) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error 
    result = self._call_chain(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain 
    result = func(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302 
    return self.parent.open(new, timeout=req.timeout) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open 
    response = meth(req, response) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error 
    return self._call_chain(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain 
    result = func(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default 
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) 
urllib2.HTTPError: HTTP Error 404: Not Found 

Thanks

+1

What is your error? –

+3

Stop adding semicolons at the end of lines. This is Python. – FogleBird

+0

My error was about GET parameters, but I don't think that is your problem –

Answers

8

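I just tried this and got a 404 code and the page back.

My guess is that the site is doing user-agent detection and, whether by accident or on purpose, is not serving content to Python's urllib.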

To clarify: with urllib, urlopen returns a response object carrying the 404 code and the HTML content. With urllib2.urlopen, a urllib2.HTTPError exception is raised instead.
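A rough sketch of that difference, under a few assumptions: the HTTPError raised by urllib2.urlopen is itself a file-like response, so the 404 page can still be read from it. The helper name fetch_even_on_error and its opener parameter are hypothetical, added only for illustration. The import fallback is there so the snippet also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same names


def fetch_even_on_error(url, opener=None):
    """Return (status_code, body), even when the server answers with an error.

    `opener` is a hypothetical hook for swapping in another callable;
    by default urllib2.urlopen is used.
    """
    opener = opener or urllib2.urlopen
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except urllib2.HTTPError as e:
        # HTTPError doubles as a response object: it has a code and a body.
        return e.code, e.read()
```

Calling fetch_even_on_error("http://www.experts.scival.com/einstein/") should then hand back the status code together with whatever HTML the server sent, instead of stopping at the traceback.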

I suggest you try setting your user agent to something that looks like a browser. There is a question about that here: Changing user agent on urllib2.urlopen
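A minimal sketch of that suggestion, not a confirmed fix for this site: pass a browser-like User-Agent header through urllib2.Request. The value "Mozilla/5.0" is only a placeholder, and the names BROWSER_USER_AGENT, build_browser_request and fetch_as_browser are made up for this example. The import fallback lets the same code run on Python 3.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same names

# Placeholder value (an assumption); any browser-like string could be tried.
BROWSER_USER_AGENT = "Mozilla/5.0"


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request that identifies itself with a browser-like User-Agent."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})


def fetch_as_browser(url):
    """Open the URL with the browser-like request and return the raw body."""
    return urllib2.urlopen(build_browser_request(url)).read()
```

In the question's script, html = fetch_as_browser(url) would take the place of the urlopen and read lines, and the result can be passed to BeautifulSoup as before. If the server still answers 404, user-agent detection was probably not the cause.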

+0

That was my guess too, you beat me to it. – FogleBird

0

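Hm... are you sure this URL is valid? Try "http://www.google.com"; I have similar code and had no problems with urllib. Or you can use a try/except statement to see the details of the error. Of course, MattH's answer is probably very close to the truth :)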

3

You can use try/except to catch the error:

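import urllib2

req = urllib2.Request("http://www.experts.scival.com/einstein/")
try:
    u = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # e.code is the HTTP status (404 here); e.msg is the reason phrase
    print e.code
    print e.msg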