2016-12-29 74 views
1

Web scraping using BeautifulSoup error: [Errno 10061]

I am trying to get this piece of code to work (a web scraping sample using BeautifulSoup):

import urllib2
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = urllib2.urlopen(wiki)  # the URLError is raised here
soup = BeautifulSoup(page, "html.parser")

I get this error:

URLError: <urlopen error [Errno 10061] No connection could be made because the target machine actively refused it> 

I am guessing this is a firewall/security-related issue. Can someone help with what should be done?
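One way to test the firewall guess is to try a plain TCP connection to the host, leaving urllib2 out of the picture. This is only a minimal sketch, assuming Python 3; `can_connect` is a hypothetical helper name, not part of any library.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (Errno 10061 on Windows),
        # timeouts and DNS lookup failures.
        return False
```

Calling `can_connect("en.wikipedia.org", 443)` checks the same host and port the script uses. If it returns False while a browser on the same machine loads the page, the browser is probably going through a proxy that the script is not using.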

+0

I think you need to set up a proxy. –

+0

Check out http://stackoverflow.com/questions/1450132/proxy-with-urllib2 –

+0

Better to use requests. –
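The proxy suggestion from the comments could look roughly like the following. This is a sketch, assuming Python 3, where urllib2 became `urllib.request` (in Python 2, `ProxyHandler` and `build_opener` live in `urllib2` itself). The proxy address `http://proxy.example.com:8080` and the helper name `build_proxy_opener` are placeholders, not values from the question.

```python
import urllib.request

# Placeholder proxy address: replace it with the real one for your network.
PROXY_URL = "http://proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# page = opener.open(wiki)  # then pass the response to BeautifulSoup as before
```

Whether a proxy is needed at all, and at which address, depends on the network the script runs on.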

Answers

1

You can try something like this with requests:

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = requests.get(wiki).content  # raw bytes of the response body
soup = BeautifulSoup(page, "html.parser")

If you want to get the table, you can use pandas like this:

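import pandas as pd

wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
df = pd.read_html(wiki)[1]  # second table parsed from the page
df2 = df.copy()
df2.columns = df.iloc[0]  # use the first row as the column names
df2.drop(0, inplace=True)  # remove that header row from the data
df2.drop('No.', axis=1, inplace=True)  # remove the 'No.' column
df2.head()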

Output:

[image: output of df2.head()]

+0

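I ended up with the same error when I tried the first snippet: ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /wiki/List_of_state_and_union_territory_capitals_in_India (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 10061] No connection could be made because the target machine actively refused it')) – Indi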

+0

@Indi It must have something to do with a proxy. Read this: https://github.com/kennethreitz/requests/issues/2875 – MYGz

+0

Same error, no matter what I try. – Indi
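If a proxy does turn out to be the cause, requests accepts a `proxies` mapping on each call. This is a rough sketch only: the proxy address is a placeholder, and `build_proxies` and `fetch` are hypothetical helper names, not part of requests.

```python
def build_proxies(proxy_url):
    """Return the mapping requests expects for its proxies argument."""
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, proxy_url, timeout=10):
    """Download url through the given proxy and return the body as bytes."""
    # Third-party dependency, imported here so build_proxies works without it.
    import requests

    response = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.content


# Placeholder address: replace with the proxy used on your network.
# page = fetch(wiki, "http://proxy.example.com:8080")
# soup = BeautifulSoup(page, "html.parser")
```

requests also reads the `HTTP_PROXY` and `HTTPS_PROXY` environment variables by default, so setting those is an alternative to passing `proxies` explicitly.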