2013-04-21 73 views
-1

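How do I extract information from a column in a table on a website?

This is as far as I have managed to get. I want to get the proxies.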

import urllib.request 

page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm") 

page('\d+\.\d+\.\d+\.\d+') 
+0

http://docs.python.org/2/library/xml.dom.html – 2013-04-21 14:07:21

Answers

6

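In this case the table is not a real HTML table; it is plain text wrapped in <pre></pre>. You can verify this by viewing the page source. Either way, with BeautifulSoup it's a walk in the park: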

In [1]: from bs4 import BeautifulSoup 

In [2]: from urllib.request import urlopen 

In [3]: bs = BeautifulSoup(urlopen('http://www.samair.ru/proxy/ip-address-01.htm')) 

In [4]: print(bs.find('pre').text) 

IP address    Anonymity level Checked time  Country 
056.249.66.50:8080  transparent  Apr-21, 10:33  Bulgaria 
1.63.18.22:8080   transparent  Apr-21, 05:56  China 
1.9.75.8:8080   transparent  Apr-21, 12:58  Malaysia 
103.247.219.165:8080  transparent  Apr-21, 04:01  Indonesia 
103.4.165.190:80   transparent  Apr-21, 11:34  Indonesia 
103.9.126.110:8080  transparent  Apr-21, 12:19  Indonesia 
109.173.98.64:8080  transparent  Apr-20, 22:39  Russian Federation 
109.197.194.142:8080  transparent  Apr-21, 12:07  Russian Federation 
109.207.61.141:8090  transparent  Apr-21, 11:14  Poland 
109.207.61.145:8090  transparent  Apr-21, 13:04  Poland 
109.207.61.149:8090  transparent  Apr-21, 10:21  Poland 
109.207.61.165:8090  transparent  Apr-21, 03:57  Poland 
109.207.61.170:8090  transparent  Apr-21, 11:02  Poland 
109.207.61.208:8090  transparent  Apr-21, 10:45  Poland 
109.224.55.46:80   transparent  Apr-20, 21:50  Iraq 
109.227.124.105:8080  transparent  Apr-21, 09:57  Ukraine 
109.69.6.118:8080  transparent  Apr-21, 11:44  Albania 
110.138.248.135:8080  transparent  Apr-21, 09:10  Indonesia 
110.139.13.121:8080  transparent  Apr-21, 11:31  Indonesia 
110.159.179.108:80  transparent  Apr-20, 20:35  Malaysia 

In [5]: [l.split()[0] for l in bs.find('pre').text.split('\n')[1:]][1:] 
Out[5]: 
['056.249.66.50:8080', 
'1.63.18.22:8080', 
'1.9.75.8:8080', 
'103.247.219.165:8080', 
'103.4.165.190:80', 
'103.9.126.110:8080', 
'109.173.98.64:8080', 
'109.197.194.142:8080', 
'109.207.61.141:8090', 
'109.207.61.145:8090', 
'109.207.61.149:8090', 
'109.207.61.165:8090', 
'109.207.61.170:8090', 
'109.207.61.208:8090', 
'109.224.55.46:80', 
'109.227.124.105:8080', 
'109.69.6.118:8080', 
'110.138.248.135:8080', 
'110.139.13.121:8080', 
'110.159.179.108:80'] 
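If installing BeautifulSoup is not an option, the regular-expression idea from the question can also work, as long as the pattern is applied to the decoded page text rather than called on the response object. Below is a minimal sketch using only the standard library. `SAMPLE_HTML`, `PROXY_RE`, and `extract_proxies` are hypothetical names introduced for illustration, and the sample string is a stand-in for the real page body (which you would get from `urlopen(...).read().decode(...)`, with an encoding you would need to check yourself).

```python
import re

# Hypothetical sample standing in for the decoded page body.
# Assumption: the real page keeps the same "ip:port ..." layout inside <pre>.
SAMPLE_HTML = """<html><body><pre>
IP address    Anonymity level Checked time  Country
1.63.18.22:8080   transparent  Apr-21, 05:56  China
1.9.75.8:8080   transparent  Apr-21, 12:58  Malaysia
103.4.165.190:80   transparent  Apr-21, 11:34  Indonesia
</pre></body></html>"""

# Four dot-separated groups of digits, then a colon and a port number.
# The group is non-capturing so findall() returns the whole match.
PROXY_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}\b")


def extract_proxies(html_text):
    """Return every 'ip:port' string found in the text, in document order."""
    return PROXY_RE.findall(html_text)


proxies = extract_proxies(SAMPLE_HTML)
print(proxies)
```

The pattern only checks the shape of the address, not that each number is in the 0-255 range, so treat the result as candidates rather than validated addresses.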
+0

Traceback (most recent call last): File "C:\Users\特立\Desktop\ss.py", line 1, in <module> from bs4 import BeautifulSoup ImportError: No module named 'bs4' – user1567728 2013-04-24 22:24:29

+0

What **version** are you using? – user1567728 2013-04-24 22:30:14

+0

@user1567728 The latest one, but any version will work. Install it as described [here](http://www.crummy.com/software/BeautifulSoup/#Download). – 2013-04-25 08:33:13