Python regex problem

I'm trying to get proxies from a site with Python, using urllib to fetch each page and a regular expression to find the proxies.

A proxy on the page looks like this:

<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">190.207.169.184</a></td><td>8080</td><td>

My code looks like this:

for site in sites:
    content = urllib.urlopen(site).read()
    e = re.findall("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+", content)
    #\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+

    for proxy in e:
        s.append(proxy)
        amount += 1

The regex:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+

I know the rest of the code works, but the regex is wrong.

Any ideas on how to fix this? Edit: http://www.regexr.com/ seems to think my regex is fine?

Source

2014-10-04 Cephon

Look into 'lxml' or 'beautifulsoup'. Parsing HTML with a regex is a hack. – 2014-10-04 18:08:18

Don't escape '<', '>' or 'a': http://regex101.com/r/xB5sT0/2 – 2014-10-04 18:10:17

See http://stackoverflow.com/questions/26183643/find-specific-text-in-beautifulsoup/26183877#26183877 – 2014-10-04 18:11:55
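To illustrate the comment about not escaping `<`, `>` and `a`: in Python, `\a` stands for the BEL control character (0x07), both in string literals and in `re` patterns, so the question's pattern looks for BEL where the letter `a` of `</a>` should be. Below is a minimal Python 3 sketch that uses the HTML line quoted in the question as sample input. The names `html`, `broken` and `fixed`, and the capture groups, are my own additions and not part of the thread.

```python
import re

# Sample input: the HTML fragment quoted in the question.
html = ('<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">'
        '190.207.169.184</a></td><td>8080</td><td>')

# The question's pattern. "\a" means BEL (0x07), not the letter "a",
# so "\<\/\a\>" can never match the literal text "</a>".
broken = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+"

# Same pattern without the unnecessary escapes, plus two capture groups
# (an addition for illustration) so the IP and the port come back separately.
fixed = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</a></td><td>(\d+)"

broken_matches = re.findall(broken, html)
fixed_matches = re.findall(fixed, html)

print(broken_matches)
print(fixed_matches)

# Build "ip:port" strings, the format hinted at by the commented-out regex.
proxies = ["%s:%s" % (ip, port) for ip, port in fixed_matches]
print(proxies)
```

One possible reason regexr.com accepted the original pattern is that JavaScript's regex flavor treats `\a` as a plain `a`; I have not checked this against the 2014 version of the site.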

One option is to use an HTML parser to find the IP addresses and ports.

Example (using the BeautifulSoup HTML parser):

import re 
import urllib2 
from bs4 import BeautifulSoup 

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers') 

IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}') 
PORT_RE = re.compile(r'\d+') 

soup = BeautifulSoup(data) 
for ip in soup.find_all('a', text=IP_RE): 
    port = ip.parent.find_next_sibling('td', text=PORT_RE) 
    print ip.text, port.text

Prints:

80.193.214.231 3128 
186.88.37.204 8080 
180.254.72.33 80 
201.209.27.119 8080 
...

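The idea here is to find all a tags whose text matches the \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} regex. For each link, find the next td sibling of its parent whose text matches \d+.

Alternatively, since you know the table structure and that it has IP address and port columns, you can get the cell values from each row by index, with no need to dive into regular expressions here: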

import urllib2 
from bs4 import BeautifulSoup 

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers') 

soup = BeautifulSoup(data) 
for row in soup.find_all('tr', id='data'): 
    print [cell.text for cell in row('td')[1:3]]

Prints:

[u'80.193.214.231', u'3128'] 
[u'186.88.37.204', u'8080'] 
[u'180.254.72.33', u'80'] 
[u'201.209.27.119', u'8080'] 
[u'190.204.96.72', u'8080'] 
[u'190.207.169.184', u'8080'] 
[u'79.172.242.188', u'8080'] 
[u'1.168.171.100', u'8088'] 
[u'27.105.26.162', u'9064'] 
[u'190.199.92.174', u'8080'] 
...
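For readers who want to try the row-by-index idea without BeautifulSoup or network access, here is a rough Python 3 sketch that swaps in the standard library's `html.parser` instead of the parser used in this answer. The `SAMPLE` table is made up to mimic the layout the answer relies on (rows with `id="data"`, IP in the second cell, port in the third), and `RowCollector` is a hypothetical helper name; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Made-up sample mimicking the assumed layout: each <tr id="data"> row
# has the IP address in its second cell and the port in its third.
SAMPLE = """
<table>
<tr id="header"><td>#</td><td>IP</td><td>Port</td></tr>
<tr id="data"><td>1</td><td><a href="/ip/80.193.214.231/">80.193.214.231</a></td><td>3128</td><td>HTTP</td></tr>
<tr id="data"><td>2</td><td><a href="/ip/186.88.37.204/">186.88.37.204</a></td><td>8080</td><td>HTTP</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> inside rows whose id is "data"."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell texts per matching row
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("id") == "data":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell of a matching row.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowCollector()
parser.feed(SAMPLE)

# Cells 1 and 2 (zero-based) hold the IP and the port, as in row('td')[1:3].
proxies = [tuple(row[1:3]) for row in parser.rows]
print(proxies)
```

This is more code than the BeautifulSoup version for the same result, which is a fair argument for using the library when it is available.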

Source

2014-10-04 18:17:56 alecxe
