
How to extract all the link names from an HTML page with NO libraries

I am trying to get all the link titles from a web page. My code is as follows:

url="http://einstein.biz/" 
m = urllib.request.urlopen(url) 
msg = m.read() 
titleregex=re.compile('<a\s*href=[\'|"].*?[\'"].*?>(.+?)</a>') 
titles = titleregex.findall(str(msg)) 
print(titles) 

The titles it prints are:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e', '<img\\n\\t\\tsrc="http://corbisrightsceleb.122.2O7.net/b/ss/corbisrightsceleb/1/H.14--NS/0"\\n\\t\\theight="1" width="1" border="0" alt="" />'] 

This is not ideal. I want only the following:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store'] 

How should I modify my code?

+0

Replace the '(.+?)' in your regex pattern with something like '([\w\s]+)' – kums 2014-10-31 07:18:14
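
For reference, a minimal sketch of kums's suggestion applied to the pattern from the question. Since neither backslash nor '<' is in the [\w\s] class, the escaped Unicode bytes and the embedded <img> tag should no longer match (assuming the page content shown above):

import re
import urllib.request

msg = urllib.request.urlopen("http://einstein.biz/").read()
# '(.+?)' swapped for '([\w\s]+)' as the comment suggests
titleregex = re.compile(r'<a\s*href=[\'|"].*?[\'"].*?>([\w\s]+)</a>')
print(titleregex.findall(str(msg)))
# should print only the six plain-text titles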

+2

It is really hard to parse HTML with regular expressions. Regular expressions (especially Python regular expressions) do not like nested structures. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a great tool for parsing HTML... – 2014-10-31 07:19:41

+2

Obligatory link: [Why you shouldn't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – miles82 2014-10-31 07:37:18

Answers

0

When dealing with HTML or XML files, you should use BeautifulSoup.

>>> url="http://einstein.biz/" 
>>> import urllib.request 
>>> m = urllib.request.urlopen(url) 
>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(m) 
>>> s = soup.find_all('a') 
>>> [i.string for i in s] 
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '日本語', None] 

Update:

>>> import urllib.request 
>>> url="http://einstein.biz/" 
>>> m = urllib.request.urlopen(url) 
>>> msg = m.read() 
>>> regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>') 
>>> titles = regex.findall(str(msg)) 
>>> print(titles) 
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store'] 
+0

Sorry, I forgot to mention: no libraries – 3414314341 2014-10-31 07:54:48

+0

@3414314341 One of the tags contains some Unicode characters. Do you want them? – 2014-10-31 08:06:11

+0

@3414314341 See my update. – 2014-10-31 08:10:21

0

I would definitely consider BeautifulSoup, as @serge mentioned. To make the case more convincing, I have included code that does exactly what you need.

import urllib.request
from bs4 import BeautifulSoup

msg = urllib.request.urlopen("http://einstein.biz/").read()
soup = BeautifulSoup(msg)       # Feed BeautifulSoup your html.
for link in soup.find_all('a'): # Look at all the 'a' tags.
    print(link.string)          # Print out the descriptions.

which returns

Photo Gallery 
Bio 
Quotes 
Links 
Contact 
official store 
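
Note that, as the first answer's output shows, find_all('a') on this page also picks up the 日本語 link and an image-only anchor whose .string is None. A minimal sketch of filtering those out (the ASCII check is my own addition, not part of this answer):

# Keep only anchors that have text and whose text is pure ASCII.
titles = [link.string for link in soup.find_all('a')
          if link.string and all(ord(c) < 128 for c in link.string)]
print(titles)   # should give exactly the six titles the question asks for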
0

I prefer lxml.html over BeautifulSoup; it supports XPath and cssselect.

import requests
import lxml.html

res = requests.get("http://einstein.biz/")
doc = lxml.html.fromstring(res.content)  # parse the raw bytes into an element tree
links = doc.cssselect("a")               # all <a> elements (needs the cssselect package)
for l in links:
    print(l.text)
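
Since the answer mentions XPath, the equivalent query with the same doc object would be a one-liner; //a/text() selects the text node of every anchor directly:

# XPath equivalent of the cssselect loop above
for title in doc.xpath("//a/text()"):
    print(title)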