Python web scraping involving HTML tags with attributes

I am trying to make a web scraper that will parse a web page of publications and extract the authors. The skeletal structure of the page is as follows:

<html> 
<body> 
<div id="container"> 
<div id="contents"> 
<table> 
<tbody> 
<tr> 
<td class="author">####I want whatever is located here ###</td> 
</tr> 
</tbody> 
</table> 
</div> 
</div> 
</body> 
</html>

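So far I have been trying to use BeautifulSoup and lxml to accomplish this task, but I am not sure how to handle the two div tags and the td tag, because they have attributes. In addition, I am not sure whether I should rely more on BeautifulSoup, on lxml, or on a combination of both. What should I do?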

At the moment, my code looks like this:

import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString

address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace('&nbsp', '&#160')
html=html.replace('&iacute','&#237')
root=fromstring(html)

I know that many of the import statements may be redundant, but I just copied whatever I currently had in a larger source file.

EDIT: I suppose I didn't make this quite clear, but I have multiple such tags on the page that I want to scrape.

2009-09-08 GobiasKoffi

It is not clear to me from your question why you need to worry about the div tags. What about doing just:

soup = BeautifulSoup(html)
# find() returns the first <td class="author">, or None if there is none
thetd = soup.find('td', attrs={'class': 'author'})
print(thetd.string)

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

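which appears to be what you want. Maybe you can specify more precisely what you need that this super-simple snippet doesn't do: multiple td tags, all of class author, that you need to consider (all? just some? which ones?), the possibility that no such tag is present (what do you want to do in that case?), and the like. It's hard to infer what exactly your specs are from just this simple example and the overabundant code ;-).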

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:

# findAll() returns every matching tag, in document order
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print(thetd.string)

...i.e., not much harder at all ;-)
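For readers on Python 3, the same idea can be sketched with the bs4 package (Beautiful Soup 4), the successor of the BeautifulSoup module imported in the question. This is only a sketch and not the answer's original code: the sample HTML, the author names, and the helper name extract_authors are invented for illustration. It uses get_text() instead of .string, because .string is None when a cell contains nested markup, and it simply returns an empty list when no matching cell is present.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the author names are invented for illustration.
SAMPLE_HTML = """
<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Ann Lee</td></tr>
<tr><td class="author"><b>Bob</b> Ray</td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table></div></div></body></html>
"""


def extract_authors(html):
    """Return the text of every <td class="author"> cell, in document order."""
    # "html.parser" is the parser from the standard library, so lxml is not needed.
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", attrs={"class": "author"})
    # get_text() concatenates nested text too, e.g. inside <b>...</b>.
    return [cell.get_text().strip() for cell in cells]


for name in extract_authors(SAMPLE_HTML):
    print(name)
```

Returning a list keeps the "no such tag" case simple: the caller gets an empty list and can decide what to do with it.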

2009-09-08 03:01:06

Thanks, Alex. I have multiple authors on the page, so I will have multiple td tags. How do I iterate over each of them? – GobiasKoffi 2009-09-08 03:21:42

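BeautifulSoup is certainly the canonical HTML parser/processor. But if you only need to match this kind of snippet, rather than build a whole hierarchical object representing the HTML, pyparsing makes it easy to define the leading and trailing HTML tags as part of creating a larger search expression: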

from pyparsing import makeHTMLTags, withAttribute, SkipTo 

author_td, end_td = makeHTMLTags("td") 

# only interested in <td>'s where class="author" 
author_td.setParseAction(withAttribute(("class","author"))) 

search = author_td + SkipTo(end_td)("body") + end_td 

for match in search.searchString(html): 
    print(match.body)

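Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:

- caseless matching of tags
- "<tag/>" syntax
- zero or more attributes in the opening tag
- attributes defined in arbitrary order
- attribute names with namespaces
- attribute values in single, double, or no quotes
- intervening whitespace between the tag and symbols, or between attribute name, '=', and value
- attributes accessible after parsing as named results

These are the common pitfalls when considering using a regex for HTML scraping.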
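A few of the points in the list above can be tried out on a small in-memory document. The sketch below reuses the search expression from this answer; the sample markup and author names are invented for illustration. It uses the camelCase pyparsing names from the answer (pyparsing 3 also provides snake_case spellings such as make_html_tags).

```python
from pyparsing import SkipTo, makeHTMLTags, withAttribute

# Hypothetical markup: the same kind of cell written in several styles
# (different case, quote styles, attribute order, spaces around '=').
SAMPLE_HTML = """
<table>
<tr><td class="author">Ann Lee</td></tr>
<tr><TD CLASS='author' width=10>Bob Ray</TD></tr>
<tr><td width="10" class = author>Cy Poe</td></tr>
<tr><td class="title">Not an author</td></tr>
</table>
"""

author_td, end_td = makeHTMLTags("td")

# Only accept <td> tags whose class attribute is exactly "author".
author_td.setParseAction(withAttribute(("class", "author")))

search = author_td + SkipTo(end_td)("body") + end_td

# searchString scans the whole text and returns every match it finds.
authors = [match.body for match in search.searchString(SAMPLE_HTML)]
print(authors)
```

The cell with class="title" is skipped because the withAttribute check rejects it, while the three differently written author cells all match.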

2009-09-08 03:31:52 PaulMcG

Or you could use pyquery, since BeautifulSoup is not actively maintained anymore (as of this writing), see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

First, install pyquery with

pip install pyquery

Then your script could be as simple as

from pyquery import PyQuery
d = PyQuery(url='http://mywebpage/')
# .items() yields PyQuery objects, which have a .text() method
allauthors = [td.text() for td in d('td.author').items()]

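pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google's App Engine as well.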

2010-05-02 07:01:44 captnswing

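The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.

You should let the library handle the XML expertise, such as escaped &entities;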

import lxml.html 

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr> 
      <td class="author">####I want whatever is located here, eh? &iacute; ###</td> 
      </tr></tbody></table></div></div></body></html>""" 

root = lxml.html.fromstring(html) 
tds = root.cssselect("div#contents td.author") 

print(tds)          # gives something like [<Element td at 0x84ee2cc>]
print(tds[0].text)  # what you want, including the 'í'
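As far as I know, current lxml releases need the separate cssselect package for the .cssselect() call above. Where that package is not installed, roughly the same query can be written with lxml's built-in XPath support. This is a sketch under assumptions: the sample markup and author names are invented, and the XPath test matches the class attribute exactly, so a cell with class="author primary" would not be found.

```python
import lxml.html

# Hypothetical sample page; the author names are invented for illustration.
SAMPLE_HTML = """<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Jos&eacute; Garc&iacute;a</td></tr>
<tr><td class="author"><b>Bob</b> Ray</td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(SAMPLE_HTML)

# All <td class="author"> cells anywhere below <div id="contents">.
cells = root.xpath('//div[@id="contents"]//td[@class="author"]')

# text_content() includes text inside nested tags; entities are already decoded.
authors = [cell.text_content().strip() for cell in cells]
print(authors)
```

text_content() is used instead of .text because .text only holds the text before the first child element, which would be empty for the cell that starts with a nested tag.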

2011-05-04 10:51:34
