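Python web scraping: extracting one attribute with multiple tags
I am trying to scrape data from my account on an online bookmarking service. The page with the bookmarks is organized as follows: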
<!DOCTYPE html>
<html lang="en">
<body>
<div id="item1" class="outer_block">
<div class="title">Bookmark 1</div>
<div class="link">
<a href="https://bookmark1.com">https://bookmark1.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag1">tag1</a>
<a href="http://mylink.com/tag2">tag2</a>
</div>
</div>
<div id="item2" class="outer_block">
<div class="title">Bookmark 2</div>
<div class="link">
<a href="https://bookmark2.com">https://bookmark2.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag1">tag1</a>
</div>
</div>
<div id="item3" class="outer_block">
<div class="title">Bookmark 3</div>
<div class="link">
<a href="https://bookmark3.com">https://bookmark3.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag3">tag3</a>
</div>
</div>
</body>
</html>
For each block I want to extract the title, the link, and the tags. In Python 3.5, I do:
# Import modules
import requests
from lxml import html
# Read the html
# url = 'mylink'
# page = requests.get(url)
# tree = html.fromstring(page.content)
# This is the replicable example
tree = html.fromstring('<!DOCTYPE html><html lang="en"><body><div id="item1" class="outer_block"> <div class="title">Item 1</div> <div class="link"> <a href="https://bookmark1.com">https://bookmark1.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> <a href="http://mylink.com/tag2">tag2</a> </div></div><div id="item2" class="outer_block"> <div class="title">Item 2</div> <div class="link"> <a href="https://bookmark2.com">https://bookmark2.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> </div></div><div id="item3" class="outer_block"> <div class="title">Item 3</div> <div class="link"> <a href="https://bookmark3.com">https://bookmark3.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag3">tag3</a> </div></div></body></html>')
I use xpath to extract the strings I need, for example the titles:
titles = tree.xpath('//div[@class="title"]/text()')
print(titles)
['Item 1', 'Item 2', 'Item 3']
To extract the tags, I use the same principle:
tags = tree.xpath('//div[@class="tags"]//a/text()')
print(tags)
['tag1', 'tag2', 'tag1', 'tag3']
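The problem is that each link has a varying number of tags, so I cannot associate the array titles with the array tags. I thought I could extract each block and then work on the blocks independently: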
blocks = tree.xpath('//div[@class="outer_block"]')
block1 = blocks[0]
What I don't understand is that when I extract the tags from block1, I still get the tags of the whole original HTML.
tags_block1 = block1.xpath('//div[@class="tags"]//a/text()')
print(tags_block1)
['tag1', 'tag2', 'tag1', 'tag3']
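How can I extract the titles together with their corresponding tags, what would be the best output format, and is there any other package that would make this job easier?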
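For reference, one likely cause of the behavior described above: in XPath, an expression that starts with // is evaluated from the document root even when it is called on a sub-element, whereas a path starting with ./ or .// is relative to that element. The following is only a minimal sketch under that assumption, not a definitive solution. The variable names page and bookmarks and the list-of-dicts output format are illustrative choices, and the HTML is a trimmed copy of the page shown earlier.

```python
from lxml import html

# Trimmed copy of the page from the question; `page` is a hypothetical name.
page = '''<html><body>
<div id="item1" class="outer_block">
  <div class="title">Bookmark 1</div>
  <div class="link"><a href="https://bookmark1.com">https://bookmark1.com</a></div>
  <div class="tags">
    <a href="http://mylink.com/tag1">tag1</a>
    <a href="http://mylink.com/tag2">tag2</a>
  </div>
</div>
<div id="item2" class="outer_block">
  <div class="title">Bookmark 2</div>
  <div class="link"><a href="https://bookmark2.com">https://bookmark2.com</a></div>
  <div class="tags">
    <a href="http://mylink.com/tag1">tag1</a>
  </div>
</div>
</body></html>'''

tree = html.fromstring(page)

# One dict per bookmark, so each title stays attached to its own tags.
bookmarks = []
for block in tree.xpath('//div[@class="outer_block"]'):
    bookmarks.append({
        # The leading "./" keeps every query relative to the current block.
        'title': block.xpath('string(./div[@class="title"])').strip(),
        'link': block.xpath('string(./div[@class="link"]/a/@href)'),
        'tags': block.xpath('./div[@class="tags"]/a/text()'),
    })

for bookmark in bookmarks:
    print(bookmark)
```

A list of dicts is just one possible output format; it keeps a variable number of tags per bookmark and can be serialized with the standard json module if needed.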