lxml: split on attributes?

I'm using lxml to scrape some HTML that looks like this:

<div align=center><a style="font-size: 1.1em">Football</a></div> 
<a href="">Team A</a> 
<a href="">Team B</a> 
<div align=center><a style="font-size: 1.1em">Baseball</a></div> 
<a href="">Team C</a> 
<a href="">Team D</a>

How can I end up with the data in this form?

[ {'category': 'Football', 'title': 'Team A'}, 
{'category': 'Football', 'title': 'Team B'}, 
{'category': 'Baseball', 'title': 'Team C'}, 
{'category': 'Baseball', 'title': 'Team D'}]

So far I have got this:

results = [] 
for (i,a) in enumerate(content[0].xpath('./a')): 
    data['text'] = a.text 
    results.append(data)

But I don't know how to get the category name by splitting on the font-size elements while keeping the sibling tags. Any suggestions?

Thanks!

2011-06-13 Richard

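I'm not sure which data you are missing. The results look fine to me. – miku 2011-06-13 12:46:38

It's missing the category: Football or Baseball. – Richard 2011-06-13 12:49:16

Sorry, I missed the *How can I end up with the data in the form ...* part. – miku 2011-06-13 12:50:29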

I had success with the following code:

#!/usr/bin/env python 

snippet = """ 
<html><head></head><body> 
<div align=center><a style="font-size: 1.1em">Football</a></div> 
<a href="">Team A</a> 
<a href="">Team B</a> 
<div align=center><a style="font-size: 1.1em">Baseball</a></div> 
<a href="">Team C</a> 
<a href="">Team D</a> 
</body></html> 
""" 

import lxml.html 

html = lxml.html.fromstring(snippet) 
body = html[1] 

results = [] 
current_category = None 

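# Walk the direct children of <body> in document order.
for element in body.xpath('./*'):
    if element.tag == 'div':
        # A <div> starts a new category; its name is the text of the inner <a>.
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        # A bare <a> is a team that belongs to the most recent category.
        results.append({'category': current_category,
                        'title': element.text})

print(results)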

It prints:

[{'category': 'Football', 'title': 'Team A'}, 
{'category': 'Football', 'title': 'Team B'}, 
{'category': 'Baseball', 'title': 'Team C'}, 
{'category': 'Baseball', 'title': 'Team D'}]

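Scraping is fragile. Here, for example, we rely explicitly on the order of the elements as well as on their nesting. Still, sometimes a hard-wired approach like this is good enough.

Here is another (more XPath-oriented) way, using the preceding-sibling axis: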

#!/usr/bin/env python 

snippet = """ 
<html><head></head><body> 
<div align=center><a style="font-size: 1.1em">Football</a></div> 
<a href="">Team A</a> 
<a href="">Team B</a> 
<div align=center><a style="font-size: 1.1em">Baseball</a></div> 
<a href="">Team C</a> 
<a href="">Team D</a> 
</body></html> 
""" 

import lxml.html 

html = lxml.html.fromstring(snippet) 
body = html[1] 

results = [] 

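for e in body.xpath('./a'):
    # The XPath result list is in document order, so [-1] is the
    # closest category <div> that precedes this link.
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))

print(results)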
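
As a variation on the preceding-sibling idea, the nearest heading can also be picked inside the XPath expression itself: on a reverse axis such as preceding-sibling, the positional predicate `[1]` counts backwards from the context node. The sketch below assumes the same flat layout as the question; the function name `categorize` and the extra "Orphan" link are made up for illustration, and links that appear before any category get `None` instead of raising an IndexError.

```python
import lxml.html

snippet = """
<html><head></head><body>
<a href="">Orphan</a>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
</body></html>
"""


def categorize(html_text):
    """Pair every top-level <a> with the nearest preceding category <div>."""
    body = lxml.html.fromstring(html_text).find('body')
    results = []
    for link in body.xpath('./a'):
        # [1] on the reverse axis selects the closest preceding <div>.
        heading = link.xpath('preceding-sibling::div[1]/a/text()')
        results.append({
            # Hypothetical fallback: None when no category precedes the link.
            'category': heading[0] if heading else None,
            'title': link.text,
        })
    return results


print(categorize(snippet))
```

Whether `None` is the right fallback depends on the real page; skipping such links would be an equally reasonable choice.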

2011-06-13 13:04:25 miku

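Genius, thanks. Yes, on my actual page the sibling approach works better! – Richard 2011-06-13 15:52:19

I now realise my mistake: trying to work out what to do from the lxml docs rather than the XPath docs! – Richard 2011-06-13 15:52:41

Also, if you are looking for another way to do this (just an option, don't beat me up too much), or you are not able to import lxml, you can use the following weird code: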

text = """ 
      <a href="">Team YYY</a> 
      <div align=center><a style="font-size: 1.1em">Polo</a></div> 
      <div align=center><a style="font-size: 1.1em">Football</a></div> 
      <a href="">Team A</a> 
      <a href="">Team B</a> 
      <div align=center><a style="font-size: 1.1em">Baseball</a></div> 
      <a href="">Team C</a> 
      <a href="">Team D</a> 
      <a href="">Team X</a> 
      <div align=center><a style="font-size: 1.1em">Tennis</a></div> 
     """ 
# next variables could be modified depending on what you really need   
keyStartsWith = '<div align=center><a style="font-size: 1.1em">' 
categoryStart = len(keyStartsWith) 
categoryEnd = -len('</a></div>') 
output = [] 
data = text.split('\n')  
titleStart = len('<a href="">') 
titleEnd = -len('</a>') 

getdict = lambda category, title: {'category': category, 'title': title} 

# main loop 
for i, line in enumerate(data):
    line = line.strip()
    if keyStartsWith in line and len(data)-1 >= i+1:
        category = line[categoryStart: categoryEnd]
        (len(data)-1 == i and output.append(getdict(category, '')))
        if i+1 < len(data)-1 and keyStartsWith in data[i+1]:
            # next line is another category: record this one with an empty title
            output.append(getdict(category, ''))
        else:
            # consume the following title lines until the next category
            while i+1 < len(data)-1 and keyStartsWith not in data[i+1]:
                title = data[i+1].strip()[titleStart: titleEnd]
                output.append(getdict(category, title))
                i += 1

2011-06-13 14:00:28

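No offense, this may be correct, but it is too complicated. – miku 2011-06-13 14:02:55

@miku - yes, I know, your solution is simpler, which is why I upvoted it. I just put mine here as an option for those who, for whatever local reason, can't use your solution. – 2011-06-13 14:07:30

Sure, I won't downvote. But in general, if you are trying to do something like parsing HTML, you should use a dedicated library. People have even tried to parse HTML with regular expressions, and then funny things happen, see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – miku 2011-06-13 14:10:55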
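
For the case where lxml really cannot be imported, Python's standard library still ships an HTML parser, so string slicing is not the only fallback. The following is only a minimal sketch under the assumption that category names are links inside a `<div>` and team names are links outside any `<div>`, as in the question; the names `TeamParser` and `parse_teams` are invented for this example. Unlike the string-slicing code above, it does not emit empty-title entries for categories without teams.

```python
from html.parser import HTMLParser


class TeamParser(HTMLParser):
    """Collect {'category', 'title'} dicts from the flat div/a layout."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.category = None
        self.div_depth = 0      # greater than 0 while inside a <div>
        self.in_link = False
        self.buffer = []        # text pieces of the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.div_depth += 1
        elif tag == 'a':
            self.in_link = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.div_depth = max(0, self.div_depth - 1)
        elif tag == 'a' and self.in_link:
            text = ''.join(self.buffer).strip()
            if self.div_depth:
                # A link inside a <div> names the current category.
                self.category = text
            else:
                # A link outside any <div> is a team title.
                self.results.append({'category': self.category, 'title': text})
            self.in_link = False


def parse_teams(html_text):
    parser = TeamParser()
    parser.feed(html_text)
    parser.close()
    return parser.results


sample = """
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
"""
print(parse_teams(sample))
```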
