2016-12-04 79 views
2

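How do I use BeautifulSoup to select tags based on their children and siblings?

I'm trying to extract quotes from the 2012 Obama-Romney presidential debates. The problem is that the site is poorly organized, so the structure looks like this: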

<span class="displaytext"> 
    <p> 
     <i>OBAMA</i>Obama's first quotes 
    </p> 
    <p>More quotes from Obama</p> 
    <p>Some more Obama quotes</p> 

    <p> 
     <i>Moderator</i>Moderator's quotes 
    </p> 
    <p>Some more quotes</p> 

    <p> 
     <i>ROMNEY</i>Romney's quotes 
    </p> 
    <p>More quotes from Romney</p> 
    <p>Some more Romney quotes</p> 
</span> 

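Is there a way to select a <p> whose first child is an <i> with the text OBAMA, together with all of its following <p> siblings, stopping at the next <p> whose first child is an <i> whose text is not OBAMA?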

Here is what I have tried so far, but it only grabs the first <p> and ignores its siblings:

input = '''<span class="displaytext"> 
     <p> 
      <i>OBAMA</i>Obama's first quotes 
     </p> 
     <p>More quotes from Obama</p> 
     <p>Some more Obama quotes</p> 

     <p> 
      <i>Moderator</i>Moderator's quotes 
     </p> 
     <p>Some more quotes</p> 

     <p> 
      <i>ROMNEY</i>Romney's quotes 
     </p> 
     <p>More quotes from Romney</p> 
     <p>Some more Romney quotes</p> 
     </span>''' 

from bs4 import BeautifulSoup

soup = BeautifulSoup(input, "html.parser")
debate_text = soup.find("span", {"class": "displaytext"})
president_quotes = debate_text.find_all("i", text="OBAMA")

for i in president_quotes:
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)

This only prints Obama's first quotes.

Answers

2

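I think a finite-state-machine style solution would work here: walk through the <p> tags in order and keep track of whether Obama is the current speaker. Something like this: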

from bs4 import BeautifulSoup

soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", {"class": "displaytext"})
obama_is_on = False  # state: is Obama the current speaker?
obama_tags = []
for p in debate_text("p"):
    if p.i:
        # assuming <i> is used only to indicate the speaker,
        # every <i> switches the state on or off
        obama_is_on = p.i.get_text(strip=True) == 'OBAMA'
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)

[<p> 
<i>OBAMA</i>Obama's first quotes 
     </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>] 
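
The same state-machine idea can be generalized to collect every speaker's paragraphs in one pass. What follows is only a sketch under a few assumptions: the helper name `quotes_by_speaker` is made up for illustration, the markup is a compacted copy of the sample from the question, `<i>` appears only as a speaker label, and `html.parser` is used so that nothing beyond bs4 is required.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Compacted copy of the sample markup from the question.
html = """<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p>Some more quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>"""


def quotes_by_speaker(markup):
    """Hypothetical helper: map each speaker label to their paragraph texts."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("span", class_="displaytext")
    grouped = defaultdict(list)
    speaker = None  # state: whoever spoke last
    for p in container.find_all("p"):
        label = p.find("i")
        if label is not None:
            # Assumption: an <i> always marks a change of speaker.
            speaker = label.get_text(strip=True)
            label.extract()  # remove the label so only the quote remains
        if speaker is not None:
            grouped[speaker].append(p.get_text(strip=True))
    return dict(grouped)


result = quotes_by_speaker(html)
print(result["OBAMA"])
```

Keeping the current speaker in a single variable means a speaker who takes several turns ends up with all of their paragraphs under one key, in document order.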
2

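The other Obama quotes are siblings of the <p>, not of the <i>, so you need to look at the siblings of the <i>'s parent. While looping through those siblings, you can stop as soon as one of them contains an <i>. Something like this: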

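for i in president_quotes:
    # the text right after the <i> sits inside the same <p>
    print(i.next_sibling)
    # the remaining quotes are siblings of the enclosing <p>
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        if sibling.find("i"):
            break  # reached the next speaker's label
        print(sibling.string)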

It prints:

Obama's first quotes 

More quotes from Obama 
Some more Obama quotes
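
For reuse, the parent-sibling walk can be wrapped in a function that returns the text instead of printing it. This is a rough sketch, not a definitive implementation: the name `speaker_paragraphs` is hypothetical, the markup is a compacted copy of the question's sample, and it again assumes that `<i>` is used only for speaker labels.

```python
from bs4 import BeautifulSoup

# Compacted copy of the sample markup from the question.
html = """<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p>Some more quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>"""


def speaker_paragraphs(markup, speaker):
    """Hypothetical helper: collect the quote texts for one speaker."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for label in soup.find_all("i", string=speaker):
        # Text that follows the <i> label inside the same <p>.
        first = label.next_sibling
        if isinstance(first, str):
            texts.append(first.strip())
        # Later quotes are siblings of the enclosing <p>, not of the <i>.
        for sibling in label.parent.find_next_siblings("p"):
            if sibling.find("i"):
                break  # the next speaker starts here
            texts.append(sibling.get_text(strip=True))
    return texts


romney = speaker_paragraphs(html, "ROMNEY")
print(romney)
```

Because the outer loop visits every matching `<i>`, a speaker who talks more than once contributes one run of paragraphs per turn.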