2015-08-16 94 views

Working with HTML in Python: how to extract elements between multiple tags

<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2> 

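Problem: I want to extract the h3 tags that sit between each pair of h2 tags, and then extract all the anchor (a) tags that sit between those h3 tags.

What I have: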

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2>""", 'html5lib') 

for row in soup.find_all("h2"): 
    print(row.text) 
    print(row.find_next('h3')) 
    print('################') 

Current result:

################ 
Heading 1 
<h3> Subheading 1.1 </h3> 
################ 
Heading 2 
<h3> Subheading 2.1 </h3> 
################ 
Heading 3 
None 
################ 

Desired result:

################ 
Heading 1 
Subheading 1.1 
Link 1 
Link 2 
Link 3 
-------- 
Subheading 1.2 
Link 1 
Link 2 
Link 3 
Link 4 
-------- 
Subheading 1.3 
Link 1 
################ 
Heading 2 
Subheading 2.1 
Link 1 
Link 2 
-------- 
Subheading 2.2 
Link 1 
Link 2 
-------- 
Subheading 2.3 
Link 1 
################ 

Or something similar.

Answer


This works!

s = """ 

<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2> 

""" 

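from bs4 import BeautifulSoup as bs 

# Name the parser explicitly so the result does not depend on what is installed.
soup = bs(s, 'html.parser') 

for i in soup.find_all('h2'): 
    print(i.text.strip()) 
    # Walk the siblings after this h2 until the next h2 begins.
    for j in i.next_siblings: 
        if j.name == 'h2': 
            break 
        if j.name == 'h3': 
            print('\t' + j.text.strip()) 
            # Print the anchors after this h3, stopping at the next heading.
            for k in j.next_siblings: 
                if k.name in ('h2', 'h3'): 
                    break 
                if k.name == 'a': 
                    print('\t\t' + k.text.strip()) 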
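
If you would rather have the data in a structure than printed, the same grouping can probably be done in a single pass instead of nested sibling walks. The sketch below is only one possible approach, not part of the answer above: the function name group_links and the shortened SAMPLE markup are made up for illustration, and it assumes every link belongs to the most recent h3, which in turn belongs to the most recent h2. It relies on find_all accepting a list of tag names and returning matches in document order.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the question's HTML.
SAMPLE = """
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>
"""


def group_links(markup):
    """Return {h2 text: {h3 text: [link texts]}} in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = {}
    current_h2 = None  # dict of subheadings for the h2 seen most recently
    current_h3 = None  # list of links for the h3 seen most recently
    for tag in soup.find_all(["h2", "h3", "a"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            current_h2 = sections.setdefault(text, {})
            current_h3 = None  # a new h2 closes the previous h3
        elif tag.name == "h3" and current_h2 is not None:
            current_h3 = current_h2.setdefault(text, [])
        elif tag.name == "a" and current_h3 is not None:
            current_h3.append(text)
    return sections


for heading, subheadings in group_links(SAMPLE).items():
    print(heading)
    for subheading, links in subheadings.items():
        print("\t" + subheading)
        for link in links:
            print("\t\t" + link)
```

Note that headings with identical text would be merged into one entry here; if that can happen in your page, a list of tuples would be a safer container than a dict.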