2017-08-09 91 views
2

Python Beautiful Soup: how do I extract the text next to a tag?

I have the following HTML:

<p> 
<b>Father:</b> Michael Haughton 
<br> 
<b>Mother:</b> Diane 
<br><b>Brother:</b> 
Rashad Haughton<br> 
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p> 

I need to separate each label from its text, for example Mother: and Diane.

So in the end I would have a list of dictionaries:

[{"label":"Mother","value":"Diane"}] 

I tried the following, but it does not work:

import requests
from bs4 import BeautifulSoup

# assumption: the headers dict is not shown in the original post
headers = {'User-Agent': 'Mozilla/5.0'}

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Answers

1
from bs4 import BeautifulSoup 
from urllib.request import urlopen 

#html = '''<p> 
#<b>Father:</b> Michael Haughton 
#<br> 
#<b>Mother:</b> Diane 
#<br><b>Brother:</b> 
#Rashad Haughton<br> 
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>''' 

page = urlopen('http://www.nndb.com/people/742/000024670/') 
source = page.read() 

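soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly

needed_p = soup.find_all('p')[8]  # this index is specific to this page

bs = needed_p.find_all('b')

res = {}

for b in bs:
    sibling = b.next_sibling
    if isinstance(sibling, str) and sibling.strip():
        # plain text directly follows the <b> tag
        res[b.text] = sibling.strip()
    else:
        # otherwise fall back to the next <a> tag, if there is one
        a = b.find_next('a')
        if a is not None:
            res[b.text] = a.text.strip()

res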

Output:

{'Brother:': 'Rashad Haughton', 
'Mother:': 'Diane', 
'Husband:': 'R. Kelly', 
'Father:': 'Michael Haughton', 
'Boyfriend:': 'Damon Dash'} 

Edit: to also pick up the additional information at the top of the page:

... (code above) ... 
soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly

# explicitly select the p-tags needed for further parsing
all_p = soup.find_all('p')
needed_p = all_p[1:4] + [all_p[8]]

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            # plain text directly follows the <b> tag
            res[b.text] = sibling.strip()
        else:
            # otherwise fall back to the next <a> tag, if there is one
            a = b.find_next('a')
            if a is not None:
                res[b.text] = a.text.strip()

res

Output:

{'Race or Ethnicity:': 'Black', 
'Husband:': 'R. Kelly', 
'Died:': '25-Aug', 
'Nationality:': 'United States', 
'Executive summary:': 'R&B singer, died in plane crash', 
'Mother:': 'Diane', 
'Birthplace:': 'Brooklyn, NY', 
'Born:': '16-Jan', 
'Boyfriend:': 'Damon Dash', 
'Sexual orientation:': 'Straight', 
'Occupation:': 'Singer', 
'Cause of death:': 'Accident - Airplane', 
'Brother:': 'Rashad Haughton', 
'Remains:': 'Interred,', 
'Gender:': 'Female', 
'Father:': 'Michael Haughton', 
'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'} 
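
Some values in this output look cut off (for example 'Born:' and 'Remains:'), presumably because only the first text node or the first link after each label is kept. As a rough sketch, assuming every entry ends at the next <br> or <b> tag, all sibling nodes up to that point can be joined instead. The helper name full_value is made up for illustration, and the markup is the snippet from the question rather than the live page:

```python
from bs4 import BeautifulSoup, Tag

# Markup copied from the question; the live page may be laid out differently.
html = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>"""


def full_value(b):
    """Hypothetical helper: join all text after <b> up to the next <br> or <b>."""
    parts = []
    for node in b.next_siblings:
        if isinstance(node, Tag) and node.name in ('br', 'b'):
            break  # assumed end of the current entry
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # collapse newlines and repeated spaces into single spaces
    return ' '.join(''.join(parts).split())


soup = BeautifulSoup(html, 'html.parser')
res = {b.get_text(strip=True): full_value(b) for b in soup.find_all('b')}
print(res['Husband:'])
```

With this approach the Husband entry keeps the trailing note in parentheses instead of only the linked name.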

For precisely this page, you can also scrape the high school, for example like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip() 
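
Positional indexes such as find_all('p')[9] only hold for as long as the page layout stays the same. A possible alternative, sketched here on the question's snippet, is to look up the <b> tag by its label text and read what follows it. The helper value_after_label is a hypothetical name, and the rule that a <br> or <b> ends an entry is an assumption about the markup:

```python
from bs4 import BeautifulSoup, Tag

# Markup copied from the question; the live page may be laid out differently.
html = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br></p>"""


def value_after_label(soup, label):
    """Hypothetical helper: first non-empty text after <b>label</b>, or None."""
    b = soup.find('b', string=label)
    if b is None:
        return None
    for node in b.next_siblings:
        if isinstance(node, Tag):
            if node.name in ('br', 'b'):
                return None  # reached the next entry without finding a value
            text = node.get_text(strip=True)
        else:
            text = str(node).strip()
        if text:
            return text
    return None


soup = BeautifulSoup(html, 'html.parser')
print(value_after_label(soup, 'Mother:'))
print(value_after_label(soup, 'Husband:'))
```

A label that does not occur on the page gives None instead of raising an IndexError or AttributeError.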
+0

Would you mind explaining your code? –

+0

@Rightleg, what is it that you don't understand? –

+0

@DmitriyFialkovskiy When I run it against url = 'http://www.nndb.com/people/742/000024670/', it gives an error: 'res[b.text] = b.next_sibling.strip().strip('\n') AttributeError: 'NoneType' object has no attribute 'strip'' – Volatil3

0

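You are looking for the next_sibling attribute of a tag. It gives you either the next NavigableString or the next Tag, whichever comes first in the document.

Here is how you can use it: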

from bs4 import BeautifulSoup

html = """..."""
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly

bTags = soup.find_all('b') 
for it_tag in bTags: 
    print(it_tag.string) 
    print(it_tag.next_sibling) 

Output:

Father: 
Michael Haughton 

Mother: 
Diane 

Brother: 

Rashad Haughton 
Husband: 

Boyfriend: 

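This looks a bit off. Part of the reason is the newlines and spaces, which you can easily remove with the str.strip method.

Still, the Boyfriend and Husband entries lack a value. This is because next_sibling is either a NavigableString (i.e. a str) or a Tag. The whitespace between the <b> tag and the <a> tag is interpreted here as a text node of its own: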

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> 
       ^

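If that whitespace were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag. Since it is present, you have to check:

  • whether the next sibling is a string or a tag;
  • if it is a string, whether it contains only whitespace.

If the next sibling is a whitespace-only string, then the information you are looking for is in the next sibling of that NavigableString, which will be an <a> tag.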
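
To see the whitespace node in isolation, here is a small sketch that parses the Boyfriend fragment twice, once with and once without the space between the tags. It uses the built-in 'html.parser' so that it runs without lxml; other parsers should treat the space inside the <p> the same way, but that is an assumption:

```python
from bs4 import BeautifulSoup

with_space = BeautifulSoup(
    '<p><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a></p>',
    'html.parser')
without_space = BeautifulSoup(
    '<p><b>Boyfriend:</b><a href="/people/420/000109093/">Damon Dash</a></p>',
    'html.parser')

first = with_space.b.next_sibling      # the whitespace-only string
second = without_space.b.next_sibling  # the <a> tag itself

print(type(first).__name__, repr(str(first)))
print(first.next_sibling.name)  # the <a> tag comes one step later
print(second.name)
```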

Edited code:

import bs4  # needed for bs4.Tag

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    # skip a whitespace-only string and look at the node after it
    if isinstance(nextSibling, str) and nextSibling.isspace():
        nextSibling = nextSibling.next_sibling

    if isinstance(nextSibling, str):
        print(nextSibling.strip())
    elif isinstance(nextSibling, bs4.Tag):
        # get_text() never returns None, unlike .string
        print(nextSibling.get_text(strip=True))

Output:

Father: 
Michael Haughton 
Mother: 
Diane 
Brother: 
Rashad Haughton 
Husband: 
R. Kelly 
Boyfriend: 
Damon Dash 

Now you can easily build your own dictionary:

import bs4  # needed for bs4.Tag

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    # get_text() also works when it_tag.string is None (nested tags)
    key = it_tag.get_text(strip=True).replace(':', '')
    value = None

    nextSibling = it_tag.next_sibling
    # skip a whitespace-only string and look at the node after it
    if isinstance(nextSibling, str) and nextSibling.isspace():
        nextSibling = nextSibling.next_sibling

    if isinstance(nextSibling, str):
        value = nextSibling.strip()
    elif isinstance(nextSibling, bs4.Tag):
        value = nextSibling.get_text(strip=True)

    entries[key] = value

Resulting dictionary:

{'Father': 'Michael Haughton', 
'Mother': 'Diane', 
'Brother': 'Rashad Haughton', 
'Husband': 'R. Kelly', 
'Boyfriend': 'Damon Dash'} 
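
The question asks for a list of dictionaries with "label" and "value" keys rather than a single dictionary. A minimal sketch of that conversion follows. The entries literal repeats the dictionary shown above so that the snippet runs on its own, and rows is just an illustrative name:

```python
# Same content as the dictionary above, repeated so the snippet is self-contained.
entries = {
    'Father': 'Michael Haughton',
    'Mother': 'Diane',
    'Brother': 'Rashad Haughton',
    'Husband': 'R. Kelly',
    'Boyfriend': 'Damon Dash',
}

# One {"label": ..., "value": ...} dictionary per entry, in insertion order.
rows = [{'label': label, 'value': value} for label, value in entries.items()]

for row in rows:
    print(row)
```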
+0

I get the error 'line 27, in parse if it_tag.next_sibling.isspace(): AttributeError: 'NoneType' object has no attribute 'isspace'' – Volatil3

+0

@Volatil3 I edited my code. Please check whether it works for you. I added a None check and condensed the tests. –

+0

'key = it_tag.string.replace(':','') AttributeError: 'NoneType' object has no attribute 'replace'' – Volatil3