Python Beautiful Soup: how to extract the text next to a tag?

I have the following HTML:

<p> 
<b>Father:</b> Michael Haughton 
<br> 
<b>Mother:</b> Diane 
<br><b>Brother:</b> 
Rashad Haughton<br> 
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>

I need to separate the label from the text, for example Mother: Diane.

So in the end I would have a list of dictionaries:

[{"label":"Mother","value":"Diane"}]

I tried the following, but it does not work:

import requests
from bs4 import BeautifulSoup

# `headers` is defined elsewhere in the script (not shown in the post)

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Source

2017-08-09 Volatil3

from bs4 import BeautifulSoup
from urllib.request import urlopen

#html = '''<p>
#<b>Father:</b> Michael Haughton
#<br>
#<b>Mother:</b> Diane
#<br><b>Brother:</b>
#Rashad Haughton<br>
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''

page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()

soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly (same one as in the question)

needed_p = soup.find_all('p')[8]  # this index is specific to the layout of this page

bs = needed_p.find_all('b')

res = {}

for b in bs:
    sibling = b.next_sibling
    # Plain text right after the label, e.g. "<b>Father:</b> Michael Haughton"
    if isinstance(sibling, str) and sibling.strip():
        res[b.text] = sibling.strip()
    else:
        # Otherwise fall back to the next link, e.g. "<b>Husband:</b> <a>R. Kelly</a>"
        link = b.find_next('a')
        if link is not None:  # avoids AttributeError when there is no link
            res[b.text] = link.text.strip()

res

Output:

{'Brother:': 'Rashad Haughton', 
'Mother:': 'Diane', 
'Husband:': 'R. Kelly', 
'Father:': 'Michael Haughton', 
'Boyfriend:': 'Damon Dash'}

Edit: for the additional information at the top of the page:

... (code above) ...
soup = BeautifulSoup(source, 'lxml')

all_p = soup.find_all('p')
needed_p = all_p[1:4] + [all_p[8]]  # explicitly select the p tags needed for further parsing

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        # Prefer plain text right after the label; otherwise use the next link
        if isinstance(sibling, str) and sibling.strip():
            res[b.text] = sibling.strip()
        else:
            link = b.find_next('a')
            if link is not None:  # avoids AttributeError when there is no link
                res[b.text] = link.text.strip()

res

Output:

{'Race or Ethnicity:': 'Black', 
'Husband:': 'R. Kelly', 
'Died:': '25-Aug', 
'Nationality:': 'United States', 
'Executive summary:': 'R&B singer, died in plane crash', 
'Mother:': 'Diane', 
'Birthplace:': 'Brooklyn, NY', 
'Born:': '16-Jan', 
'Boyfriend:': 'Damon Dash', 
'Sexual orientation:': 'Straight', 
'Occupation:': 'Singer', 
'Cause of death:': 'Accident - Airplane', 
'Brother:': 'Rashad Haughton', 
'Remains:': 'Interred,', 
'Gender:': 'Female', 
'Father:': 'Michael Haughton', 
'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}

For precisely this page, you can also scrape the high school, for example like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
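Selecting paragraphs by position (`[8]`, `[9]`) breaks as soon as the page layout shifts. A minimal sketch of looking a paragraph up by its label instead is shown here; the markup and the helper name `paragraph_for` are hypothetical, since the real page may be structured differently, and the helper only inspects the first `<b>` of each paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page, whose layout may differ.
html = '''<p>Some introduction</p>
<p><b>High School:</b> Example High School (2001)</p>'''
soup = BeautifulSoup(html, 'html.parser')

def paragraph_for(soup, label):
    """Return the first <p> whose first <b> has exactly the text `label`, or None."""
    for p in soup.find_all('p'):
        b = p.find('b')
        if b is not None and b.get_text(strip=True) == label:
            return p
    return None

p = paragraph_for(soup, 'High School:')
# split(':', 1) splits only at the first colon, so colons inside the value survive
high_school = p.get_text().split(':', 1)[1].strip() if p is not None else None
print(high_school)
```

If the label is missing, `paragraph_for` returns None rather than raising an IndexError, so the caller can decide what to do.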

Source

2017-08-09 09:43:53

Would you mind explaining your code?

@Rightleg, what is it that you don't understand?

@DmitriyFialkovskiy When run against url = 'http://www.nndb.com/people/742/000024670/' it gives an error: 'res[b.text] = b.next_sibling.strip().strip('\n') AttributeError: 'NoneType' object has no attribute 'strip'' – Volatil3

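You are looking for the next_sibling attribute of a tag. It gives you either the next NavigableString or the next Tag, depending on which one comes first.

Here is how you can use it: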

from bs4 import BeautifulSoup

html = """..."""
soup = BeautifulSoup(html, 'html.parser')

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)

Output:

Father: 
Michael Haughton 

Mother: 
Diane 

Brother: 

Rashad Haughton 
Husband: 

Boyfriend:

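This looks a bit off. Part of the reason is the newlines and spaces, which you can easily remove with the str.strip method.

Still, the Boyfriend and Husband entries lack a value. This is because next_sibling is either a NavigableString (i.e. a str) or a Tag. The whitespace between the <b> tag and the <a> tag is interpreted here as a non-empty text node: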

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> 
       ^
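To make that whitespace node visible, a small sketch can report the type and content of whatever follows each `<b>` tag. The fragment is a shortened copy of the HTML from the question, and `describe_sibling` is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened fragment from the question; html.parser needs no extra install.
snippet = ('<p><b>Father:</b> Michael Haughton <br>'
           '<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a></p>')
soup = BeautifulSoup(snippet, 'html.parser')

def describe_sibling(tag):
    """Return (type name, text) of the node directly after `tag`."""
    sibling = tag.next_sibling
    return type(sibling).__name__, str(sibling)

siblings = [describe_sibling(b) for b in soup.find_all('b')]
for kind, text in siblings:
    # repr() makes a lone space visible in the printed output
    print(kind, repr(text))
```

Both siblings turn out to be strings: the first carries the value itself, the second is only the single space in front of the link.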

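If that whitespace were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag. Since it is present, you have to check:

whether the next sibling is a string or a tag;
if it is a string, whether it contains only whitespace.

If the next sibling is a whitespace-only string, then the information you are looking for is in the next sibling of that NavigableString, which will be an <a> tag.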
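The two checks above can be sketched as a small helper that walks the following siblings and skips whitespace-only strings. This is only one possible way to write it; the name `value_after` and the choice to stop at a `<br>` are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = '''<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''
soup = BeautifulSoup(html, 'html.parser')

def value_after(label_tag):
    """Return the text of the first non-blank node after `label_tag`, or None."""
    for node in label_tag.next_siblings:
        if isinstance(node, NavigableString):
            if node.strip():              # a string with real content
                return node.strip()
            continue                      # whitespace only: try the next sibling
        if isinstance(node, Tag):
            if node.name == 'br':         # the entry ended without a value
                return None
            return node.get_text(strip=True)
    return None

values = {b.get_text(strip=True): value_after(b) for b in soup.find_all('b')}
print(values)
```

Because the loop uses `next_siblings`, a label with nothing after it simply yields None instead of raising an AttributeError.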

Edited code:

import bs4

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace only: the value is in the node after it (an <a> tag here)
                following = nextSibling.next_sibling
                if following is not None and following.string is not None:
                    print(following.string.strip())
            else:
                print(nextSibling.strip())

        elif isinstance(nextSibling, bs4.Tag):
            print(nextSibling.string)

Output:

Father: 
Michael Haughton 
Mother: 
Diane 
Brother: 
Rashad Haughton 
Husband: 
R. Kelly 
Boyfriend: 
Damon Dash

Now you can easily build your dictionary:

import bs4

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    # get_text() always returns a str, whereas .string can be None
    key = it_tag.get_text().replace(':', '').strip()
    value = None

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace only: the value is in the node after it
                following = nextSibling.next_sibling
                if following is not None and following.string is not None:
                    value = following.string.strip()
            else:
                value = nextSibling.strip()

        elif isinstance(nextSibling, bs4.Tag):
            value = nextSibling.string

    entries[key] = value

Output dictionary:

{'Father': 'Michael Haughton', 
'Mother': 'Diane', 
'Brother': 'Rashad Haughton', 
'Husband': 'R. Kelly', 
'Boyfriend': 'Damon Dash'}
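The question asked for a list of dictionaries with "label" and "value" keys, and some values continue after the link, such as "(m. 1994, annulled that same year)". A possible variation, sketched here under the assumption that each entry ends at the next <br> or <b> tag, collects every sibling up to that point; the function name `label_value_pairs` is made up for this example.

```python
from bs4 import BeautifulSoup, Tag

html = '''<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''
soup = BeautifulSoup(html, 'html.parser')

def label_value_pairs(paragraph):
    """Build [{'label': ..., 'value': ...}] from '<b>label</b> value <br>' runs."""
    pairs = []
    for b in paragraph.find_all('b'):
        parts = []
        for node in b.next_siblings:
            if isinstance(node, Tag) and node.name in ('br', 'b'):
                break                     # the entry ends at the next <br> or label
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        label = b.get_text(strip=True).rstrip(':')
        value = ' '.join(''.join(parts).split())   # collapse runs of whitespace
        pairs.append({'label': label, 'value': value})
    return pairs

pairs = label_value_pairs(soup.p)
for pair in pairs:
    print(pair)
```

Unlike the dictionary above, this keeps the trailing remark as part of the value; whether that is wanted depends on the use case.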

Source

2017-08-09 09:43:46

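I get the error: 'line 27, in parse: if it_tag.next_sibling.isspace(): AttributeError: 'NoneType' object has no attribute 'isspace'' – Volatil3

@Volatil3 I edited my code. Please check whether it works for you. I added a None check and condensed the tests.

'key = it_tag.string.replace(':', '') AttributeError: 'NoneType' object has no attribute 'replace'' – Volatil3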
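One possible cause of the AttributeError quoted above is that `.string` returns None whenever a tag is empty or has more than one child, which can happen on the full page even though it does not in the fragment. The sketch below uses hypothetical markup to show the difference and falls back to `get_text()`, which always returns a str; `safe_key` is an illustrative name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a label with nested markup, an empty label, a plain label.
soup = BeautifulSoup('<p><b><i>Born</i>:</b><b></b><b>Mother:</b></p>', 'html.parser')

def safe_key(tag):
    """Label text without the trailing colon; works for empty or nested tags."""
    return tag.get_text(strip=True).rstrip(':')

string_values = [b.string for b in soup.find_all('b')]
keys = [safe_key(b) for b in soup.find_all('b')]
print(string_values)
print(keys)
```

Entries whose key comes back empty can then be skipped before they are stored.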
