BeautifulSoup: scrape the content of a cell next to another cell

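I want to scrape the content of the cell next to another cell, e.g. the one beside "Staatsform", "Amtssprache", "Postleitzahl" and so on. In the picture, the content I need is always in the right-hand cell.

The basic code is the following, but I am stuck at the last line: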

import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???

Thanks a lot in advance!

2017-06-12 saitam

Please include the HTML fragment that describes the two cells of interest. – DyZ

Do you want only the text in the cell, or more? –

This works most of the time:

def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""

    navigable_strings = soup.find_all(text=text)

    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')

    if len(navigable_strings) == 0:

        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')

        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"

        navigable_strings = soup.find_all(text=altered_text)

    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
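
To try the idea offline, here is a minimal sketch that runs against a small hand-written table instead of the live page. The HTML fragment, its values and the helper name `right_cell_for` are made up for illustration, and the real Wikipedia markup may well differ. It compares the full cell text via `get_text`, so labels wrapped in a link are matched too, and uses `find_next_sibling` to step to the cell on the right.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox; not the real Wikipedia markup.
HTML = """
<table>
  <tr><th>Basisdaten</th></tr>
  <tr><td><a href="/wiki/Amtssprache">Amtssprache</a>:</td><td>Deutsch</td></tr>
  <tr><td>Staatsform:</td><td>parlamentarische Republik</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def right_cell_for(soup, label):
    """Return the text of the cell to the right of the cell labelled `label`.

    Hypothetical helper: the trailing colon is optional on both sides.
    Returns None when no matching label cell with a right neighbour exists.
    """
    wanted = label.rstrip(":")
    for cell in soup.find_all(["td", "th"]):
        # get_text() also sees text nested in links, e.g. <a>Amtssprache</a>:
        if cell.get_text(strip=True).rstrip(":") == wanted:
            neighbour = cell.find_next_sibling(["td", "th"])
            if neighbour is not None:
                return neighbour.get_text(strip=True)
    return None


print(right_cell_for(soup, "Staatsform:"))
print(right_cell_for(soup, "Amtssprache"))
```

Unlike the function above, this variant returns `None` instead of raising when the label is missing, which may or may not be what you want.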

2017-06-12 17:01:12

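I wanted to be careful to limit the search to what the English Wikipedia calls the "infobox". So I first search for the heading 'Basisdaten', requiring that it sit in a th element. That is perhaps not fully reliable, but it makes a false match less likely. Having found it, I walk through the tr elements below 'Basisdaten' until I reach another tr that contains a (presumably different) heading. In this case I search for 'Postleitzahlen:', but the same approach can find any or all of the items between 'Basisdaten' and the next heading.

PS: I should also mention the reason for if not current.name. I noticed that some of the siblings are just newlines, which BeautifulSoup treats as strings. These have no name, so the code has to handle them separately.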

import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')


def getInfoBoxBasisDaten(s):
    # match only the string 'Basisdaten' that sits directly inside a th element
    return str(s) == 'Basisdaten' and s.parent.name == 'th'


basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
# start at the sibling that follows the 'Basisdaten' heading row
current = basisdaten.parent.parent.next_sibling
while current is not None:
    if not current.name:
        # newlines between rows are plain strings without a name; skip them
        current = current.next_sibling
        continue
    if wanted in current.text:
        items = current.find_all('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        # a row containing a th is taken to be the next heading
        break
    current = current.next_sibling

The result is this: two separate td elements, as requested.

<td><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:</td> 
<td>20095–21149,<br/> 
22041–22769,<br/> 
<a href="/wiki/Neuwerk_(Insel)" title="Neuwerk (Insel)">27499</a></td>
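
The manual sibling walk can also be written with `find_next_siblings('tr')`, which returns only tags and therefore skips the nameless newline strings mentioned in the PS. The following is a sketch under assumptions: the table is a simplified, hand-written stand-in (not the real page), the 'Staatsform' value is a placeholder, and `rows_under_heading` is a hypothetical helper name. It collects label/value cells until it reaches the next row that contains a th.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the 'Basisdaten' table; the real markup may differ.
HTML = """<table>
<tr><th colspan="2">Basisdaten</th></tr>
<tr><td>Staatsform:</td><td>example value</td></tr>
<tr><td><a href="#">Postleitzahlen</a>:</td><td>20095–21149,<br/>22041–22769</td></tr>
<tr><th colspan="2">Politik</th></tr>
<tr><td>Regierungschef:</td><td>not part of Basisdaten</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")

# The newline between two rows is a string node, not a tag, so it has no tag name.
heading_row = soup.find("th", string="Basisdaten").parent
first_sibling = heading_row.next_sibling
print(repr(first_sibling), getattr(first_sibling, "name", None))


def rows_under_heading(soup, heading):
    """Map label -> value cell for the rows between `heading` and the next th row."""
    start = soup.find("th", string=heading).parent
    result = {}
    for row in start.find_next_siblings("tr"):  # tags only, strings are skipped
        if row.find("th") is not None:  # next heading row: stop here
            break
        cells = row.find_all("td")
        if len(cells) >= 2:
            label = cells[0].get_text(strip=True).rstrip(":")
            result[label] = cells[1]
    return result


data = rows_under_heading(soup, "Basisdaten")
print(data["Postleitzahlen"].get_text(" ", strip=True))
```

Keeping the value as a tag rather than as text leaves it to the caller whether to print the raw td, as above, or to call `get_text()` on it.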

2017-06-12 18:09:50

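This seems to work for me if I use 'BeautifulSoup.get_text()' to strip out the HTML markup and so on. Unfortunately, I get an error for this page: 'https://de.wikipedia.org/wiki/Bremen'. Do you know what is going on? – saitam

I have just looked at the wiki code of the two pages (in the *Bearbeiten* view). They take completely different approaches to formatting the page, so the HTML differs. My German does not go beyond high-school level. I now see that the Bremen page has an "Infobox" while the Hamburg page does not. This is the same situation as on the English Wikipedia. If you want to scrape both, you need to be able to recognize which format you are dealing with and handle it accordingly. –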
