Difficulty extracting data from an HTML document using BeautifulSoup

I am trying to extract data from a web page and finding it very difficult. I tried soup.get_text(), but it did not help much, because it only gives me single characters rather than the whole string.

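Extracting the name is easy, because you can reach it through the 'b' tag, but extracting the street ("Am Vogelwäldchen 2"), for example, turns out to be quite hard. I could try to assemble the address from individual characters, but that seems overly complicated, and I feel there must be a simpler way to do this. Maybe someone has a better idea. Oh, and never mind the odd function that returns the soup; I left it that way because I was trying out different approaches.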

import urllib.request 
import time 

from bs4 import BeautifulSoup 


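# Sends the HTTP request (a GET here, since no data is attached), parses the response with BeautifulSoup and returns the result
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(requestResult, "html.parser")
    return soup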

def getContactInfoFromPage(page): 
    name = '' 
    straße = '' 
    plz = '' 
    stadt = '' 
    telefon = '' 
    mail = '' 
    url = '' 

    data = [ 
      #'Name', 
      #'Straße', 
      #'PLZ', 
      #'Stadt', 
      #'Telefon', 
      #'E-Mail', 
      #'Homepage' 
      ] 

    request = urllib.request.Request("http://www.altenheim-adressen.de/schnellsuche/" + page) 
    request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8") 
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0") 
    soup = doRequest(request) 

    #Save Name to data structure 
    findeName = soup.find_all('b')
    name = findeName[2] 
    name = name.string.split('>') 

    data.append(name) 


    return soup 


soup = getContactInfoFromPage("suche2.cfm?id=267a0749e983c7edfeef43ef8e1c7422") 

print(soup.getText())


2014-11-23 Fresh Prince

Thanks, I will try it when I get home. – 2014-11-23 18:43:47

You can rely on the field labels and get the text of each label's next sibling.

Wrapping this in a small reusable function makes it more transparent and easier to use:

def get_field_value(soup, field): 
    field_label = soup.find('td', text=field + ':') 
    return field_label.find_next_sibling('td').get_text(strip=True)

Usage:

print(get_field_value(soup, 'Name')) # prints 'AWO-Seniorenzentrum Kenten' 
print(get_field_value(soup, 'Land')) # prints 'Deutschland'
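
For anyone who wants to try the label-then-sibling lookup without hitting the site, here is a minimal sketch. The HTML snippet is made up for illustration (the real page's markup is not shown in this thread), and `get_field_value_or_default` is a hypothetical variant of the function above that returns a default instead of raising when a label is missing. It uses `string=`, the newer name for the `text=` argument in recent BeautifulSoup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the page layout, not the real page.
SAMPLE_HTML = """
<table>
  <tr><td>Name:</td><td><b>AWO-Seniorenzentrum Kenten</b></td></tr>
  <tr><td>Straße:</td><td> Am Vogelwäldchen 2 </td></tr>
  <tr><td>Land:</td><td>Deutschland</td></tr>
</table>
"""


def get_field_value_or_default(soup, field, default=None):
    """Return the text of the cell that follows the '<field>:' label cell."""
    # Find the label cell whose only content is e.g. 'Straße:'
    label = soup.find("td", string=field + ":")
    if label is None:
        return default
    # The value sits in the next <td> within the same row
    value_cell = label.find_next_sibling("td")
    if value_cell is None:
        return default
    return value_cell.get_text(strip=True)


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect several fields at once; 'Fax' is absent and falls back to ''
wanted = ["Name", "Straße", "Land", "Fax"]
contact = {f: get_field_value_or_default(sample_soup, f, default="") for f in wanted}

street = contact["Straße"]
print(street)
```

Returning a default keeps a loop over many fields from stopping at the first label that a given page happens to lack; whether that is preferable to an exception depends on how strictly you want to validate each page.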


2014-11-23 18:32:47 alecxe

Thank you very much, it works perfectly. – 2014-11-23 21:32:00

@FreshPrince Glad to help, thanks. – alecxe 2014-11-23 21:36:15
