How do I extract table information using BeautifulSoup?

I am trying to scrape information from these pages. How can I extract the table information using BeautifulSoup?

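I need the information listed under Internship, Residency, and Fellowship. I know how to extract values from a table, but here I cannot decide which table to use: each heading (such as Internship) sits outside the table, as plain text under a div tag, and the table whose values I need comes after it. I also have many pages of this type, and not every page has all of these values. On some pages, for example, Residency may be missing entirely (which reduces the total number of tables on the page). One example of such a page is this. On this page Internship does not exist at all.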

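The main problem I am facing is that all of the tables have the same attribute values, so I cannot tell which table to use on different pages. If a value I am interested in is not present on a page, I have to return an empty string for that value.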

I am using BeautifulSoup in Python. Can someone point out how I could go about extracting these values?

2013-02-18 Steve

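It looks like the ids of the headings and of the data cells each have unique values with a standard suffix. You can use that to search for the appropriate values. Here is my solution: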

from bs4 import BeautifulSoup

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'

soup = BeautifulSoup(html, 'html.parser')
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    # With bs4, find('span', string=...) returns the <span> tag itself
    span = soup.find('span', string=heading)
    if span is not None:
        # The heading span and its data cell share an id prefix and
        # differ only in the suffix
        table_id = span['id'].replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        # Heading missing on this page: record an empty string
        values.append('')

print(list(zip(headings, values)))
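
To try the idea without hitting the network, here is a minimal self-contained sketch. It assumes the `bs4` package and Python 3. The sample markup and the `dnn_ctr1_` / `dnn_ctr2_` id prefixes are made up for illustration; only the `dnnTITLE_lblTitle` and `Display_HtmlHolder` suffixes come from the code above. The helper name `extract_sections` is hypothetical too. It also guards against a heading whose data cell is missing, which the real pages may or may not need.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id prefixes are invented, only the two suffixes
# ('dnnTITLE_lblTitle' and 'Display_HtmlHolder') are taken from the answer.
SAMPLE_HTML = """
<div><span id="dnn_ctr1_dnnTITLE_lblTitle">Residency</span></div>
<table><tr><td id="dnn_ctr1_Display_HtmlHolder">General Hospital, 2001</td></tr></table>
<div><span id="dnn_ctr2_dnnTITLE_lblTitle">Fellowship</span></div>
<table><tr><td id="dnn_ctr2_Display_HtmlHolder">City Clinic, 2004</td></tr></table>
"""

HEADINGS = ["Internship", "Residency", "Fellowship"]


def extract_sections(html, headings=HEADINGS):
    """Map each heading to the text of its data cell, or '' if absent."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in headings:
        # Find the <span> whose only content is exactly the heading text.
        span = soup.find("span", string=heading)
        cell = None
        if span is not None and span.get("id"):
            # Swap the heading suffix for the data-cell suffix to get the td id.
            cell_id = span["id"].replace("dnnTITLE_lblTitle", "Display_HtmlHolder")
            cell = soup.find("td", id=cell_id)
        # A missing heading or a missing cell both yield an empty string.
        result[heading] = cell.get_text(strip=True) if cell is not None else ""
    return result


sections = extract_sections(SAMPLE_HTML)
print(sections)
```

Since Internship is absent from the sample markup, its entry comes back as an empty string, which is the behaviour asked for in the question.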

2013-02-19 02:48:53 deadfoxygrandpa

Thanks, it worked! – Steve 2013-02-19 03:50:45
