Getting a specific table from a web page with BeautifulSoup

I want to get the data from the third table on http://www.dividend.com/dividend-stocks/. Here is my code; I need some help with it.

import requests
from bs4 import BeautifulSoup

url = "http://www.dividend.com/dividend-stocks/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")

# Skip first two tables
tables = soup.find("table")
tables = tables.find_next("table")
tables = tables.find_next("table")

row = ''
for td in tables.find_all("td"):
    if len(td.text.strip()) > 0:
        row = row + td.text.strip().replace('\n', ' ') + ','
        # Handle last column in a row, remove extra comma and add new line
        if td.get('data-th') == 'Pay Date':
            row = row[:-1] + '\n'
print(row)

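Is there a better way to skip the first two tables? Or is there a simple way to skip a large chunk of markup in Beautiful Soup? If so, how do I target it?
Somehow the order of the code's output differs from the order on the web page. The table on the page looks like this:

But the code's output looks like this: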

AAPL,Apple Inc.,1.76%,$143.39,$2.52,5/11,5/18 
GE,General Electric,3.32%,$28.91,$0.96,6/15,7/25 
XOM,Exxon Mobil,3.71%,$83.03,$3.08,5/10,6/9 
CVX,Chevron Corp,4.01%,$107.72,$4.32,5/17,6/12 
BP,BP PLC ADR,6.66%,$35.72,$2.38,5/10,6/23

What am I doing wrong? Thanks for your help!

2017-06-16 fuzzyworm

Please give your question a title that describes it more specifically. Most questions here are about code that doesn't work as expected. – Barmar

@Barmar Thanks for pointing that out. I'll be more careful next time. – fuzzyworm

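@fuzzyworm It looks like you are saving the rows as CSV, but three companies have a comma in their name, so you may want to put the company names in double quotes: 'Qualcomm, Inc', 'Banco Santander, S.A.', 'Activision Blizzard, Inc.' –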
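The quoting problem raised in the comment above can be left to the standard `csv` module instead of manual string concatenation. A minimal sketch, assuming the rows have already been extracted as lists of strings; the first row below is a made-up placeholder, not scraped data:

```python
import csv
import io

# The first row is a hypothetical placeholder with a comma in the name;
# real rows would come from the scraped table.
rows = [
    ["XYZ", "Example Holdings, Inc.", "1.23%"],
    ["AAPL", "Apple Inc.", "1.76%"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # fields containing a comma are quoted automatically
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a file works the same way: pass a file opened with `newline=''` to `csv.writer` in place of the `StringIO` buffer.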

You can use a selector to find a specific table:

tables = soup.select("table:nth-of-type(3)")

I don't know why your results are in a different order than they appear on the web page.
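One detail worth checking: `select()` returns a list of all matches, and `:nth-of-type(3)` is evaluated among siblings, so it matches any table that is the third `<table>` under its parent rather than the third table in the document. A small sketch on made-up HTML (the markup and `id` values are hypothetical, and `html.parser` is used only to avoid extra dependencies) compares it with indexing the result of `find_all`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table in the first <div>, three in the second.
html = """
<div><table id="a"></table></div>
<div>
  <table id="b"></table>
  <table id="c"></table>
  <table id="d"></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Third <table> among its siblings; the result is a list of matches.
by_selector = [t["id"] for t in soup.select("table:nth-of-type(3)")]

# Third <table> in document order, regardless of nesting.
by_index = soup.find_all("table")[2]["id"]

print(by_selector, by_index)
```

When all the tables share one parent, both approaches pick the same table. `soup.select_one(...)` returns a single element instead of a list.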

2017-06-16 22:54:17 Barmar

This part works great! Thanks. – fuzzyworm

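Although @Barmar's approach looks cleaner, here is an alternative that uses soup.find_all and saves the result to JSON (even though that wasn't part of the question).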

import json

import requests
from bs4 import BeautifulSoup

url = 'http://www.dividend.com/dividend-stocks/'
r = requests.get(url)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'lxml')
stocks = {}

# Skip first two tables and header row of target table
for tr in soup.find_all('table')[2].find_all('tr')[1:]:
    (stock_symbol, company_name, _, dividend_yield, current_price,
     annual_dividend, ex_dividend_date, pay_date) = [
        td.text.strip() for td in tr.find_all('td')]
    stocks[stock_symbol] = {
        'company_name': company_name,
        'dividend_yield': float(dividend_yield.rstrip('%')),
        'current_price': float(current_price.lstrip('$')),
        'annual_dividend': float(annual_dividend.lstrip('$')),
        'ex_dividend_date': ex_dividend_date,
        'pay_date': pay_date
    }

with open('stocks.json', 'w') as f:
    json.dump(stocks, f, indent=2)

2017-06-16 23:40:18

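Thanks for writing the code. There is always another way to solve a problem. I couldn't get it to work on my Windows machine, so it needs a bit of research. I'll ask for help if I can't figure it out. Thanks again. – fuzzyworm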

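No problem. I did use 'lxml' as the HTML parser (because I already had it installed). If that is the problem, you can either install it or change it back to the HTML parser you were using originally. –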

Yes, with a different parser everything works. Thanks. – fuzzyworm
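The parser issue discussed above can be handled explicitly in code. Beautiful Soup raises `bs4.FeatureNotFound` when the requested parser is not installed, so one option is to try parsers in order of preference and fall back to the built-in `html.parser`. A minimal sketch; the helper name `make_soup` and the preference order are assumptions for illustration:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable HTML parser found")


soup = make_soup("<table><tr><td>AAPL</td></tr></table>")
first_cell = soup.find("td").text
print(first_cell)
```

Different parsers can build slightly different trees from malformed HTML, so results may not be identical across them.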

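Thanks to @Barmar and @Delirious Lettuce for posting their solutions and code. Regarding the output order, I realized that every time the page is refreshed, the data first appears in the same order as my output, and only then is it shown sorted. After trying a few different approaches, I was able to use the Selenium webdriver to extract the data as rendered on the page. Thanks, everyone.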

BPT,BP Prudhoe Bay Royalty Trust,21.12%,$20.80,$4.39,4/11,4/20 
PER,Sandridge Permian Trust,18.06%,$2.88,$0.52,5/10,5/26 
CHKR,Chesapeake Granite Wash Trust,16.75%,$2.40,$0.40,5/18,6/1 
NAT,Nordic American Tankers,13.33%,$6.00,$0.80,5/18,6/8 
WIN,Windstream Corp,13.22%,$4.54,$0.60,6/28,7/17 
NYMT,New York Mortgage Trust Inc,12.14%,$6.59,$0.80,6/22,7/25 
IEP,Icahn Enterprises L.P.,11.65%,$51.50,$6.00,5/11,6/14 
FTR,Frontier Communications,11.51%,$1.39,$0.16,6/13,6/30
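The post above does not include the Selenium code, so the following is only a sketch of how it could look, not the author's actual script. It assumes Chrome and a matching driver are available, waits for a `<table>` to be present, and then hands the rendered HTML to Beautiful Soup. The function names, the ten-second timeout, and the choice of the third table are assumptions:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()


def table_rows(html, index=2):
    """Return the non-empty cell texts of each data row in the index-th table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        cells = [c for c in cells if c]  # drop empty cells
        if cells:  # header rows have no <td> and are skipped
            rows.append(cells)
    return rows


# Usage (needs a browser, so it is left commented out):
# html = fetch_rendered_html("http://www.dividend.com/dividend-stocks/")
# for row in table_rows(html):
#     print(row)
```

Waiting for a table to be present does not guarantee that client-side sorting has finished, so a more specific wait condition may be needed in practice.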

2017-06-19 16:10:01 fuzzyworm
