2016-12-07 80 views

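Python scraping: skipping <tr> tags and rows

I'm scraping a web page and running into "IndexError: list index out of range". I'm fairly sure it is caused by the header rows in the table I'm scraping: http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-20161205.html?mod=mdc_pastcalendar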

from urllib2 import urlopen 
import requests 
from bs4 import BeautifulSoup 
import re 
import datetime 

date = datetime.datetime.today() 
url = "http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-20161205.html?mod=mdc_pastcalendar"
date_time = urlopen(url.format(date=date.strftime('%Y%m%d'))) 
address = url 
print 'Retrieving information from: ' + address 
print '\n' 
soup = BeautifulSoup (requests.get(address).content, "lxml") 
div_main = soup.find('div', {'id': 'column0'}) 
table_one = div_main.find('table') 
rows = table_one.findAll('tr') 
if len(soup.findAll('tr')) > 0:
    rows = rows[2:]
#print rows 
for row in rows: 
    cells = row.findAll('td') 
    name = cells[0].get_text() 
    last = cells[1].get_text() 
    chg = cells[2].get_text() 
    pct_chg = cells[3].get_text() 
    money_flow = cells[4].get_text() 
    tick_up = cells[5].get_text() 
    tick_down = cells[6].get_text() 
    up_down_Ratio = cells[7].get_text() 
    money_flow = cells[8].get_text() 
    tick_up = cells[9].get_text() 
    tick_down = cells[10].get_text() 
    up_down_Ratio = cells[11].get_text() 

Answers

1

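Intermediate rows that contain only a single cell, such as "Dow Jones U.S. Total Stock Market", are the reason you get this error: for those rows, cells[1] does not exist.

Instead of indexing each cell by hand, why not predefine a list of headers and build a dictionary for each "data" row by zipping the headers with the cell values: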

rows = soup.select('div#column0 table tr')[2:] 

headers = ['name', 'last', 'chg', 'pct_chg', 
      'total_money_flow', 'total_tick_up', 'total_tick_down', 'total_up_down_ratio', 
      'block_money_flow', 'block_tick_up', 'block_tick_down', 'block_up_down_ratio'] 
for row in rows: 
    # skip non-data rows 
    if row.find("td", class_="pnum") is None: 
        continue

    print(dict(zip(headers, [cell.get_text(strip=True) for cell in row.find_all('td')]))) 
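
To see the skip-and-zip idea in isolation, here is a minimal sketch that runs without hitting the site. The HTML string is a hypothetical stand-in for the WSJ table (the real markup may differ), the header list is shortened to three columns, and parse_rows is an illustrative helper name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the WSJ table: two header rows, one single-cell
# section row, and one data row whose numeric cells carry class="pnum".
SAMPLE_HTML = """
<div id="column0"><table>
<tr><td>Money Flows</td></tr>
<tr><td>Name</td><td>Last</td><td>Chg</td></tr>
<tr><td colspan="3">Dow Jones U.S. Total Stock Market</td></tr>
<tr><td>Basic Materials</td><td class="pnum">100.5</td><td class="pnum">-1.2</td></tr>
</table></div>
"""

# Shortened header list, for this sample only
headers = ['name', 'last', 'chg']


def parse_rows(html):
    """Return one dict per data row, skipping rows without numeric cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select('div#column0 table tr')[2:]:
        # A section row has a single cell, so cells[1] would raise IndexError.
        if row.find('td', class_='pnum') is None:
            continue
        cells = row.find_all('td')
        records.append(dict(zip(headers, [c.get_text(strip=True) for c in cells])))
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

Because zip stops at the shorter of its two inputs, a row with fewer cells than headers would produce a shorter dictionary rather than an exception, which is part of why this approach is more forgiving than fixed indexing.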
+0

Thanks - I believe this will make it easier to store the values later on.

1
div_main = soup.find('div', {'id': 'column0'}) 
table_one = div_main.find('table') 

# to id the right row 
def target_row(tag): 
    is_row = len(tag.find_all('td')) > 5 
    row_name = tag.name == 'tr' 
    return is_row and row_name 

rows = table_one.find_all(target_row) 
for row in rows: 
    cells = row.findAll('td') 
    name = cells[0].get_text() 
    last = cells[1].get_text() 
    chg = cells[2].get_text() 
    pct_chg = cells[3].get_text() 
    money_flow = cells[4].get_text() 
    tick_up = cells[5].get_text() 
    tick_down = cells[6].get_text() 
    up_down_Ratio = cells[7].get_text() 
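    # cells 8-11 are the block-trade columns; distinct names avoid
    # overwriting the total-market values read from cells 4-7
    block_money_flow = cells[8].get_text() 
    block_tick_up = cells[9].get_text() 
    block_tick_down = cells[10].get_text() 
    block_up_down_ratio = cells[11].get_text() 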

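You can pass a function that returns a boolean as the argument to find_all. Beautiful Soup calls it for each tag and keeps the tags for which it returns True, which keeps your code clean and maintainable.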
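
As a self-contained illustration of a function filter, here is a small sketch on a made-up table. The sample HTML and the min_cells threshold of 3 are assumptions chosen to fit this tiny example; the answer above uses "more than 5 cells" for the real page.

```python
from bs4 import BeautifulSoup

# Made-up table: one single-cell section row followed by two data rows
SAMPLE_HTML = """
<table>
<tr><td>Dow Jones U.S. Total Stock Market</td></tr>
<tr><td>Oil &amp; Gas</td><td>1</td><td>2</td></tr>
<tr><td>Technology</td><td>3</td><td>4</td></tr>
</table>
"""


def is_data_row(tag, min_cells=3):
    # find_all calls this once per tag; returning True keeps the tag.
    return tag.name == 'tr' and len(tag.find_all('td')) >= min_cells


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = [row.find('td').get_text(strip=True) for row in soup.find_all(is_data_row)]
print(names)
```

Checking tag.name first means the cell count is only computed for tr tags, since the filter is also called for every td and table tag in the document.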

+0

Thanks - would you happen to know how to use datetime to build date-based URLs dynamically, so I can fetch data going back over a range, e.g. to 20000101?
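
For the date-range question in the comment above, one possible sketch is to put a {date} placeholder into the URL and step through the calendar with timedelta. It assumes that pages for other days differ only in the YYYYMMDD part of the URL, which the thread does not confirm; it is also unknown how far back the site's archive goes. URL_TEMPLATE and daily_urls are illustrative names, and market holidays are not handled.

```python
from datetime import date, timedelta

# Assumption: each day's page differs only in the YYYYMMDD part of the URL.
URL_TEMPLATE = ("http://www.wsj.com/mdc/public/page/"
                "2_3022-mfsctrscan-moneyflow-{date}.html?mod=mdc_pastcalendar")


def daily_urls(start, end):
    """Yield one URL per weekday from start to end, inclusive."""
    current = start
    while current <= end:
        if current.weekday() < 5:  # Monday=0 ... Friday=4, so weekends are skipped
            yield URL_TEMPLATE.format(date=current.strftime('%Y%m%d'))
        current += timedelta(days=1)


urls = list(daily_urls(date(2016, 12, 2), date(2016, 12, 6)))
for u in urls:
    print(u)
```

Each generated URL could then be passed to requests.get in a loop, ideally with a check of the response status, since a page may not exist for every weekday.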