2017-02-03

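python beautifulsoup: scraping archived pages

I know there is a datetime problem, but I don't know where. When I try to scrape the older tables, the data I get back just cycles through today's data. I think I need another enclosing loop to reach the older pages.

How can I fix this?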

from urlparse import urljoin 
from urllib2 import urlopen 
import requests 
from bs4 import BeautifulSoup 
import re 
from datetime import datetime, timedelta 

url = "http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-{}.html?mod=mdc_pastcalendar" 
start = datetime.today() 

def only_weekdays_range(start, n): 
    i = 0 
    wk_days = {0, 1, 2, 3, 4} 
    while i != n: 
     while start.weekday() not in wk_days: 
      start -= timedelta(days=1) 
     yield start 
    i += 1 
    start -= timedelta(days=1) 


for _ in (only_weekdays_range(start, 5)): 
    print ("data for {}".format(start.strftime("%b %d %y"))) 
    url = url.format(start.strftime('%Y%m%d')) 
    print 'Retrieving information from: ' + url 
    print '\n' 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 
    div_main = soup.find('div', {'id': 'column0'}) 
    table_one = div_main.find('table') 
    def target_row(tag): 
     is_row = len(tag.find_all('td')) > 5 
     row_name = tag.name == 'tr' 
     return is_row and row_name 

    rows = table_one.find_all(target_row)[1:] 
#print rows 
    for row in rows: 
     cells = row.findAll('td') 
     industry = cells[0].get_text() 
     data = { 
      'name' : cells[0].get_text() 
     } 
     print data 
     print '\n' 

Answer


You have two variables named start:

  • global: start = datetime.today()
  • local: the start parameter of def only_weekdays_range(start, n):

Inside the function you change the local start

start -= timedelta(days=1) 

and you return it with yield, so it gets assigned to _ in for _ in ..., but you never use it. You use the global start, which never changes.
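A minimal sketch of that behaviour, with a fixed date instead of datetime.today() so the result is reproducible. The helper count_back and the list seen are hypothetical names used only for this illustration; weekends are ignored here to keep it short.

```python
from datetime import datetime, timedelta

def count_back(start, n):
    # Hypothetical helper: yields n consecutive dates, newest first.
    for _ in range(n):
        yield start
        start -= timedelta(days=1)   # rebinds only the local name "start"

start = datetime(2017, 2, 3)         # the global "start"
seen = []
for _ in count_back(start, 3):
    seen.append(_)                   # the yielded value differs on every pass

print(start.strftime("%Y%m%d"))                  # the global is still 20170203
print([d.strftime("%Y%m%d") for d in seen])      # three different dates
```

Formatting the global start inside the loop therefore gives the same date every time, while the yielded value walks backwards.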

You have to use the value that the loop receives (here the loop variable is renamed from _ to new_date):

for new_date in (only_weekdays_range(start, 5)): 
    print ("data for {}".format(new_date.strftime("%b %d %y"))) 
    url = url.format(new_date.strftime('%Y%m%d')) 
    print 'Retrieving information from: ' + url 

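You also have to fix the wrong indentation in the function: i += 1 and the final start -= timedelta(days=1) must be inside the outer while loop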

def only_weekdays_range(start, n): 
    i = 0 
    wk_days = {0, 1, 2, 3, 4} 
    while i != n: 
     while start.weekday() not in wk_days: 
      start -= timedelta(days=1) 
     yield start 
     i += 1 
     start -= timedelta(days=1) 
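To illustrate what the original indentation does, here is a rough sketch that reproduces it under the hypothetical name broken_range. Because i += 1 sits outside the outer while loop, the counter never changes and the generator yields the same date forever, so itertools.islice is used to cut it off. The fixed start date is an assumption made for reproducibility.

```python
from datetime import datetime, timedelta
from itertools import islice

def broken_range(start, n):
    # Same logic as the question, with the question's indentation mistake.
    i = 0
    while i != n:
        while start.weekday() > 4:
            start -= timedelta(days=1)
        yield start
    # These two lines are outside the loop, so they never run while n > 0.
    i += 1
    start -= timedelta(days=1)

# Ask for 5 dates but take 7: the generator does not stop on its own.
first_seven = list(islice(broken_range(datetime(2017, 2, 3), 5), 7))
print(len(first_seven))        # number of values taken
print(len(set(first_seven)))   # number of distinct dates among them
```

All seven values are the same date, which matches the symptom of the loop cycling through a single day's data.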

Working example (it uses the yielded value as new_date):

from datetime import datetime, timedelta 

# --- functions --- 

def only_weekdays_range(start, n): 
    one_day = timedelta(days=1) 
    for _ in range(n): 
     while start.weekday() > 4: 
      start -= one_day 
     yield start 
     start -= one_day 

# --- main --- 

start = datetime.today() 

for new_date in only_weekdays_range(start, 10): 
    print ("data for {}".format(new_date.strftime("%b %d %y %a"))) 

Result:

data for Feb 03 17 Fri 
data for Feb 02 17 Thu 
data for Feb 01 17 Wed 
data for Jan 31 17 Tue 
data for Jan 30 17 Mon 
data for Jan 27 17 Fri 
data for Jan 26 17 Thu 
data for Jan 25 17 Wed 
data for Jan 24 17 Tue 
data for Jan 23 17 Mon 

EDIT: the same function with if instead of while:

def only_weekdays_range(start, n): 
    one_day = timedelta(days=1) 
    for _ in range(n): 
     weekday = start.weekday() 
     if weekday > 4: 
      start -= one_day * (weekday-4) 
     yield start 
     start -= one_day 
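As a quick sanity check, the two variants can be compared over two weeks of start dates, so that every weekday and both weekend days are covered. This is only a sketch: the names weekdays_while and weekdays_if are introduced here to keep both versions side by side, and the base date is an arbitrary assumption.

```python
from datetime import datetime, timedelta

def weekdays_while(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:        # step back one day at a time
            start -= one_day
        yield start
        start -= one_day

def weekdays_if(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        weekday = start.weekday()
        if weekday > 4:                   # Sat -> 1 day back, Sun -> 2 days back
            start -= one_day * (weekday - 4)
        yield start
        start -= one_day

base = datetime(2017, 2, 5)               # a Sunday
mismatches = []
for offset in range(14):
    s = base - timedelta(days=offset)
    if list(weekdays_while(s, 10)) != list(weekdays_if(s, 10)):
        mismatches.append(s)

print(mismatches)                         # empty if both versions agree
```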

EDIT: I see another problem

url = url.format(...) 

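You overwrite url with the formatted string, so the {} placeholder is gone and in the next iteration format() can no longer change the date.

Use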

full_url = url.format(...) 

r = requests.get(full_url) 
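The effect can be checked without any network access. The sketch below assumes the URL template from the question; the names url_template, first, second and full_urls are only for illustration. The first part repeats what the question does, the second part keeps the template untouched and builds a new string for each date.

```python
from datetime import datetime, timedelta

url_template = ("http://www.wsj.com/mdc/public/page/"
                "2_3022-mfsctrscan-moneyflow-{}.html?mod=mdc_pastcalendar")

# What the question does: the formatted result replaces the template.
url = url_template
url = url.format("20170203")
first = url
url = url.format("20170202")    # no {} left in the string, nothing changes
second = url
print(first == second)

def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:
            start -= one_day
        yield start
        start -= one_day

# The fix: format the untouched template into a separate name each time.
full_urls = [url_template.format(d.strftime("%Y%m%d"))
             for d in only_weekdays_range(datetime(2017, 2, 3), 3)]
for full_url in full_urls:
    print(full_url)             # in the real loop: requests.get(full_url)
```

The comparison prints True, which is why every request in the question went to the first date, while full_urls contains a different date in each entry.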

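So the "Retrieving information from" text is changing, but the URL does not advance to return data from the older dates. It keeps returning today's data while saying it is from the older date. –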

Same problem: you have to use 'new_date' instead of 'start', that is the obvious part. BTW: see the first code block, it uses 'new_date' in 'url.format(new_date.strftime('%Y%m%d'))'. – furas

Now I see another problem: 'url = url.format(...)' overwrites 'url', so in the next iteration you cannot change the date. Use 'full_url = url.format(...)'. – furas