Reading from Yahoo Finance using BS4

I want to use the code below to read historical CSV data from a Yahoo Finance URL:

import datetime
import time
from bs4 import BeautifulSoup

# Unix timestamps for one year ago (per1) and for now (per2)
per1 = str(int(time.mktime((datetime.datetime.today() - datetime.timedelta(days=365)).timetuple())))
per2 = str(int(time.mktime(datetime.datetime.today().timetuple())))
url = 'https://query1.finance.yahoo.com/v7/finance/download/MSFT?period1=' + per1 + '&period2=' + per2 + '&interval=1d&events=history&crumb=OQg/YFV3fvh'

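You can see the value of the url variable by going to Yahoo Finance, typing in a ticker, and hovering the mouse over the "Download Data" button.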
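The URL construction described above can also be sketched with the standard library, which avoids manual string concatenation. The helper name `build_download_url` and its `days` and `now` parameters are illustrative assumptions, not part of the original code. `urlencode` percent-escapes the crumb (for example `/` becomes `%2F`); whether the server requires that is not something the question states.

```python
import datetime
from urllib.parse import urlencode


def build_download_url(ticker, crumb, days=365, now=None):
    """Build the CSV download URL for `ticker` covering the last `days` days.

    Hypothetical helper: the endpoint and parameter names are copied from
    the URL in the question; everything else is an assumption.
    """
    now = now or datetime.datetime.now()
    start = now - datetime.timedelta(days=days)
    params = {
        "period1": int(start.timestamp()),  # local-time Unix timestamp
        "period2": int(now.timestamp()),
        "interval": "1d",
        "events": "history",
        "crumb": crumb,  # urlencode escapes characters such as "/"
    }
    base = "https://query1.finance.yahoo.com/v7/finance/download/"
    return base + ticker + "?" + urlencode(params)


url = build_download_url("MSFT", "OQg/YFV3fvh")
print(url)
```

The crumb shown here is the expired one from the question, so the resulting URL is only useful for checking the format.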

I get an authentication error, which I believe is due to missing cookies, so I tried the following:

import requests
ses = requests.Session()
url1 = 'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'
ses.get(url1)  # visit the history page first so the session picks up cookies
soup = BeautifulSoup(ses.get(url).content, 'html.parser')
print(soup.prettify())

This time I get an incorrect cookie error.

Can someone suggest how to solve this?

Source

2017-10-28 Zanam

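Can you post the stack trace of the error? – alex

Note that there is a [pandas library](https://pandas-datareader.readthedocs.io) for this. It works very well. –

I had come across threads showing errors when pulling Yahoo Finance data, but that seems to have been fixed. Thanks. The accepted answer is also a good learning tool, I must say. – Zanam

The crumb parameter of the query string keeps changing, probably with every browser session. So if you copy its value from the browser, close the browser, and then use the value in another browser instance, it will have expired by then.

It should therefore come as no surprise that when you use it in a requests session, the server does not recognise the cookie value and returns an error.

Step 1

Studying the Network tab in any browser will help. In this particular case, the crumb is probably generated when you click on a ticker on the main page. So you have to fetch that URL first.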

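import requests

s = requests.Session()  # one session for all requests, so cookies persist
tickers = ('000001.SS', 'NKE', 'MA', 'SBUX')
url = 'https://finance.yahoo.com/quote/{0}?p={0}'.format(tickers[0])
r = s.get(url, headers=req_headers)  # req_headers is defined in the note at the end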

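This URL only needs to be fetched once, so it does not matter which ticker you use for it.

Step 2

The response returned by the server contains the value that has to be passed in the crumb query-string parameter when downloading the CSV file.

However, that value sits inside a script tag of the page returned by the previous request. This means you cannot use BeautifulSoup alone to extract the crumb value.

I initially tried to extract it from the text of the script tag with re, but for some reason I could not. So I switched to json for the parsing.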

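import json
import re
from bs4 import BeautifulSoup

# Locate the script text that mentions the crumb
soup = BeautifulSoup(r.content, 'lxml')
script_tag = soup.find(text=re.compile('crumb'))
# Slice out the JSON object embedded in that script and parse it
response_dict = json.loads(script_tag[script_tag.find('{"context":'):script_tag.find('}}}};') + 4])
crumb = response_dict['context']['dispatcher']['stores']['CrumbStore']['crumb']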

Note that BeautifulSoup is still needed to extract the contents of the script element, which are then passed to json to be parsed into a Python dict object.
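The slicing step can be tried offline on a small sample. The sketch below assumes the script text has the shape used in the answer (a JSON object starting at `{"context":` and ending at `}}}};`). `SAMPLE_SCRIPT` and `extract_crumb` are hypothetical names, and the sample is a heavily shortened stand-in, not real page content. It also shows one thing json handles that a plain regular expression would not: JSON escapes such as `\u002F` are decoded to `/`.

```python
import json

# Hypothetical, heavily shortened stand-in for the text of the script tag.
SAMPLE_SCRIPT = (
    'root.App.main = {"context":{"dispatcher":{"stores":'
    '{"CrumbStore":{"crumb":"OQg\\u002FYFV3fvh"}}}}};'
)


def extract_crumb(script_text):
    """Slice the JSON object out of the script text and read the crumb."""
    start = script_text.find('{"context":')
    end = script_text.find('}}}};')
    if start == -1 or end == -1:
        raise ValueError("expected markers not found in script text")
    # end + 4 keeps the four closing braces and drops the trailing ";"
    data = json.loads(script_text[start:end + 4])
    return data["context"]["dispatcher"]["stores"]["CrumbStore"]["crumb"]


print(extract_crumb(SAMPLE_SCRIPT))
```

Raising `ValueError` when the markers are missing is a design choice of this sketch; it gives a clearer failure than the `json` error that a bad slice would otherwise produce.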

I had to use pprint to print the resulting dict to a file to see exactly where the crumb value was stored.
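A minimal sketch of that inspection step, assuming a nested dict like the one parsed in Step 2. The `response_dict` below is a made-up stand-in, and `find_key_paths` is a hypothetical helper that searches for a key instead of reading the dump by eye.

```python
import os
import tempfile
from pprint import pprint


def find_key_paths(obj, key, path=()):
    """Yield the path (tuple of keys/indexes) to every occurrence of `key`."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield path + (k,)
            yield from find_key_paths(v, key, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key_paths(v, key, path + (i,))


# Hypothetical stand-in for the parsed page data.
response_dict = {"context": {"dispatcher": {"stores": {
    "CrumbStore": {"crumb": "OQg/YFV3fvh"},
    "QuoteSummaryStore": {"symbol": "MSFT"},
}}}}

# Dump the whole structure to a file for manual inspection...
dump_path = os.path.join(tempfile.gettempdir(), "response_dict.txt")
with open(dump_path, "w") as f:
    pprint(response_dict, stream=f)

# ...or search it for the key directly.
print(list(find_key_paths(response_dict, "crumb")))
```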

Step 3

The final URL, which fetches the CSV file, looks like this:

for ticker in tickers: 
    csv_url = 'https://query1.finance.yahoo.com/v7/finance/download/{0}?period1=1506656676&period2=1509248676&interval=1d&events=history&crumb={1}'.format(ticker, crumb) 

    r = s.get(csv_url, headers = req_headers)

Result

Here are the first few lines of one of the downloaded files:

Date,Open,High,Low,Close,Adj Close,Volume 
2017-09-29,3340.311035,3357.014893,3340.311035,3348.943115,3348.943115,144900 
2017-10-09,3403.246094,3410.169922,3366.965088,3374.377930,3374.377930,191700 
2017-10-10,3373.344971,3384.025879,3358.794922,3382.988037,3382.988037,179400
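Once the CSV text is available, it can be parsed with the standard csv module. This is a sketch only: `parse_history` is a hypothetical helper, the sample text is the three rows shown above, and with a real download you would presumably pass `r.text` instead.

```python
import csv
import io

# The first rows of the downloaded file, as shown above.
CSV_TEXT = """Date,Open,High,Low,Close,Adj Close,Volume
2017-09-29,3340.311035,3357.014893,3340.311035,3348.943115,3348.943115,144900
2017-10-09,3403.246094,3410.169922,3366.965088,3374.377930,3374.377930,191700
2017-10-10,3373.344971,3384.025879,3358.794922,3382.988037,3382.988037,179400
"""


def parse_history(csv_text):
    """Turn the CSV text into a list of dicts with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "date": row["Date"],
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        })
    return rows


for entry in parse_history(CSV_TEXT):
    print(entry["date"], entry["close"], entry["volume"])
```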

Note:

I used proper headers in both requests. So if you skip that part and do not get the desired results, you may have to include them as well.

req_headers = { 
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 
    'accept-encoding': 'gzip, deflate, br', 
    'accept-language': 'en-US,en;q=0.8', 
    'upgrade-insecure-requests': '1', 
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' 
}

Source

2017-10-29 04:27:37 Mahesh

"Studying the Network tab in any browser will help": which particular variable do you look for under the Network tab? – Zanam
