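Scraping data through a paginated table using Python

I am scraping stock data from the Google Finance historical page (http://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=PLfUVIDTDuSRiQKhwYGQBQ).

I can scrape the 30 rows on the current page. The problem I am facing is that I cannot get to the rest of the data in the table (rows 31-241). How do I go to the next page or link? Here is my code: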

import urllib2
import xlwt  # to write into an Excel spreadsheet
from bs4 import BeautifulSoup

# Main coding section

stock_links = open('stock_link_list.txt', 'r')  # open the text file for reading

#url="https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=zHXOVLPnApG2iALxxYCADQ"
for url in stock_links:
    url = url.strip()  # drop the trailing newline read from the file
    OurFile = urllib2.urlopen(url)
    OurHtml = OurFile.read()
    OurFile.close()
    soup = BeautifulSoup(OurHtml)
    #soup1 = soup.find("div", {"class": "gf-table-wrapper sfe-break-bottom-16"}).get_text()
    soup1 = soup.find("table", {"class": "gf-table historical_price"}).get_text()

    end = url.index('&')
    filename = url[47:end]
    file = open(filename, 'w')  # open the output text file for writing
    file.write(soup1)
    #file.write(soup1.get_text())  # writing to the text file
    file.close()  # close the output text file


2015-02-06 NitheshKHP

You will have to fine-tune it, and I would catch more specific errors, but you can keep incrementing start to get the next batch of data:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=W8LUVLHnAoOswAOFs4DACg&start={}&num=30"

# Main coding section
start = 0
while True:
    try:
        nxt = url.format(start)
        r = requests.get(nxt)
        soup = BeautifulSoup(r.content, "html.parser")
        # When a page has no such table, find() returns None and
        # get_text() raises, which ends the loop.
        print(soup.find("table", {"class": "gf-table historical_price"}).get_text())
    except Exception as e:
        print(e)
        break
    start += 30

This gets all the data in the table, up to the last date, Feb 7:

...... 

Date 
Open 
High 
Low 
Close 
Volume 

Feb 7, 2014 
552.60 
557.90 
548.25 
551.50 
119,711
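The output above is the table flattened to one value per line by get_text(). If you would rather keep each row together, a minimal sketch along these lines may help. The function name table_rows and the sample markup are made up for illustration; only the table class comes from the code above, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return the historical-price table as a list of rows of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "gf-table historical_price"})
    if table is None:
        # No such table on this page: nothing to return.
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are treated alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample markup, shaped like the output shown above.
sample = """
<table class="gf-table historical_price">
<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th></tr>
<tr><td>Feb 7, 2014</td><td>552.60</td><td>557.90</td><td>548.25</td><td>551.50</td><td>119,711</td></tr>
</table>
"""

for row in table_rows(sample):
    print("\t".join(row))
```

Each row comes back as a list of strings, so it can be passed straight to csv.writer or written to a spreadsheet.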


2015-02-06 14:12:01

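Thanks, Padraic C. Your answer taught me something new today. I added "&start={}" to my existing list of links, and it works like a charm. Since I lack reputation points, I cannot upvote your answer. The day I have the points, I will come back here and upvote this awesome answer. – NitheshKHP 2015-02-06 16:34:59

@NitheshKHP, no worries. – 2015-02-06 16:36:48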

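At first glance, the Row Limit option allows at most 30 rows per page, but after manually changing the query-string parameter to a larger number, I found that up to 200 rows can be displayed per page.

Change the URL to

https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=OM3UVLFtkLnzBsjIgYAI&start=0&num=200

and it will show 200 rows. Then change it to start=200&num=400 for the next batch.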
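To avoid editing the query string by hand, the start and num parameters can be set programmatically. This is a minimal sketch using only the standard library; page_url is a hypothetical helper name, and it assumes num is the number of rows per page (so it stays at 200 while start advances), which is how the parameters appear to behave in this thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, start, num=200):
    """Return base_url with its start and num query parameters set."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["start"] = str(start)  # index of the first row to show
    query["num"] = str(num)      # assumed to be the rows-per-page count
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=OM3UVLFtkLnzBsjIgYAI&start=0&num=200"

# URLs for rows 0-199 and 200-399.
for start in range(0, 400, 200):
    print(page_url(base, start))
```

The same helper works on links that do not yet carry start or num, since missing parameters are simply added.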

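More generally, though, if you have many other links of this kind, you can scrape the pagination area (the last TR), grab the links to the next pages from it, and scrape those.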
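A rough sketch of that idea, assuming the pagination links sit in the last tr element of the page as described. The function name pagination_links and the sample markup are invented for illustration, so the selector will likely need adjusting against the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def pagination_links(html, page_url):
    """Return absolute URLs of the links in the last <tr> of the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return []
    links = []
    for a in rows[-1].find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])  # resolve relative hrefs
        if absolute not in links:                # keep order, drop repeats
            links.append(absolute)
    return links


# Hypothetical markup: a data row followed by a pagination row.
sample_page = """
<table>
<tr><td>Feb 7, 2014</td><td>552.60</td></tr>
<tr><td>
<a href="/finance/historical?q=NSE%3ASIEMENS&amp;start=30&amp;num=30">2</a>
<a href="/finance/historical?q=NSE%3ASIEMENS&amp;start=60&amp;num=30">3</a>
</td></tr>
</table>
"""

current = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS"
for link in pagination_links(sample_page, current):
    print(link)
```

Each returned URL can then be fetched and parsed the same way as the first page.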


2015-02-06 14:24:01 Umair

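Thanks, Umair. Following your suggestion, I experimented with the URL, and it helped me improve my code. – NitheshKHP 2015-02-06 16:38:58

You should add code showing how to scrape the links, etc. – 2015-02-06 17:04:21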
