Efficient web scraping in Python?

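I'm fairly new to Python and I'm trying to build a web parser for a stock app. I basically use urllib to open the required web page for each stock in the argument list and read the full contents of that page's HTML. Then I slice it to find the quote I'm looking for. The method I implemented works, but I doubt it is the most efficient way to achieve this result. I've spent some time looking into other, potentially faster ways of reading files, but none of them seem to pertain to web scraping. Here is my code: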

from urllib.request import urlopen

def getQuotes(stocks):
    quoteList = {}
    for stock in stocks:
        html = urlopen("https://finance.google.com/finance?q={}".format(stock))
        webpageData = html.read()
        scrape1 = webpageData.split(str.encode('<span class="pr">\n<span id='))[1].split(str.encode('</span>'))[0]
        scrape2 = scrape1.split(str.encode('>'))[1]
        quote = bytes.decode(scrape2)
        quoteList[stock] = float(quote)
    return quoteList

print(getQuotes(['FB', 'GOOG', 'TSLA']))

Thanks in advance!

Asked 2017-09-12 by Chase Shankula

Check out Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). – Mako212

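I would work with the 'requests' package rather than 'urllib' directly. I'd expect the code above to run pretty fast, doesn't it? When you have a lot of requests, you can look into multithreading. Depending on the code, that should speed things up nicely. – Andras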

Oh yes, and check out Beautiful Soup or lxml, as mentioned above. – Andras
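
To illustrate the parser suggestion in the comments above, here is a minimal sketch that replaces the byte slicing with a real HTML parser. It uses the standard library's html.parser as a stand-in for Beautiful Soup or lxml so that it runs without third-party packages. The names PriceParser and parse_quote and the sample markup are assumptions made for illustration; the markup is only modelled on the strings the question splits on, not copied from the actual page.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text inside the first <span class="pr"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside the target span
        self.done = False     # True once the target span has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'span' or self.done:
            return
        if self.depth:
            self.depth += 1   # nested span inside the target
        elif 'pr' in (dict(attrs).get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def parse_quote(html):
    """Return the quote as a float, or None if no price span is found."""
    parser = PriceParser()
    parser.feed(html)
    text = ''.join(parser.chunks).strip().replace(',', '')
    return float(text) if text else None


# Hypothetical markup modelled on the split() strings in the question.
sample = '<div><span class="pr">\n<span id="ref_1">1,160.86</span>\n</span></div>'
print(parse_quote(sample))
```

Unlike the split-based approach, this does not depend on the exact whitespace between the two span tags, and it returns None instead of raising IndexError when the price element is missing.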

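"I basically use urllib to open the required web page for each stock in the argument list and read the full contents of that page's HTML. Then I slice it to find the quote I'm looking for."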

Here is an implementation using Beautiful Soup and requests:

import requests
from bs4 import BeautifulSoup

def get_quotes(*stocks):
    quotelist = {}
    base = 'https://finance.google.com/finance?q={}'
    for stock in stocks:
        url = base.format(stock)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        quote = soup.find('span', attrs={'class' : 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist

print(get_quotes('AAPL', 'GE', 'C'))
{'AAPL': 160.86, 'GE': 23.91, 'C': 68.79}
# 1 loop, best of 3: 1.31 s per loop

As mentioned in the comments, you may also want to look at multithreading or grequests.

Using grequests to make asynchronous HTTP requests:

import grequests  # import before requests, since it monkey-patches via gevent
from bs4 import BeautifulSoup

def get_quotes(*stocks):
    quotelist = {}
    base = 'https://finance.google.com/finance?q={}'
    # Build the unsent requests lazily, then send them all concurrently.
    rs = (grequests.get(u) for u in [base.format(stock) for stock in stocks])
    rs = grequests.map(rs)
    for r, stock in zip(rs, stocks):
        soup = BeautifulSoup(r.text, 'html.parser')
        quote = soup.find('span', attrs={'class' : 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist

%%timeit
get_quotes('AAPL', 'BAC', 'MMM', 'ATVI',
           'PPG', 'MS', 'GOOGL', 'RRC')
1 loop, best of 3: 2.81 s per loop

Update: here is a modified version, adapted from Dusty Phillips' Python 3 Object-Oriented Programming, that uses the built-in threading module.

from threading import Thread

from bs4 import BeautifulSoup
import numpy as np
import requests


class QuoteGetter(Thread):
    def __init__(self, ticker):
        super().__init__()
        self.ticker = ticker

    def run(self):
        base = 'https://finance.google.com/finance?q={}'
        response = requests.get(base.format(self.ticker))
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            self.quote = float(soup.find('span', attrs={'class':'pr'})
                               .get_text()
                               .strip()
                               .replace(',', ''))
        except AttributeError:
            self.quote = np.nan


def get_quotes(tickers):
    threads = [QuoteGetter(t) for t in tickers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    quotes = dict(zip(tickers, [thread.quote for thread in threads]))
    return quotes

tickers = [
    'A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI',
    'ADM', 'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN',
    'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK', 'ALL', 'ALLE',
    ]

%time get_quotes(tickers)
# Wall time: 1.53 s
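
A similar fan-out can probably be written with the standard library's concurrent.futures.ThreadPoolExecutor instead of a Thread subclass. The sketch below is not part of the original answer: get_quotes_pooled, fetch_quote, fake_fetch and the max_workers default are hypothetical names and values chosen for illustration. fetch_quote stands for any function that takes a ticker and returns its quote, for example a wrapper around the request-and-parse code shown above.

```python
from concurrent.futures import ThreadPoolExecutor


def get_quotes_pooled(tickers, fetch_quote, max_workers=8):
    """Call fetch_quote(ticker) for every ticker on a pool of worker threads.

    fetch_quote is a hypothetical callable supplied by the caller; it is
    expected to take a ticker string and return that ticker's quote.
    """
    tickers = list(tickers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip() pairs them correctly.
        quotes = list(pool.map(fetch_quote, tickers))
    return dict(zip(tickers, quotes))


def fake_fetch(ticker):
    """Stand-in for a network call so the sketch runs offline."""
    return float(len(ticker))


print(get_quotes_pooled(['FB', 'GOOG', 'TSLA'], fake_fetch))
```

Compared with starting one thread per ticker, a pool caps the number of simultaneous connections, which may matter once the ticker list grows well beyond the 30 symbols used here.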

Answered 2017-09-12 21:11:53

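Your first solution with BeautifulSoup actually ended up slightly slower than my original implementation... but oh boy, pairing it with grequests does the trick! Much faster results. Thanks again! –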

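@ChaseShankula Yes, not surprised - BeautifulSoup isn't especially known for its speed. In this case, what takes up the time is the underlying request and the parser. What BS4 is useful for is pulling multiple pieces of data out of a document tree (http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html). Have a read through the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) when you can; it will come in handy at some point down the road. –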

@ChaseShankula Updated to use 'threading' instead of 'grequests', because I ran into some problems with it. –
