High memory usage with Python multithreading

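I'm scraping web pages using multithreading and random proxies. My home computer handles this fine with however many threads are required (in the current code I have it set to 100), and memory usage seems to peak at around 2.5GB. However, when I run this on my CentOS VPS, I get a generic 'Killed' message and the program terminates. With 100 threads running, I get the Killed error very quickly. I lowered the count to a more reasonable 8 and still got the same error, only after a longer period. Based on some research, I assume the 'Killed' error is related to memory usage. Without multithreading, the error does not occur.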

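So what can I do to optimize my code so that it still runs quickly but doesn't use too much memory? Is my best bet simply to reduce the number of threads further? And can I monitor my memory usage from within Python while the program is running?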

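Edit: I just realized my VPS has only 256MB of RAM, versus 24GB on my desktop, which is something I didn't consider when I originally wrote the code.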

import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

# working_proxies and user_agents are lists defined elsewhere in the script.

# Request soup of url, using random proxy/user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            # headers must be passed by keyword: the second positional argument of requests.get is params
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)
            soup = BeautifulSoup(r.text, "html5lib")  # "html.parser"
            # Looks for totalpages to verify proper page load
            totalpages = int(soup.find("div", class_="pagination").text.split(' of ', 1)[1].split('\n', 1)[0])
            currentpage = int(soup.find("div", class_="pagination").text.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000:  # One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print('Error! Proxy: {}, Error msg: {}'.format(proxy, e))
            attempts = attempts + 1
            if attempts > 30:
                print('Too many attempts .. something is wrong!')
                sys.exit()
    return (soup, totalpages, currentpage)

#Return soup of page of ads, connecting via random proxy/user agent 
def scrape_url(url): 
    soup, totalpages, currentpage = getsoup(url)    
    #Extract ads from page soup 

    ###[A bunch of code to extract individual ads from the page..] 

    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages) 
    sys.stdout.flush() 
    return ads  

def scrapeall():  
    global currentpage, totalpages, offset 
    url = "url" 

    _, totalpages, _ = getsoup(url + "0") 
    url_list = [url + str(60*i) for i in range(totalpages)] 

    # Make the pool of workers 
    pool = ThreadPool(100)  
    # Open the urls in their own threads and return the results 
    results = pool.map(scrape_url, url_list) 
    # Close the pool and wait for the work to finish 
    pool.close() 
    pool.join() 

    flatten_results = [item for sublist in results for item in sublist] #Flattens the list of lists returned by multithreading 
    return flatten_results 

adscrape = scrapeall()


2016-02-27 Testy8

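Most likely, with only 256MB of RAM, the process would be killed for excessive memory usage even without multithreading. Keep in mind that not all of that 256MB is even available to your program. Depending on the page, scraping can use a lot of memory.

Do you want to queue the requests? – user3159253

Peter, what can I do to reduce memory usage? I've removed multithreading and yes, it still crashes. – Testy8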

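BeautifulSoup is a pure-Python library, and on a mid-sized website it will eat a lot of memory. If it's an option, try replacing it with a faster alternative written in C. If the pages are large, it may still run out of memory.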

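As suggested in the comments, you can use queue.Queue to store the responses. A better version would be to save the responses to disk, store the file names in a queue, and parse them in a separate process. For this you can use the multiprocessing library. If parsing runs out of memory and is killed, fetching still continues. This pattern is called fork and die, and it is a common workaround for Python using too much memory.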
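A minimal sketch of that layout, under some assumptions that are not from the question: it targets Python 3 on a POSIX system (it asks multiprocessing for the "fork" start method explicitly), the page bodies are passed in as strings where the real scraper would pass r.text, and TitleParser is only a hypothetical stand-in for the real ad-extraction code. The names save_response, parse_worker and run_pipeline are made up for illustration.

```python
import json
import multiprocessing as mp
import os
import tempfile
from html.parser import HTMLParser

# Assumption: a POSIX system (such as the CentOS VPS) where "fork" is available.
ctx = mp.get_context("fork")


class TitleParser(HTMLParser):
    """Hypothetical stand-in for the real parsing step: collects <title> text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def save_response(text, directory):
    """Write one response body to disk and return its file name."""
    fd, path = tempfile.mkstemp(suffix=".html", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)
    return path


def parse_worker(filenames):
    """Runs in a child process: parse files until the None sentinel arrives."""
    while True:
        path = filenames.get()
        if path is None:
            break
        parser = TitleParser()
        with open(path, encoding="utf-8") as f:
            parser.feed(f.read())
        parser.close()
        # Store the small parsed result next to the raw page.
        with open(path + ".json", "w", encoding="utf-8") as out:
            json.dump({"title": parser.title}, out)


def run_pipeline(pages, directory):
    """Save each page to disk and let a separate process do the parsing."""
    filenames = ctx.Queue()
    worker = ctx.Process(target=parse_worker, args=(filenames,))
    worker.start()
    saved = []
    for text in pages:  # in the scraper, this loop would do the fetching
        path = save_response(text, directory)
        saved.append(path)
        filenames.put(path)  # only the file name travels through the queue
    filenames.put(None)  # sentinel: no more work
    worker.join()
    titles = []
    for path in saved:
        result = path + ".json"
        if os.path.exists(result):
            with open(result, encoding="utf-8") as f:
                titles.append(json.load(f)["title"])
        else:
            titles.append(None)  # the parser never finished this page
    return titles
```

The fetching side only ever holds one response body at a time, and the parser's memory is returned to the operating system when the child process exits. A page that ends up with None still has its raw file on disk, so it can be retried later.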

You will then also need a way to see which responses failed to parse.
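One possible way to do that, sketched here under the assumption of Python 3 on a POSIX system, is to parse each saved file in its own short-lived child process and inspect Process.exitcode afterwards. An exit code of 0 means the child finished normally, 1 means it died from an uncaught exception, and a negative value -N means it was terminated by signal N (the kernel's out-of-memory killer sends SIGKILL, so -9). The helper names parse_in_child and find_failures are made up for illustration, and parse is whatever parsing function you supply.

```python
import multiprocessing as mp

# Assumption: a POSIX system where the "fork" start method is available.
ctx = mp.get_context("fork")


def parse_in_child(parse, path):
    """Run parse(path) in its own process and return the child's exit code."""
    child = ctx.Process(target=parse, args=(path,))
    child.start()
    child.join()
    return child.exitcode


def find_failures(parse, paths):
    """Return (path, exitcode) for every file whose parse did not exit cleanly."""
    failures = []
    for path in paths:
        code = parse_in_child(parse, path)
        if code != 0:
            failures.append((path, code))
    return failures
```

Starting one process per file costs some speed, but each failure is tied to exactly one file, and the failed file names can be written to a log or fed back into the queue for a retry with a lighter parser.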


2016-02-27 23:41:33 hruske
