Python multithreading high memory usage
I am scraping web pages using multithreading and random proxies. My home PC handles this fine, though it takes a lot of processes (in the current code I have it set to 100); memory usage seems to peak around 2.5 GB. However, when I run it on my CentOS VPS I get a generic 'Killed' message and the program terminates. With 100 processes running I hit the Killed error very quickly. I lowered it to a more reasonable 8 but still got the same error, just after a longer time. Based on some research, I assume the 'Killed' error is related to memory usage; without multithreading the error does not occur.
So, what can I do to optimize my code so it still runs quickly but doesn't use so much memory? Is my best bet simply to reduce the number of processes further? And can I monitor my memory usage from within Python while the program is running?
Edit: I just realized that my VPS has 256 MB of RAM, as opposed to the 24 GB on my desktop, which I didn't take into account when I originally wrote the code.
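On the in-program monitoring question: on Linux, the standard-library `resource` module can report the process's peak resident set size, which is enough to watch memory creep toward a 256 MB ceiling (a minimal sketch, not tied to the scraper code; the helper name is made up):

```python
import sys
import resource

def peak_memory_mb():
    """Return this process's peak resident set size in megabytes.

    On Linux, ru_maxrss is reported in kilobytes; on macOS it is in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024.0 * 1024.0)
    return rss / 1024.0

print("Peak memory: %.1f MB" % peak_memory_mb())
```

Calling `peak_memory_mb()` periodically from the worker loop would show whether memory grows with the number of pages scraped or with the number of threads.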
import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

# Request soup of url, using random proxy/user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            # headers must be passed as a keyword argument; positionally it would be treated as params
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)
            soup = BeautifulSoup(r.text, "html5lib")  # "html.parser"
            totalpages = int(soup.find("div", class_="pagination").text.split(' of ', 1)[1].split('\n', 1)[0])  # Looks for totalpages to verify proper page load
            currentpage = int(soup.find("div", class_="pagination").text.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000:  # One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print 'Error! Proxy: {}, Error msg: {}'.format(proxy, e)
            attempts += 1
            if attempts > 30:
                print 'Too many attempts .. something is wrong!'
                sys.exit()
    return (soup, totalpages, currentpage)
# Return soup of page of ads, connecting via random proxy/user agent
def scrape_url(url):
    soup, totalpages, currentpage = getsoup(url)

    # Extract ads from page soup
    ###[A bunch of code to extract individual ads from the page..]

    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages)
    sys.stdout.flush()
    return ads
def scrapeall():
    global currentpage, totalpages, offset

    url = "url"
    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60 * i) for i in range(totalpages)]

    # Make the pool of workers
    pool = ThreadPool(100)
    # Open the urls in their own threads and return the results
    results = pool.map(scrape_url, url_list)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()

    flatten_results = [item for sublist in results for item in sublist]  # Flattens the list of lists returned by multithreading
    return flatten_results

adscrape = scrapeall()
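One memory-saving option is to stop holding every page's result list at once: `pool.map` materializes the full list of lists before returning, while `imap_unordered` yields each page's result as soon as it is ready. A sketch with a small pool (the function name `scrapeall_streaming` is hypothetical; it assumes the worker returns a list of ads per page, like `scrape_url` above):

```python
from multiprocessing.dummy import Pool as ThreadPool

def scrapeall_streaming(urls, scrape, workers=8):
    """Scrape pages with a small thread pool, consuming each page's
    results as they arrive instead of collecting a full list of lists."""
    pool = ThreadPool(workers)
    all_ads = []
    try:
        # imap_unordered yields results in completion order, so only the
        # pages currently in flight are held in memory at any moment.
        for page_ads in pool.imap_unordered(scrape, urls):
            all_ads.extend(page_ads)
    finally:
        pool.close()
        pool.join()
    return all_ads
```

On a 256 MB machine, the pool size and the parser matter more than raw speed; `html5lib` is notably heavier than `"html.parser"`, which the commented-out alternative in `getsoup` already hints at.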
Most likely, with only 256 MB of RAM the process would be killed for excessive memory use even without multithreading. Bear in mind that not even all of those 256 MB are available. Depending on the pages, scraping can use a lot of memory. –
Do you want to queue the requests? – user3159253
Peter, what can I do to reduce memory usage? I've removed the multithreading and yes, it still crashes – Testy8
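The queueing suggestion above could be sketched with the standard-library `queue` and `threading` modules: a fixed number of workers pull URLs from a shared queue, so no matter how many URLs are pending, only `workers` requests are in flight (the `handle` callable is a hypothetical stand-in for `scrape_url`):

```python
import queue
import threading

def run_queued(urls, handle, workers=4):
    """Process urls with a bounded number of worker threads pulling
    from a shared queue; results are appended under a lock."""
    tasks = queue.Queue()
    for u in urls:
        tasks.put(u)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                u = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            r = handle(u)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

This bounds concurrency the same way a small `ThreadPool` does, but makes the queue explicit, which also makes it easy to add per-URL retries or rate limiting later.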