2016-10-01

How quickly does python-requests close its sockets?

I'm trying to do some work with Python requests. Here is my code:

import threading
import resource
import time
import sys
import requests

#maximum Open File Limit for thread limiter. 
maxOpenFileLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0] # For example, it shows 50. 

# Will use one session for every Thread. 
requestSessions = requests.Session() 
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status. 
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100)) 
requestSessions.mount('http://', adapter) 
requestSessions.mount('https://', adapter) 

def threadAction(a1, a2):
    global number
    time.sleep(1)  # My actions with Requests for each thread.
    number = number + 1
    print number

number = 0 # Count of complete actions 

ThreadActions = []  # Action tasks.
for i in range(50):  # I have 50 websites I need to do in parallel threads.
    a1 = i
    for n in range(10):  # Every website I need to do in 10 threads.
        a2 = n
        ThreadActions.append(threading.Thread(target=threadAction, args=(a1, a2)))


for item in ThreadActions:
    # But I can't do more than 50 Threads at once, because of maxOpenFileLimit.
    while True:
        # Thread limiter, analogue of BoundedSemaphore.
        if threading.activeCount() < maxOpenFileLimit:
            item.start()
            break

for item in ThreadActions: 
    item.join() 

But the thing is, once I reach 50 threads, the thread limiter starts waiting for some thread to finish its work. And here is the problem: after the script gets to the limiter, `lsof -i | grep python | wc -l` shows far fewer than 50 active connections, while before the limiter it showed all <= 50 of them. Why does this happen? Or should I use requests.close() instead of requests.session() to stop it from holding sockets that are already done?


Your thread limiter goes into a tight loop and eats most of the processing time. Try slowing it down with something like `sleep(.1)`. Better yet, use a queue limited to 50 entries and have your threads read requests from it. – tdelaney
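The bounded-queue pattern tdelaney suggests can be sketched roughly like this (Python 3 syntax; the worker body, the task tuples, and the pool size of 5 are placeholders for the real per-site request logic):

```python
import queue
import threading

def worker(tasks, results):
    # Each worker pulls tasks until it sees the None sentinel, then exits.
    while True:
        item = tasks.get()
        if item is None:
            break
        a1, a2 = item
        results.append((a1, a2))  # placeholder for the real request work

tasks = queue.Queue(maxsize=50)  # producer blocks once 50 tasks are queued
results = []
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(5)]
for w in workers:
    w.start()

for i in range(10):
    tasks.put((i, 0))  # blocks instead of busy-waiting when the queue is full

for _ in workers:
    tasks.put(None)    # one sentinel per worker
for w in workers:
    w.join()

print(len(results))  # 10
```

The key difference from the busy-wait loop above is that `Queue.put` and `Queue.get` block on a condition variable, so idle threads consume no CPU while they wait.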


On raising your user's OS-level limit, look up [ulimit](http://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles) and [fs.file-max](https://cs.uwaterloo.ca/~brecht/servers/openfiles.html). After that, for raising the limit from inside Python, look up [setrlimit](https://coderwall.com/p/ptq7rw/increase-open-files-limit-and-drop-privileges-in-python). And of course, make sure you are not needlessly running a busy while-loop and that you multiplex your code properly. – blackpen
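A minimal sketch of the `setrlimit` approach blackpen mentions, assuming a hypothetical target of 4096 descriptors (an unprivileged process can only raise its soft limit up to the hard limit; raising the hard limit itself requires root):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Pick a target soft limit, capped at the hard limit.  4096 is an
# arbitrary illustrative value, not a recommendation.
if hard == resource.RLIM_INFINITY:
    target = 4096
else:
    target = min(4096, hard)

# Only the soft limit changes; the hard limit is passed through unchanged.
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

After this call, `resource.getrlimit(resource.RLIMIT_NOFILE)[0]` reports the new soft limit, which is what the question's `maxOpenFileLimit` reads.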


Yes, I understand, and in my real script I use BoundedSemaphore. But why, after the script reaches the limit, does `lsof -i | grep python | wc -l` show a much lower number? – passwd

Answer


Your limiter is a tight loop that takes up most of the processing time. Use a thread pool to limit the number of workers instead.

import multiprocessing.pool
import resource
import time
import requests

# Carried over from the question's setup.
maxOpenFileLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]

# Will use one session for every Thread. 
requestSessions = requests.Session() 
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status. 
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100)) 
requestSessions.mount('http://', adapter) 
requestSessions.mount('https://', adapter) 

def threadAction(a1, a2):
    global number
    time.sleep(1)  # My actions with Requests for each thread.
    number = number + 1  # DEBUG: this update is not thread safe without a
    print number         # lock; better to return values and sum pool.map's result.

number = 0 # Count of complete actions 

pool = multiprocessing.pool.ThreadPool(50)

ThreadActions = []  # Action tasks.
for i in range(50):  # I have 50 websites I need to do in parallel threads.
    a1 = i
    for n in range(10):  # Every website I need to do in 10 threads.
        a2 = n
        ThreadActions.append((a1, a2))

pool.map(lambda args: threadAction(*args), ThreadActions, chunksize=1)
pool.close()
pool.join()

Does multiprocessing work faster than threading? How does it affect processor load? – passwd


It's a tradeoff... and Windows is different from Linux. With multiprocessing, data needs to be serialized between parent and child (and on Windows, much more context typically has to be serialized, because the child does not get a clone of the parent's memory space), but you don't have to worry about the GIL. Higher CPU load and/or lower data overhead make multiprocessing work out better. But if you are mostly I/O bound, a thread pool is fine. – tdelaney
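For the I/O-bound case described above, `multiprocessing.pool.ThreadPool` offers the `Pool` API without the inter-process serialization cost. A rough sketch, where `fetch` is a hypothetical stand-in for the real request call and the task tuples mirror the question's `(a1, a2)` pairs:

```python
from multiprocessing.pool import ThreadPool

def fetch(args):
    # Stand-in for an I/O-bound request; unpacks an (a1, a2) tuple.
    a1, a2 = args
    return a1 * 10 + a2

# Threads share memory, so arguments and results are passed by reference;
# a multiprocessing.Pool would have to pickle them across process boundaries.
pool = ThreadPool(4)
results = pool.map(fetch, [(i, n) for i in range(3) for n in range(2)])
pool.close()
pool.join()
print(results)  # [0, 1, 10, 11, 20, 21]
```

Swapping `ThreadPool` for `multiprocessing.Pool` keeps the same `map`/`close`/`join` API but moves the workers into child processes, which only pays off when the work is CPU bound enough to be limited by the GIL.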