2017-10-04 101 views

I am using the FCC's API to convert latitude/longitude coordinates into block group codes:

import pandas as pd 
import numpy as np 
import urllib.request
import time 
import json 

# getup, getup1, and getup2 make up the url to the api 
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude=' 

getup1 = '&longitude=' 

getup2 = '&showall=false' 

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839', 
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153', 
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869', 
'32.7554883','42.331427','31.7775757','35.1495343'] 

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215', 
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286', 
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942', 
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801'] 

#make lat and long in to a Pandas DataFrame 
latlong = pd.DataFrame([lat,long]).transpose() 
latlong.columns = ['lat','long'] 

new_list = [] 

def block(x):
    for index, row in x.iterrows():
        # request the url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        # parse the json output into a Python dict
        a1 = json.loads(a)
        # append the block FIPS code to the global list
        new_list.append(a1['Block']['FIPS'])

#call the function with latlong as the argument.   
block(latlong) 

#print the list, note: it is important that function appends to the list 
print(new_list) 

This gives the following output:

['360610031001021', '060372074001033', '170318391001104', '482011000003087', 
'421010005001010', '040131141001032', '480291101002041', '060730053003011', 
'481130204003064', '060855010004004', '484530011001092', '180973910003057', 
'120310010001023', '060750201001001', '390490040001005', '371190001005000', 
'484391233002071', '261635172001069', '481410029001001', '471570042001018'] 
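
The values above are 15-digit census block codes, while the goal stated in the question is block group codes. Assuming the standard Census layout (2-digit state, 3-digit county, 6-digit tract, 4-digit block, with the block group given by the first digit of the block), the block group code is the first 12 digits. A minimal sketch of that split; the helper name `split_block_fips` is hypothetical and not part of the original script:

```python
def split_block_fips(fips):
    """Split a 15-digit census block FIPS code into its components."""
    if len(fips) != 15 or not fips.isdigit():
        raise ValueError('expected a 15-digit block FIPS code: %r' % (fips,))
    return {
        'state': fips[:2],
        'county': fips[2:5],
        'tract': fips[5:11],
        'block': fips[11:15],
        # The block group is identified by the first digit of the block,
        # so its full code is the first 12 digits of the block code.
        'block_group': fips[:12],
    }
```

Applied to the list printed above, something like `[split_block_fips(f)['block_group'] for f in new_list]` would give one block group code per coordinate pair.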

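The problem with this script is that I can only call the API once per row. The script takes about 5 minutes to run, which is not acceptable for the 1,000,000 entries I plan to use it on.

I would like to use multiprocessing to parallelize this function and reduce its run time. I have tried reading the multiprocessing manual, but I have not been able to figure out how to run the function in parallel and append its output to an empty list.

FYI: I am using Python 3.6.

Any guidance would be great!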

Hey, you might want to take a look at the [Python GIL](https://wiki.python.org/moin/GlobalInterpreterLock). Most of the time, using parallelism in Python increases computation time rather than reducing it. – Tbaki

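Since you are IO bound, threads make sense here; you will have to restructure your problem to avoid appending to a global list. The docs are a good place to start - https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example – chrisb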

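@Tbaki 'multiprocessing' is not affected by the GIL; in fact, it exists to provide a 'threading'-like API for creating multiple processes in order to *bypass* the GIL's limitation. As @chrisb points out, though, since this code is IO bound, 'threading' would not be limited by the GIL either. –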
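
The suggestion in the comments (threads for IO-bound work, no appending to a global list) can be sketched with only the standard library. This is a minimal sketch, not the asker's or any commenter's code: the names `build_url`, `fetch_fips`, and `lookup_all` are hypothetical, the endpoint is the one from the question and may no longer be served at that address, and the worker count of 10 is an arbitrary assumption.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Endpoint taken from the question; assumed to still accept these parameters.
BASE_URL = 'http://data.fcc.gov/api/block/find'


def build_url(lat, lon):
    """Assemble the query URL for one coordinate pair."""
    query = urlencode({'format': 'json', 'latitude': lat,
                       'longitude': lon, 'showall': 'false'})
    return BASE_URL + '?' + query


def fetch_fips(pair):
    """Request one (lat, lon) pair and return its block FIPS code."""
    lat, lon = pair
    with urllib.request.urlopen(build_url(lat, lon)) as response:
        return json.loads(response.read())['Block']['FIPS']


def lookup_all(pairs, fetch=fetch_fips, max_workers=10):
    """Run fetch over all pairs in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs, so the
        # workers never have to append to a shared list.
        return list(pool.map(fetch, pairs))
```

With the lists from the question, a call such as `lookup_all(list(zip(lat, long)))` would return the codes in the same order as the input rows. The `fetch` parameter is there so the network call can be swapped out, for example when testing offline.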

Answers

1

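You don't have to implement the parallelism yourself. There are better libraries than urllib, e.g. requests [0], and some spin-offs [1] that use threads or futures. I guess you will need to check for yourself which one is fastest.

Because it has few dependencies, I like requests-futures best; here is my implementation of your code using ten threads. The library even supports processes, if you believe or find out that they are somehow better in your case: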

import pandas as pd 
import numpy as np 
import urllib 
import time 
import json 
from concurrent.futures import ThreadPoolExecutor 

from requests_futures.sessions import FuturesSession 

#getup, getup1, and getup2 make up the url to the api 
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude=' 

getup1 = '&longitude=' 

getup2 = '&showall=false' 

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839', 
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153', 
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869', 
'32.7554883','42.331427','31.7775757','35.1495343'] 

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215', 
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286', 
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942', 
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801'] 

#make lat and long in to a Pandas DataFrame 
latlong = pd.DataFrame([lat,long]).transpose() 
latlong.columns = ['lat','long'] 

def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        # build the url and queue the request; session.get returns a future
        url = getup + row['lat'] + getup1 + row['long'] + getup2
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        # wait for the response and parse its json body
        a1 = json.loads(request.result().content)
        # append the block FIPS code to the result list
        new_list.append(a1['Block']['FIPS'])
    return new_list

#call the function with latlong as the argument.   
new_list = block(latlong) 

#print the list returned by block()
print(new_list) 

[0] http://docs.python-requests.org/en/master/

[1] https://github.com/kennethreitz/grequests

This works great! I went from 5 minutes per thousand to 1 minute per thousand. –

Well, I would be happy if you accepted my answer :) – mkastner