2017-04-06 74 views
0

我有腳本查看屬於一個組(REG_ID)的行和列標題並將這些值相加。該代碼上以矩陣(小的子集)中運行如下:根據所有組合的索引和分組創建內部python循環

Outputs

我的代碼運行良好,用於計算基於行和列屬於每個內部組(REG_ID)的所有ID的總和。例如,對屬於REG_ID 1的所有行和列ID進行求和,以便計算區域1和區域1之間的總流量(內部流量),以此類推每個區域。 我希望通過計算(彙總)區域之間的流程來擴展此代碼,例如區域1到區域2,3,4,5 ...... 我想我需要在現有的while循環中包含另一個循環,但會真的很感謝幫助找出它應該在哪裏以及如何構建它。 (1-1 2-2 3-3等) 我的代碼目前運行在內部流動總和如下:

global index 
index = 1 
x = index 
while index < len(idgroups): 
    ward_list = idgroups[index] #select list of ward ids for each region from list of lists 
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list 
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list 
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers. 
    ward_listint = map(int, ward_list) 
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint] 
    df7 = df6.loc[:, ward_liststr] 
    print df7 
    regflowsum = df7.values.sum() #sum all values in dataframe 
    intflow = [regflowsum] 
    print intflow 
    dfintflow = pd.DataFrame(intflow) 
    dfintflow.reset_index(level=0, inplace=True) 
    dfintflow.columns = ["RegID", "regflowsum"] 
    dfflows.set_value(index, 'RegID', index) 
    dfflows.set_value(index, 'RegID2', index) 
    dfflows.set_value(index, 'regflow', regflowsum) 
    mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum) 
    index += 1 #increment index number 
print dfflows 
new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2']) 
print new_df #useful for checking dataframe merges 
regionflows = r"C:\Temp\AllNI\regionflows.csv" 
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"] 
mergedcsv.to_csv(regionflows, columns = header, index=False) 
regregflows = r"C:\Temp\AllNI\reg_regflows.csv" 
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"] 

pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv" 
new_df.to_csv(pairlistCSV) 

輸出如下:

idgroups數據幀:(見圖像1 - 圖像1)

DF7和intflows每個區域REG_ID的第二部分:(圖像1的第三部分 - 在右邊)

ddflows數據幀:)(圖像2的第四部分

並且最終輸出是new_df:(圖像2的第五部分)

我希望填充區域之間的所有可能的流量組合,而不僅僅是內部流量。

我想我需要在while循環中添加另一個循環。所以有可能增加一個枚舉功​​能,如:

while index < len(idgroups): 
    #add line(s) to calculate flows between regions 
    for index, item in enumerate(idgroups): 
     ward_list = idgroups[index] 
     print ward_list 
     df6 = mergedcsv.loc[ward_list] #select rows with values in the list 
     dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list 
     ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers. 
     ward_listint = map(int, ward_list) 
     #dfrowscols = mergedcsv.loc[ward_list, ward_listint] 
     df7 = df6.loc[:, ward_liststr] 
     print df7 
     regflowsum = df7.values.sum() #sum all values in dataframe 
     intflow = [regflowsum] 
     print intflow 
     dfintflow = pd.DataFrame(intflow) 
     dfintflow.reset_index(level=0, inplace=True) 
     dfintflow.columns = ["RegID", "regflowsum"] 
     dfflows.set_value(index, 'RegID', index) 
     dfflows.set_value(index, 'RegID2', index) 
     dfflows.set_value(index, 'regflow', regflowsum) 
     mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum) 
     index += 1 #increment index number 

我不能確定如何將項目集成,因此努力擴大代碼的所有組合。任何意見讚賞。基於流

更新功能:

w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL') 
Simil = pysal.open("C:/Temp/AllNI/simNI.csv") 
Similarity = np.array(Simil) 
db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r') 
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r') 
ids = np.array((dbf.by_col['LABEL'])) 
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO'])) 
commutersint = commuters.astype(int) 
comm = commutersint[0] 
floor = int(MIN_COM_CT + 100) 
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm) 
regions = solution.regions 
#print regions 
writecsv = r"C:\Temp\AllNI\reg_output.csv" 
csv = open(writecsv,'w') 
csv.write('"LABEL","REG_ID"\n') 
for i in range(len(regions)): 
     for lines in regions[i]: 
      csv.write('"' + lines + '","' + str(i+1) + '"\n') 
csv.close() 
flows = r"C:\Temp\SQLite\MatrixCSV2.csv" 
regs = r"C:\Temp\AllNI\reg_output.csv" 
wardflows = pd.read_csv(flows) 
regoutput = pd.read_csv(regs) 
merged = pd.merge(wardflows, regoutput) 
#duplicate REG_ID column as the index to be used later 
merged['REG_ID2'] = merged['REG_ID'] 
merged.to_csv("C:\Temp\AllNI\merged.csv", index=False) 
mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column 
flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values 

reg_id = "REG_ID" 
ward_flows = "RegIntFlows" 
flds = [reg_id, ward_flows] #create list of fields to be use in search 

dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields 


#group the dataframe by the REG_ID column 
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist()) 
print idgroups 

idgrp_df = pd.DataFrame(idgroups) 

csvcols = mergedcsv.columns 

#create a list of column names to pass as an index to select columns 
columnlist = list(mergedcsv.columns.values) 

mergedcsvgroup = mergedcsv.groupby('REG_ID').sum() 
mergedcsvgroup.describe() 
idList = idgroups[2] 
df4 = pd.DataFrame() 
df5 = pd.DataFrame() 
col_ids = idList #ward id no 

regiddf = idgroups.index.get_values() 
print regiddf 
#total number of region ids 
#print regiddf 
#create pairlist combinations from region ids 
#combinations with replacement allows for repeated items 
#pairs = list(itertools.combinations_with_replacement(regiddf, 2)) 
pairs = list(itertools.product(regiddf, repeat=2)) 
#print len(pairs) 

#create a new dataframe with pairlists and summed data 
pairlist = pd.DataFrame(pairs,columns=['origID','destID']) 
print pairlist.tail() 
header_pairlist = ["origID","destID","flow"] 
header_intflow = ["RegID", "RegID2", "regflow"] 
dfflows = pd.DataFrame(columns=header_intflow) 

print mergedcsv.index 
print mergedcsv.dtypes 
#mergedcsv = mergedcsv.select_dtypes(include=['int64']) 
#print mergedcsv.columns 
#mergedcsv.rename(columns = lambda x: int(x), inplace=True) 

def flows(): 
    pass 

#def flows(mergedcsv, region_a, region_b): 
def flows(mergedcsv, ward_lista, ward_listb): 
    """Return the sum of all the cells in the row/column intersections 
    of ward_lista and ward_listb.""" 

    mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64'] 
    regionflows = mergedcsv.loc[ward_lista, ward_listb] 
    regionflowsum = regionflows.values.sum() 


    #grid = [ax, bx, regflowsuma, regflowsumb] 
    gridoutput = [ax, bx, regionflowsum] 
    print gridoutput 

    return regflowsuma 
    return regflowsumb 

#print mergedcsv.index 

#mergedcsv.columns = mergedcsv.columns.str.strip() 

for ax, group_a in enumerate(idgroups): 
    ward_lista = map(int, group_a) 
    print ward_lista 


    for bx, group_b in enumerate(idgroups[ax:], start=ax): 
     ward_listb = map(int, group_b) 
     #print ward_listb 

     flow_ab = flows(mergedcsv, ward_lista, ward_listb) 
      #flow_ab = flows(mergedcsv, group_a, group_b) 

這導致KeyError異常:

我試圖 '的[[189,197,198,201]]是在[列]無'同樣使用ward_lista = map(str,group_a)和map(int,group_a),但列出在dataframe.loc中找不到的對象。 這些列是混合數據類型,但包含應切片標籤的所有列都是int64類型。 我已經嘗試了許多圍繞數據類型的解決方案,但無濟於事。有什麼建議麼?

回答

0

我不能說你正在做的計算,但它好像你只是想安排組的組合。問題是它們是否是有向或無向的 - 也就是說,是否需要計算流量(A,B)和流量(B,A),還是隻計算一個流量?

如果只有一個,你可以這樣做:

for i,ward_list in enumerate(idgroups): 
    for j,ward_list2 in enumerate(idgroups[i:],start=i): 

這將遍歷我,像Ĵ對:

0,0 0,1 0,2 ... 0,n 
1,1 1,2 ... 1,n 
2,2 ... 2,n 

這將成爲在無向的情況下。如果你需要計算流量(A,B)和流量(B,A),那麼你可以簡單地將你的代碼推入一個名爲flows的函數中,並用反向參數調用它,如圖所示。 ;-)

更新

讓我們定義一個名爲flows功能:

def flows(): 
    pass 

現在,有什麼參數?

好吧,看着你的代碼,它從DataFrame獲取數據。你需要兩個不同的病房,所以我們從這些開始。結果似乎是結果網格的總和。

def flows(df, ward_a, ward_b): 
    """Return the sum of all the cells in the row/column intersections 
    of ward_a and ward_b.""" 

    return 0 

現在我複製你的代碼行:

ward_list = idgroups[index] 
    print ward_list 
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list 
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list 
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers. 
    ward_listint = map(int, ward_list) 
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint] 
    df7 = df6.loc[:, ward_liststr] 
    print df7 
    regflowsum = df7.values.sum() #sum all values in dataframe 
    intflow = [regflowsum] 
    print intflow 

我想這是大多數flow功能的就在這裏。我們看看吧。

  1. ward_list顯然會任ward_award_b參數。

  2. 我不確定df6是什麼,因爲你在df7重新計算它。所以需要澄清。

  3. regflowsum是我們期望的輸出,我想。

,改寫本進入功能:

def flows(df, ward_a, ward_b): 
    """Return the sum of all the cells in the row/column intersections 
    of ward_a and ward_b.""" 

    print "Computing flows from:" 
    print " ", ward_a 
    print "" 
    print "flows into:" 
    print " ", ward_b 

    # Filter rows by ward_a, cols by ward_b: 
    grid = df.loc[ward_a, ward_b] 

    print "Grid:" 
    print grid 

    flowsum = grid.values.sum() 

    print "Flows:", flowsum 

    return flowsum 

現在,我已經假定ward_award_b值已經在正確的格式。所以我們不得不str - 它們或任何功能外。讓我們這樣做:

for ax, group_a in enumerate(idgroups): 
    ward_a = map(str, group_a) 

    for bx, group_b in enumerate(idgroups[ax:], start=ax): 
     ward_b = map(str, group_b) 

     flow_ab = flows(mergedcsv, ward_a, ward_b) 

     if ax != bx: 
      flow_ba = flows(mergedcsv, ward_b, ward_a) 
     else: 
      flow_ba = flow_ab 

     # Now what? 

在這一點上,你有兩個數字。當病房相同時(內部流量?),它們將是平等的。在這一點上,你原來的代碼不再有用,因爲它只處理內部流,而不是A-> B流,所以我不知道該怎麼做。但值是在變量,所以...

+0

奧斯汀,謝謝。這是一個很大的幫助。我希望計算A到B和B到A的流量。我還沒有遇到流量功能,所以我們會研究這一點。 –

+0

沒有流量功能。你必須使用你的計算來編寫它。 –

+0

奧斯汀,你的建議真的很有幫助。添加流程功能非常合乎邏輯。我根據您的建議更新了代碼。我似乎有一個數據類型錯誤或索引錯誤。我已經嘗試了很多嘗試來解決這個問題,包括映射到int和選擇int64類型的所有列,但仍然有錯誤。你能發現任何明顯的錯誤嗎?非常感謝您的建議 –