我有腳本查看屬於一個組(REG_ID)的行和列標題並將這些值相加。該代碼上以矩陣(小的子集)中運行如下:根據所有組合的索引和分組創建內部python循環
我的代碼運行良好,用於計算基於行和列屬於每個內部組(REG_ID)的所有ID的總和。例如,對屬於REG_ID 1的所有行和列ID進行求和,以便計算區域1和區域1之間的總流量(內部流量),以此類推每個區域。 我希望通過計算(彙總)區域之間的流程來擴展此代碼,例如區域1到區域2,3,4,5 ...... 我想我需要在現有的while循環中包含另一個循環,但會真的很感謝幫助找出它應該在哪裏以及如何構建它。 (1-1 2-2 3-3等) 我的代碼目前運行在內部流動總和如下:
global index
index = 1
x = index
while index < len(idgroups):
ward_list = idgroups[index] #select list of ward ids for each region from list of lists
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
dfintflow = pd.DataFrame(intflow)
dfintflow.reset_index(level=0, inplace=True)
dfintflow.columns = ["RegID", "regflowsum"]
dfflows.set_value(index, 'RegID', index)
dfflows.set_value(index, 'RegID2', index)
dfflows.set_value(index, 'regflow', regflowsum)
mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
index += 1 #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2'])
print new_df #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"]
mergedcsv.to_csv(regionflows, columns = header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]
pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)
輸出如下:
idgroups數據幀:(見圖像1 - 圖像1)
DF7和intflows每個區域REG_ID的第二部分:(圖像1的第三部分 - 在右邊)
ddflows數據幀:)(圖像2的第四部分
並且最終輸出是new_df:(圖像2的第五部分)
我希望填充區域之間的所有可能的流量組合,而不僅僅是內部流量。
我想我需要在while循環中添加另一個循環。所以有可能增加一個枚舉功能,如:
while index < len(idgroups):
#add line(s) to calculate flows between regions
for index, item in enumerate(idgroups):
ward_list = idgroups[index]
print ward_list
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
dfintflow = pd.DataFrame(intflow)
dfintflow.reset_index(level=0, inplace=True)
dfintflow.columns = ["RegID", "regflowsum"]
dfflows.set_value(index, 'RegID', index)
dfflows.set_value(index, 'RegID2', index)
dfflows.set_value(index, 'regflow', regflowsum)
mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
index += 1 #increment index number
我不能確定如何將項目集成,因此努力擴大代碼的所有組合。任何意見讚賞。基於流
更新功能:
w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL')
Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
Similarity = np.array(Simil)
db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r')
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
ids = np.array((dbf.by_col['LABEL']))
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
commutersint = commuters.astype(int)
comm = commutersint[0]
floor = int(MIN_COM_CT + 100)
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
regions = solution.regions
#print regions
writecsv = r"C:\Temp\AllNI\reg_output.csv"
csv = open(writecsv,'w')
csv.write('"LABEL","REG_ID"\n')
for i in range(len(regions)):
for lines in regions[i]:
csv.write('"' + lines + '","' + str(i+1) + '"\n')
csv.close()
flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
regs = r"C:\Temp\AllNI\reg_output.csv"
wardflows = pd.read_csv(flows)
regoutput = pd.read_csv(regs)
merged = pd.merge(wardflows, regoutput)
#duplicate REG_ID column as the index to be used later
merged['REG_ID2'] = merged['REG_ID']
merged.to_csv("C:\Temp\AllNI\merged.csv", index=False)
mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values
reg_id = "REG_ID"
ward_flows = "RegIntFlows"
flds = [reg_id, ward_flows] #create list of fields to be use in search
dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields
#group the dataframe by the REG_ID column
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
print idgroups
idgrp_df = pd.DataFrame(idgroups)
csvcols = mergedcsv.columns
#create a list of column names to pass as an index to select columns
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
mergedcsvgroup.describe()
idList = idgroups[2]
df4 = pd.DataFrame()
df5 = pd.DataFrame()
col_ids = idList #ward id no
regiddf = idgroups.index.get_values()
print regiddf
#total number of region ids
#print regiddf
#create pairlist combinations from region ids
#combinations with replacement allows for repeated items
#pairs = list(itertools.combinations_with_replacement(regiddf, 2))
pairs = list(itertools.product(regiddf, repeat=2))
#print len(pairs)
#create a new dataframe with pairlists and summed data
pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
print pairlist.tail()
header_pairlist = ["origID","destID","flow"]
header_intflow = ["RegID", "RegID2", "regflow"]
dfflows = pd.DataFrame(columns=header_intflow)
print mergedcsv.index
print mergedcsv.dtypes
#mergedcsv = mergedcsv.select_dtypes(include=['int64'])
#print mergedcsv.columns
#mergedcsv.rename(columns = lambda x: int(x), inplace=True)
def flows():
pass
#def flows(mergedcsv, region_a, region_b):
def flows(mergedcsv, ward_lista, ward_listb):
"""Return the sum of all the cells in the row/column intersections
of ward_lista and ward_listb."""
mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
regionflows = mergedcsv.loc[ward_lista, ward_listb]
regionflowsum = regionflows.values.sum()
#grid = [ax, bx, regflowsuma, regflowsumb]
gridoutput = [ax, bx, regionflowsum]
print gridoutput
return regflowsuma
return regflowsumb
#print mergedcsv.index
#mergedcsv.columns = mergedcsv.columns.str.strip()
for ax, group_a in enumerate(idgroups):
ward_lista = map(int, group_a)
print ward_lista
for bx, group_b in enumerate(idgroups[ax:], start=ax):
ward_listb = map(int, group_b)
#print ward_listb
flow_ab = flows(mergedcsv, ward_lista, ward_listb)
#flow_ab = flows(mergedcsv, group_a, group_b)
這導致KeyError異常:
我試圖 '的[[189,197,198,201]]是在[列]無'同樣使用ward_lista = map(str,group_a)和map(int,group_a),但列出在dataframe.loc中找不到的對象。 這些列是混合數據類型,但包含應切片標籤的所有列都是int64類型。 我已經嘗試了許多圍繞數據類型的解決方案,但無濟於事。有什麼建議麼?
奧斯汀,謝謝。這是一個很大的幫助。我希望計算A到B和B到A的流量。我還沒有遇到流量功能,所以我們會研究這一點。 –
沒有流量功能。你必須使用你的計算來編寫它。 –
奧斯汀,你的建議真的很有幫助。添加流程功能非常合乎邏輯。我根據您的建議更新了代碼。我似乎有一個數據類型錯誤或索引錯誤。我已經嘗試了很多嘗試來解決這個問題,包括映射到int和選擇int64類型的所有列,但仍然有錯誤。你能發現任何明顯的錯誤嗎?非常感謝您的建議 –