Python: Split a string conditional on the elements of another list

So I have a pandas DataFrame that is structured as follows:
    In: df.head(1)
    Out:
    Individual  Employer                     EmployerState     BranchesState                 BranchesNr
    872570      (4210, 7463, 23130, 133752)  (MN, GA, NY, AZ)  (MN, AZ, GA, AZ, NY, AZ, AZ)  (0, 1, 0, 1, 0, 1, 0)
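For anyone who wants to reproduce the row above, here is a minimal sketch that rebuilds it by hand. It assumes that `Individual` is an ordinary column rather than the index, and that every other field is a plain string; neither is confirmed by the printed output.

```python
import pandas as pd

# One-row frame mirroring the df.head(1) output.
# Assumption: Individual is a regular column and the other fields are strings.
df = pd.DataFrame({
    'Individual': [872570],
    'Employer': ['(4210, 7463, 23130, 133752)'],
    'EmployerState': ['(MN, GA, NY, AZ)'],
    'BranchesState': ['(MN, AZ, GA, AZ, NY, AZ, AZ)'],
    'BranchesNr': ['(0, 1, 0, 1, 0, 1, 0)'],
})
print(df.head(1))
```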
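What I want to do is split up the information on the multiple employers and create a single record for each employer-individual pair, like this: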
    Individual  Employer  EmployerState  BranchesState  BranchesNr
    872570      4210      MN             MN, AZ         0, 1
    872570      7463      GA             GA, AZ         0, 1
    872570      23130     NY             NY, AZ         0, 1
    872570      133752    AZ             AZ             0
Currently, I am able to do this for the columns Individual, Employer and EmployerState with the code below:
    import pandas as pd

    rows = []  # collects one record per (individual, employer) pair
    for _, row in indv_sub.iterrows():
        # If there are multiple employers
        # Example:
        # Individual | Employer  =>  Individual | Employer
        # 123        | (XY, AB)      123        | XY
        #                            123        | AB
        if len(str(row['Employer']).split(',')) > 1:
            # split the individual record into as many employers as an individual has
            employers = row['Employer'].split(',')
            states = row['EmployerState'].split(',')
            for m, l in zip(employers, states):
                # strip surrounding spaces and parentheses from each piece
                rows.append([row['Individual'],
                             m.strip(' ()'),
                             l.strip(' ()'),
                             row['BranchesState']])
        else:
            # just add the single employer
            rows.append([row['Individual'],
                         str(row['Employer']).strip(' ()'),
                         str(row['EmployerState']).strip(' ()'),
                         row['BranchesState']])
    indv_relevant = pd.DataFrame(rows, columns=['Individual', 'Employer', 'EmployerState', 'BranchesState'])
    # convert_objects() was removed in pandas 1.0; convert the numeric columns explicitly
    for col in ('Individual', 'Employer'):
        indv_relevant[col] = pd.to_numeric(indv_relevant[col], errors='coerce')
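This works fine, but I cannot find a good way to split the BranchesState column. I added a BranchesNr field that indicates where the branches of the next employer start. So consider this example: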
    Employer          BranchesState                 BranchesNr
    (MN, GA, NY, AZ)  (MN, AZ, GA, AZ, NY, AZ, AZ)  (0, 1, 0, 1, 0, 1, 0)
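The first value is 0, followed by 1 and then 0 again, which indicates that all branches up to the second position belong to the first employer:
    row['BranchesState'].split(',')[:2]  # would be attributable to the first employer
Next come positions 3 to 4, which are attributable to the second employer, and so on. I am not sure how to implement this nicely. Any ideas or suggestions?
P.S.: The fields are strings, not tuples/lists. Also, 0,1,0 is only an example; some sequences are 0,1,2,0,1,0,1,2,3,4, etc.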
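The rule described above (each 0 in BranchesNr opens the next employer's group of branches) could be sketched roughly as follows. The helper names `parse_field` and `group_branches` are made up for illustration, and the sketch assumes that the two strings always contain the same number of elements.

```python
def parse_field(text):
    """Turn a string such as '(MN, AZ, GA)' into ['MN', 'AZ', 'GA']."""
    inner = text.strip().strip('()')
    return [part.strip() for part in inner.split(',') if part.strip()]


def group_branches(branches_state, branches_nr):
    """Group branch states by employer; a BranchesNr of 0 opens a new group."""
    groups = []
    for state, nr in zip(parse_field(branches_state), parse_field(branches_nr)):
        if int(nr) == 0 or not groups:
            groups.append([])  # counter restarted: next employer begins here
        groups[-1].append(state)
    return groups


groups = group_branches('(MN, AZ, GA, AZ, NY, AZ, AZ)', '(0, 1, 0, 1, 0, 1, 0)')
print(groups)
# [['MN', 'AZ'], ['GA', 'AZ'], ['NY', 'AZ'], ['AZ']]
```

The i-th group would then be attached to the i-th employer, for example by zipping `groups` with the split Employer and EmployerState values. Because only the restart at 0 is used, longer runs such as 0,1,2 or 0,1,2,3,4 are handled the same way.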
To include more variation in the data, here is a sample of 10 observations:
{u'BrnchOfLoc_FirmNr ':{1490:U'(0,0) ' 1498:U'(0, 0,0,1,0'), 1594:u'(0,0)', 1618:u'(0,0,0)', 1632:u'(0,0)', 1633:u '(0,0)', 1687:u'(0,0)', 1738:u'(0,0)', 1783:u'(0,0,1)', 1793:u '(0,0)'}, u'BrnchOfLoc_state':{1490:u'(CA,CA)', 1498:u'(CA,CA,CA,CA)', 1594:u' ,PA)', u'(FL,FL)', 1618:u'(CA,CA,CA)', 1632:u'(NY,NY)', 1633:u'(NH,NH)', 1687: 1738:u'(CA,CA)', 1783:u'(MS,MS,LA)', 1793:u'(NJ,NJ)'', u'CrntEmp_orgPK':{1490:u' (13572,144875)', 1498:u'(112059,137743)', 1594:u'(519,162200)', 1618:u'(23131,111532,113269)', 1632:u' (6627,118660)', 1633:u'(6413,131406)', 1687:u'(131587,142133)', 1738:u'(23131,105698)', 1783:u'(159778 ,160431)', 1793:u'(6413,128859)'},(CA,CA)',{'1490:u'(CA,CA)', 1498:u'(CA,CA)', 1594:u'(PA,PA)', 1618: CA,CA)', 1632:u'(NY,NY)', 1633:u'(MA,NH)', 1687:u'(FL,FL)', 1738:u' CA)', 1783:u'(MS,LA)', 1793:u'(MA,NJ)'', u'Info_indv1PK':{1490:u'731003', 1498:u'29443' , 1594:u'708024' , 1618:u'707057' , 1632:u'830502' , 1633:u'854101' , 1687:u'706344' , 1738:u'867229' , 1783:u'734227', 1793 :u'849856' }, 'NumberEmployer':{1490:2, 1498:2, 1594:2, 1618:3,1632 :2, 1633:2, 1687:2, 1738: 2, 1783:2, 1793:2}}
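Could you provide a smaller example that shows the exact output for a given input? It's not entirely clear to me how the branches are supposed to work, and a complete sample would help. Also, including the code that builds the example DataFrame would help people answer. – ASGM
I made the column names easier to interpret and extended the example. Does this help? – chizze
`df.head().to_dict('list')`? It would be good to see more variation in the data. – Alexander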
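For reference, a small sketch of what the `to_dict('list')` suggestion produces: a plain dict mapping each column name to a list of values, which can be pasted into a post and turned back into a frame. The two-column frame here is a shortened, made-up stand-in for the real data.

```python
import pandas as pd

# Shortened stand-in frame (assumption: the real frame has more columns and rows)
df = pd.DataFrame({'Individual': [872570], 'Employer': ['(4210, 7463)']})

sample = df.head().to_dict('list')  # {column name: list of values}
print(sample)

# Anyone answering can rebuild the frame from the pasted dict
rebuilt = pd.DataFrame(sample)
```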