2016-06-08 89 views
7

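How can I build a JSON file with nested records from a flat data table?

I am looking for a Python technique to build a nested JSON file from a flat table in a pandas DataFrame. For example, how could a pandas DataFrame table such as: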

   teamname  member  firstname  lastname  orgname  phone         mobile
0  1         0       John       Doe       Anon     916-555-1234
1  1         1       Jane       Doe       Anon     916-555-4321  916-555-7890
2  2         0       Mickey     Moose     Moosers  916-555-0000  916-555-1111
3  2         1       Minny      Moose     Moosers  916-555-2222

be taken and exported to a JSON file that looks like this:

{
    "teams": [
        {
            "teamname": "1",
            "members": [
                {
                    "firstname": "John",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-1234",
                    "mobile": ""
                },
                {
                    "firstname": "Jane",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-4321",
                    "mobile": "916-555-7890"
                }
            ]
        },
        {
            "teamname": "2",
            "members": [
                {
                    "firstname": "Mickey",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-0000",
                    "mobile": "916-555-1111"
                },
                {
                    "firstname": "Minny",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-2222",
                    "mobile": ""
                }
            ]
        }
    ]
}

I have tried doing this by creating a dictionary of dictionaries and dumping it to JSON. This is my current code:

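import json
import pandas

# inputExcel (the path to the workbook) is defined elsewhere
data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')
columnList = list(data.columns)  # column headings; not shown in the original snippet, assumed here
memberDictTuple = []

for index, row in data.iterrows():
    dataRow = row
    # Map every column after the team and member columns to its value in this row
    rowDict = dict(zip(columnList[2:], dataRow[2:]))

    teamRowDict = {columnList[0]:int(dataRow[0])}

    memberId = tuple(row[1:2])
    memberId = memberId[0]

    teamName = tuple(row[0:1])
    teamName = teamName[0]

    # Each row becomes its own {team: {member: record}} dictionary
    memberDict1 = {int(memberId):rowDict}
    memberDict2 = {int(teamName):memberDict1}

    memberDictTuple.append(memberDict2)

memberDictTuple = tuple(memberDictTuple)
formattedJson = json.dumps(memberDictTuple, indent = 4, sort_keys = True)
print(formattedJson)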

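This produces the following output. Each item is nested at the correct level under "teamname" 1 or 2, but records that share the same teamname should be nested together. How can I fix this so that teamname 1 and teamname 2 each have 2 records nested inside them?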

[ 
    { 
     "1": { 
      "0": { 
       "email": "[email protected]", 
       "firstname": "John", 
       "lastname": "Doe", 
       "mobile": "none", 
       "orgname": "Anon", 
       "phone": "916-555-1234" 
      } 
     } 
    }, 
    { 
     "1": { 
      "1": { 
       "email": "[email protected]", 
       "firstname": "Jane", 
       "lastname": "Doe", 
       "mobile": "916-555-7890", 
       "orgname": "Anon", 
       "phone": "916-555-4321" 
      } 
     } 
    }, 
    { 
     "2": { 
      "0": { 
       "email": "[email protected]", 
       "firstname": "Mickey", 
       "lastname": "Moose", 
       "mobile": "916-555-1111", 
       "orgname": "Moosers", 
       "phone": "916-555-0000" 
      } 
     } 
    }, 
    { 
     "2": { 
      "1": { 
       "email": "[email protected]", 
       "firstname": "Minny", 
       "lastname": "Moose", 
       "mobile": "none", 
       "orgname": "Moosers", 
       "phone": "916-555-2222" 
      } 
     } 
    } 
] 
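Unfortunately, questions about whether a high-level approach to a problem is good/correct/possible/etc. are off-topic here. That said, I think the dict approach *looks* promising. You should use your other question to work out the remaining details, but remember to update the error message you are getting *and* the code you are using so that they stay in sync (otherwise your problem is not reproducible). –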

I also tried adapting this answer: http://stackoverflow.com/questions/24374062/pandas-groupby-to-nested-json, but still no dice. – spaine

Answers

1

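Here is a solution that works and creates the desired JSON format. First, I grouped the DataFrame by the appropriate columns. Then, instead of creating a dictionary for each column heading/record pair (and losing the data order), I created the pairs as a list of tuples and converted that list to an OrderedDict. Another OrderedDict was created for the two columns that everything else is grouped by. To produce the correct format, the JSON conversion needs precise layering between lists and ordered dicts. Also note that when dumping to JSON, sort_keys must be set to False, otherwise all of the OrderedDicts will be rearranged into alphabetical order.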
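The effect of sort_keys on an OrderedDict can be seen in isolation with a small sketch. The sample team record below is made up for illustration and is not taken from the spreadsheet.

```python
import json
from collections import OrderedDict

# Hypothetical record, used only to show key ordering
team = OrderedDict([("teamname", "1"), ("members", ["John", "Jane"])])

# Insertion order is kept when sort_keys is False (the default)
kept = json.dumps(team, sort_keys=False)

# Keys are alphabetized when sort_keys is True
alphabetized = json.dumps(team, sort_keys=True)

print(kept)          # {"teamname": "1", "members": ["John", "Jane"]}
print(alphabetized)  # {"members": ["John", "Jane"], "teamname": "1"}
```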

import pandas
import json
from collections import OrderedDict

inputExcel = 'E:\\teams.xlsx'
exportJson = 'E:\\teams.json'

# Note: newer pandas releases name this parameter sheet_name instead of sheetname
data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

# Create a tuple of column headings for later use, matching them with column data
columnList = tuple(str(col) for col in data.columns)

# Group the DataFrame by the 'teamname' and 'members' columns
grouped = data.groupby(['teamname', 'members']).first()

# Get the distinct team names from the first level of the group index
tm = grouped.index.levels[0]

# Create a list to add team records to at the end of each pass of the outer 'for' loop
teamsList = []

for teamN in tm:
    teamN = int(teamN)  # added to prevent "TypeError: 1 is not JSON serializable"
    tempList = []  # temporary list to add each member record to
    for index, row in grouped.iterrows():
        dataRow = row
        # Select the row of the grouped DataFrame if its index matches the team number
        if index[0] == teamN:
            # To have the JSON fields come out in the same order as the columns,
            # first create a list of tuples, then convert it to an OrderedDict
            rowDict = [
                (columnList[2], dataRow[0]),
                (columnList[3], dataRow[1]),
                (columnList[4], dataRow[2]),
                (columnList[5], dataRow[3]),
                (columnList[6], dataRow[4]),
                (columnList[7], dataRow[5]),
            ]
            rowDict = OrderedDict(rowDict)
            tempList.append(rowDict)
    # Create another OrderedDict to keep 'teamname' and the list of members in order
    t = OrderedDict([('teamname', str(teamN)), ('members', tempList)])

    # Append the OrderedDict to the empty list of teams created earlier
    teamsList.append(t)

# Create a final dictionary with a single item: the list of teams
teams = {"teams": teamsList}

# Dump to JSON format
# sort_keys MUST be set to False, or all dictionaries will be alphabetized
formattedJson = json.dumps(teams, indent = 1, sort_keys = False)
# Missing values in pandas DataFrames are NaN, which json.dumps writes as a bare NaN;
# it must be replaced with "NULL" for the file to be valid JSON
formattedJson = formattedJson.replace("NaN", '"NULL"')
print(formattedJson)

# Export to JSON file
with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)

print("\n\nExport to JSON Complete")
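The same list-and-OrderedDict layering can also be sketched without the nested loops. The version below is only an illustration under stated assumptions: the `rows` list is hypothetical sample data standing in for the spreadsheet (with pandas, a similar list of dictionaries could be obtained from `data.to_dict('records')`), and `MEMBER_FIELDS` and `build_teams` are names invented for this sketch.

```python
import json
from collections import OrderedDict

# Hypothetical sample rows standing in for the flat table
rows = [
    {"teamname": 1, "members": 0, "firstname": "John", "lastname": "Doe",
     "orgname": "Anon", "phone": "916-555-1234", "mobile": ""},
    {"teamname": 1, "members": 1, "firstname": "Jane", "lastname": "Doe",
     "orgname": "Anon", "phone": "916-555-4321", "mobile": "916-555-7890"},
    {"teamname": 2, "members": 0, "firstname": "Mickey", "lastname": "Moose",
     "orgname": "Moosers", "phone": "916-555-0000", "mobile": "916-555-1111"},
    {"teamname": 2, "members": 1, "firstname": "Minny", "lastname": "Moose",
     "orgname": "Moosers", "phone": "916-555-2222", "mobile": ""},
]

# Fields copied into each member record, in output order (assumed column names)
MEMBER_FIELDS = ["firstname", "lastname", "orgname", "phone", "mobile"]

def build_teams(rows):
    """Collect member records under their team, keeping first-seen team order."""
    teams = OrderedDict()
    for row in rows:
        member = OrderedDict((field, row[field]) for field in MEMBER_FIELDS)
        teams.setdefault(str(row["teamname"]), []).append(member)
    return {
        "teams": [
            OrderedDict([("teamname", name), ("members", members)])
            for name, members in teams.items()
        ]
    }

nested = build_teams(rows)
print(json.dumps(nested, indent=4))
```

Keying the intermediate dictionary by team name is what lets rows with the same teamname land in one "members" list rather than in separate top-level entries.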
0

With some input from @root, I took a different approach and came up with the following code, which seems to get most of the way there:

import pandas
import json
from collections import defaultdict

inputExcel = 'E:\\teamsMM.xlsx'
exportJson = 'E:\\teamsMM.json'

data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

grouped = data.groupby(['teamname', 'members']).first()

results = defaultdict(lambda: defaultdict(dict))

# Walk the (teamname, members) index of each row and store the row at the innermost level
for t in grouped.itertuples():
    for i, key in enumerate(t.Index):
        if i == 0:
            nested = results[key]
        elif i == len(t.Index) - 1:
            nested[key] = t
        else:
            nested = nested[key]

formattedJson = json.dumps(results, indent = 4)

# Wrap the dumped dictionary in the outer "teams" list by hand
formattedJson = '{\n"teams": [\n' + formattedJson + '\n]\n }'

with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)

The resulting JSON file looks like this:

{ 
"teams": [ 
{ 
    "1": { 
     "0": [ 
      [ 
       1, 
       0 
      ], 
      "John", 
      "Doe", 
      "Anon", 
      "916-555-1234", 
      "none", 
      "[email protected]" 
     ], 
     "1": [ 
      [ 
       1, 
       1 
      ], 
      "Jane", 
      "Doe", 
      "Anon", 
      "916-555-4321", 
      "916-555-7890", 
      "[email protected]" 
     ] 
    }, 
    "2": { 
     "0": [ 
      [ 
       2, 
       0 
      ], 
      "Mickey", 
      "Moose", 
      "Moosers", 
      "916-555-0000", 
      "916-555-1111", 
      "[email protected]" 
     ], 
     "1": [ 
      [ 
       2, 
       1 
      ], 
      "Minny", 
      "Moose", 
      "Moosers", 
      "916-555-2222", 
      "none", 
      "[email protected]" 
     ] 
    } 
} 
] 
} 

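This format is very close to the desired end product. The remaining issues are: removing the redundant array [1, 0] that appears above each first name, and making the headings of each nest "teamname": "1", "members": rather than "1": "0":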

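Also, I don't know why each record is being stripped of its headings in the conversion. For example, why is the dictionary entry "firstname": "John" exported as just "John"?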
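One likely explanation is that `itertuples()` yields namedtuples, and the `json` module serializes any tuple, named or not, as a plain JSON array, so the field names are dropped. The sketch below reproduces this with the standard library only; the `Row` type is a hypothetical stand-in for the tuples pandas produces, not the actual pandas class.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one row yielded by itertuples()
Row = namedtuple("Row", ["Index", "firstname", "lastname"])
t = Row(Index=(1, 0), firstname="John", lastname="Doe")

# A namedtuple is still a tuple, so it is dumped as an array without field names
as_array = json.dumps(t)

# Converting to a dictionary first keeps the field names;
# dropping "Index" also removes the redundant [1, 0] entry
record = t._asdict()
record.pop("Index")
as_object = json.dumps(record)

print(as_array)   # [[1, 0], "John", "Doe"]
print(as_object)  # {"firstname": "John", "lastname": "Doe"}
```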

Note that in order for this code to work, it was necessary to upgrade pandas from 0.16.1 to 0.18.1. – spaine
