JSON objects inside a pandas DataFrame

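I have JSON objects in a column of a pandas DataFrame that I want to split out into other columns. In the DataFrame, each JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the cell can even be null. I have written some code, shown below, which does what I want. The column names are built from two components: the first is a key in the dictionaries, and the second is a substring of the value stored under "key" in the dictionary.

This code works fine, but it is very slow when run on a big DataFrame. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something that is not sensible, efficient, or pythonic. I'm still a relative beginner. Thanks heaps.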

# Import libraries
import json
from io import StringIO  # Passing a literal JSON string to pd.read_json is deprecated since pandas 2.1

import pandas as pd
from IPython.display import display  # Used to display df's nicely in jupyter notebook.

# Set some display options
pd.set_option('display.max_colwidth', 150)

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({
    'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
    'ColB': {
        0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        3: '[]',
        4: None,
    },
})
display(df)

# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i, df.columns.get_loc("ColB")]) and len(df.iloc[i, df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(StringIO(df.iloc[i, df.columns.get_loc("ColB")]))
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        dfTemp = pd.concat([dfTemp, y])

# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df, dfTemp, how='left', left_index=True, right_index=True)

print("Processed df:")
display(df)

2017-08-15 Michael

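Just a small thing: you could replace your loop with 'for i, col_b in enumerate(df.iloc[:, df.columns.get_loc("ColB")]):' and change the references to that entry accordingly, to improve readability. – Nyps

Thanks! That would certainly make it more concise and readable. – Michael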

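First, a general piece of advice about pandas: if you find yourself iterating over the rows of a DataFrame, you are most likely doing it wrong.

With this in mind, we can rewrite your current procedure using the pandas apply method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):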

from io import StringIO  # Passing a literal JSON string to pd.read_json is deprecated since pandas 2.1

def do_the_thing(row):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(StringIO(row))
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]

        # We don't need to re-index:
        # y.index = [df.index[i]]
        # We also don't need to append to a temp df; just return the single row as a Series
        return y.iloc[0]
    else:
        # An explicit dtype avoids the empty-Series warning in older pandas versions
        return pd.Series(dtype=object)

df2 = df.merge(df.ColB.apply(do_the_thing), how='left', left_index=True, right_index=True)

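Note that this returns exactly the same result as before; we haven't changed the logic. The apply method takes care of the index, so the merge works without any extra re-indexing.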
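To make the index handling concrete, here is a minimal sketch on toy data. The Series `s`, its index labels, and the helper `to_fields` are hypothetical and only for illustration. When the function passed to `Series.apply` returns a Series, pandas assembles the results into a DataFrame that keeps the original index, which is why no manual re-indexing is needed before the merge.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, so the alignment is visible
s = pd.Series(["a=1", "b=2", None], index=[10, 20, 30])

def to_fields(cell):
    # Return one Series per cell; apply() combines them into a DataFrame
    if pd.isnull(cell):
        return pd.Series(dtype=object)
    name, value = cell.split("=")
    return pd.Series({name: value})

expanded = s.apply(to_fields)
print(expanded)  # rows keep the labels 10, 20, 30; columns are the union of all names
```

Cells that produce an empty Series simply end up as rows of missing values.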

I believe that answers your question in terms of speeding it up and making it a bit more idiomatic.
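If more speed is needed, one further option (not part of the answer above, and not benchmarked here) is to skip the per-row DataFrame entirely and parse each cell with the standard `json` module. The sketch below rebuilds the example data from the question; the helper name `flatten_cell` is an assumption made for illustration. Unlike `pd.read_json`, this keeps the values as strings.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        '[]',
        None,
    ],
})

def flatten_cell(cell):
    """Turn one JSON string into a flat dict such as {'valA_1': '8', 'valB_1': '18'}."""
    if pd.isnull(cell):
        return {}
    flat = {}
    for item in json.loads(cell):
        suffix = item["key"].split("=")[-1]  # text after the last "="
        for name, value in item.items():
            if name != "key":
                flat[f"{name}_{suffix}"] = value
    return flat

# One dict per row; pandas fills missing keys with NaN
wide = pd.DataFrame([flatten_cell(c) for c in df["ColB"]], index=df.index)
result = df.join(wide)
print(result)
```

If numeric columns are wanted, `pd.to_numeric` could be applied to the new columns afterwards.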

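I do think you should consider, however, what you want to do with this data structure, and whether there is a better way to structure what you are doing.

Given that ColB can be of arbitrary length, you will end up with a DataFrame with an arbitrary number of columns. That is going to cause you pain when you come to access these values, whatever the purpose.

Are all of the entries in ColB important? Could you get away with keeping just the first one? Do you need to know the index of a certain valA value?

These are questions you should ask yourself, and then decide on a structure that lets you do whatever analysis you need without having to check a bunch of arbitrary things.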

2017-08-15 16:01:01

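Thank you very much for the comprehensive response, much appreciated! Your code is simpler, better, and easier to reuse. I implemented your suggestion and it cut the execution time by 20%. Thanks for the other suggestions too. I agree that my overall approach isn't great. One possibility is to create a new DataFrame from the column, with a new column specifying the "key" value. So rather than adding a new set of columns for each key value, I would add a new set of rows. I'll try that next, if I can figure out how to do it. :-) – Michael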
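The row-per-entry layout described in the comment above could be sketched as follows. This is one possible approach under stated assumptions, not something taken from the thread: `parse_cell` and `long_df` are hypothetical names, and ColA is carried along as the identifier linking each new row back to its parent row.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        '[]',
        None,
    ],
})

def parse_cell(cell):
    # Null cells are treated like empty arrays
    return [] if pd.isnull(cell) else json.loads(cell)

# One output row per dictionary, tagged with the ColA value of its parent row
records = [
    {"ColA": col_a, **item}
    for col_a, cell in zip(df["ColA"], df["ColB"])
    for item in parse_cell(cell)
]
long_df = pd.DataFrame(records)

# Keep only the part of "key" after the last "="
long_df["key"] = long_df["key"].str.split("=").str[-1]
print(long_df)
```

The column set is now fixed by the dictionary keys (valA, valB, valC) rather than growing with each new key value. Parent rows whose ColB is empty or null produce no rows here; a left merge from `df[["ColA"]]` would bring them back if they are needed.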
