2017-09-13 55 views

Python - pandas - How to remove null values from to_json after a DataFrame merge

I'm building a process to "outer join" two csv files and export the result as a JSON object.

import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names = ['main_index','attr_one','attr_two']) 
secondcsv = pandas.read_csv('file2.csv', names = ['main_index','attr_three','attr_four']) 

# merge them 
output = firstcsv.merge(secondcsv, on='main_index', how='outer') 

jsonresult = output.to_json(orient='records') 
print(jsonresult) 

The two CSV files look like this:

file1.csv: 
1, aurelion, sol 
2, lee, sin 
3, cute, teemo 

file2.csv: 
1, midlane, mage 
2, jungler, melee 

And I would like the resulting JSON to look like this:

[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"}, 
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"}, 
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}] 

Instead, this is what I get for the row with main_index = 3:

{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}] 

Null values are automatically added to the output. I would like to remove them. I looked around, but I couldn't find a proper way to do it.

Hope someone can help me!

Answers

1

Because we're using a DataFrame, pandas will "fill in" the missing values with NaN, i.e.

>>> print(output)
   main_index   attr_one attr_two attr_three attr_four
0           1   aurelion      sol    midlane      mage
1           2        lee      sin    jungler     melee
2           3       cute    teemo        NaN       NaN

I can't see any option in the pandas to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
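A quick check of what to_json writes for a missing value, and what json.loads makes of it, may help here. This is only a minimal sketch with a made-up one-column frame (the names `frame`, `a` and `kept` are illustrative and not from the question):

```python
import json

import pandas

# illustrative frame: one real value and one missing value
frame = pandas.DataFrame({'a': [1.0, float('nan')]})

text = frame.to_json(orient='records')  # NaN is written as JSON null
rows = json.loads(text)                 # JSON null is read back as None

# pandas.notnull treats None as missing, so it can filter the parsed rows
kept = [{k: v for k, v in row.items() if pandas.notnull(v)} for row in rows]
```

After parsing, the missing value is a plain None, so filtering happens on ordinary Python dictionaries rather than inside pandas.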

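So the approach I came up with involves rebuilding the JSON string. It probably isn't very performant for large datasets with millions of rows (but there are fewer than 200 champions in League of Legends, so it shouldn't be a huge problem!)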

from collections import OrderedDict 
import json 

jsonresult = output.to_json(orient='records') 
# read the json string to get a list of dictionaries 
rows = json.loads(jsonresult) 

# new_rows = [ 
#  # rebuild the dictionary for each row, only including non-null values 
#  {key: val for key, val in row.items() if pandas.notnull(val)} 
#  for row in rows 
# ] 

# to maintain column order, use an OrderedDict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]

new_json_output = json.dumps(new_rows) 

And you'll find that new_json_output has dropped all keys that have NaN values, and keeps the order:

>>> print(new_json_output) 
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"}, 
{"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"}, 
{"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}] 
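On Python 3.7 or later, where plain dicts preserve insertion order, the same idea can be sketched as a small helper. Treat this as a sketch under assumptions, not a definitive implementation: `records_without_nulls` is a hypothetical name (not a pandas function), the `io.StringIO` objects stand in for file1.csv and file2.csv, and `skipinitialspace=True` is an added assumption to strip the space after each comma, which the output above still shows.

```python
import io
import json

import pandas


def records_without_nulls(frame):
    """Return the frame as a list of dicts, omitting keys whose value is null.

    Hypothetical helper for illustration; not part of pandas.
    """
    # to_json writes NaN as null; json.loads reads null back as None
    rows = json.loads(frame.to_json(orient='records'))
    return [
        {key: val for key, val in row.items() if val is not None}
        for row in rows
    ]


# in-memory stand-ins for file1.csv and file2.csv (assumption for the demo)
file1 = io.StringIO("1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n")
file2 = io.StringIO("1, midlane, mage\n2, jungler, melee\n")

firstcsv = pandas.read_csv(file1, names=['main_index', 'attr_one', 'attr_two'],
                           skipinitialspace=True)
secondcsv = pandas.read_csv(file2, names=['main_index', 'attr_three', 'attr_four'],
                            skipinitialspace=True)

output = firstcsv.merge(secondcsv, on='main_index', how='outer')
new_rows = records_without_nulls(output)
new_json_output = json.dumps(new_rows)
```

The keys come out in column order because the JSON string produced by to_json lists them in that order and json.loads keeps it. Like the version above, this still round-trips through a JSON string, so the same performance caveat applies to very large frames.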

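This works, but I lose the order of the elements (say, the order I specified with the reindex_axis method). I guess I need to use some kind of OrderedDict to keep the ordering... – Mik1893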


Updated to maintain order – Hazzles


I just found it yesterday evening... but thanks a lot for the help! – Mik1893