JSON to CSV converter - Yelp dataset - Python

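I'm interested in data mining, and I want to open and work with Yelp's data. The Yelp data is in JSON format, and Yelp's website provides the code below to convert JSON to CSV. However, when I open the command line and enter the following: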

$ python json_to_csv_converter.py yelp_academic_dataset.json

I get an error message. Can you help me?

The code is:

# -*- coding: utf-8 -*- 
"""Convert the Yelp Dataset Challenge dataset from json format to csv. 
For more information on the Yelp Dataset Challenge please visit http://yelp.com/dataset_challenge 
""" 
import argparse 
import collections 
import csv 
import simplejson as json 


def read_and_write_file(json_file_path, csv_file_path, column_names):
    """Read in the json dataset file and write it out to a csv file, given the column names."""
    with open(csv_file_path, 'wb+') as fout:
        csv_file = csv.writer(fout)
        csv_file.writerow(list(column_names))
        with open(json_file_path) as fin:
            for line in fin:
                line_contents = json.loads(line)
                csv_file.writerow(get_row(line_contents, column_names))

def get_superset_of_column_names_from_file(json_file_path):
    """Read in the json dataset file and return the superset of column names."""
    column_names = set()
    with open(json_file_path) as fin:
        for line in fin:
            line_contents = json.loads(line)
            column_names.update(
                set(get_column_names(line_contents).keys())
            )
    return column_names

def get_column_names(line_contents, parent_key=''):
    """Return a list of flattened key names given a dict.
    Example:
        line_contents = {
            'a': {
                'b': 2,
                'c': 3,
            },
        }
        will return: ['a.b', 'a.c']
    These will be the column names for the eventual csv file.
    """
    column_names = []
    for k, v in line_contents.iteritems():
        column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
        if isinstance(v, collections.MutableMapping):
            column_names.extend(
                get_column_names(v, column_name).items()
            )
        else:
            column_names.append((column_name, v))
    return dict(column_names)

def get_nested_value(d, key):
    """Return a dictionary item given a dictionary `d` and a flattened key from `get_column_names`.

    Example:
        d = {
            'a': {
                'b': 2,
                'c': 3,
            },
        }
        key = 'a.b'
        will return: 2

    """
    if '.' not in key:
        if key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if base_key not in d:
        return None
    sub_dict = d[base_key]
    return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
            line_contents,
            column_name,
        )
        if isinstance(line_value, unicode):
            row.append('{0}'.format(line_value.encode('utf-8')))
        elif line_value is not None:
            row.append('{0}'.format(line_value))
        else:
            row.append('')
    return row

if __name__ == '__main__':
    """Convert a yelp dataset file from json to csv."""

    parser = argparse.ArgumentParser(
        description='Convert Yelp Dataset Challenge data from JSON format to CSV.',
    )

    parser.add_argument(
        'json_file',
        type=str,
        help='The json file to convert.',
    )

    args = parser.parse_args()

    json_file = args.json_file
    csv_file = '{0}.csv'.format(json_file.split('.json')[0])

    column_names = get_superset_of_column_names_from_file(json_file)
    read_and_write_file(json_file, csv_file, column_names)

The error I get on the command line:

Traceback (most recent call last):
  File "json_to_csv_converter.py", line 122, in <module>
    column_names=get_superset_of_column_names_from_file
  File "json_to_csv_converter.py", line 25, in get_superset_of_column_names_from_file
    for line in fin:
  File "C:\Users\Bengi\Appdata\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1102: character maps to <undefined>

2016-02-28 Bengi Koseoglu

You can check out this script: https://github.com/rajbdilip/json-to-csv-converter

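Judging from the error message, something appears to be wrong with your input file. It looks like json_to_csv_converter.py is treating the file as Windows-1252 encoded, but the file contains one or more invalid bytes, namely '\x9d', which is not a valid 1252 code point.

Check that your file is correctly encoded. My guess is that the file is UTF-8 encoded but, for some reason, is being processed as if it were Windows-1252. Did you edit the file?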
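
To make that guess concrete, here is a minimal sketch of how a UTF-8 file can produce exactly this kind of error. The specific character is an assumption chosen for illustration (the thread never identifies which character is at position 1102); any UTF-8 sequence containing the byte 0x9d would behave the same way.

```python
# Hypothetical example character: U+201D (right double quotation mark).
# Its UTF-8 encoding happens to end in the byte 0x9d.
RIGHT_DOUBLE_QUOTE = "\u201d"
utf8_bytes = RIGHT_DOUBLE_QUOTE.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9d'

# Decoding those bytes as Windows-1252 fails, because 0x9d has no
# character assigned in that code page.
error_message = None
try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# Decoding the same bytes as UTF-8 succeeds.
decoded = utf8_bytes.decode("utf-8")
print(decoded == RIGHT_DOUBLE_QUOTE)  # True
```

On Windows, `open()` without an `encoding` argument falls back to the locale's preferred encoding, which is often cp1252, so a UTF-8 file can hit this error even though nothing is wrong with the file itself.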

2016-02-28 11:12:42 mhawke

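I did not edit the json_to_csv_converter code. First, I got the code from https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py, copied and pasted it into a Python file, and saved it in the directory where python.exe is located. Second, I downloaded the Yelp dataset file and saved it in the same directory. Third, I opened the command line, went to the directory where Python is located, and entered the following line: python json_to_csv_converter.py yelp_academic_dataset.json, and got the error above. Do you think I did something wrong?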

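@BengiKoseoglu: I don't see anything that you did wrong. Perhaps it has something to do with the way you unpacked the dataset? I don't have access to the data, so I can't tell whether there is anything wrong with it. – mhawke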

I tried it on another computer and tried unpacking the file in a different way, but the error stays the same. So I really don't know what to do.

You have a file encoding problem. You should pass encoding='utf8' to the open() call that reads the json file, for example: with open(json_file_path, encoding='utf8') as fin:
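
As a rough, self-contained illustration of that suggestion, the sketch below reads a file with one JSON object per line using an explicit encoding. The helper name `read_json_lines` and the sample record are hypothetical, and the standard-library `json` module is used here instead of the script's `simplejson` so the example runs without extra packages.

```python
import json
import os
import tempfile


def read_json_lines(json_file_path):
    """Yield one parsed object per non-empty line, decoding the file as UTF-8."""
    # The explicit encoding avoids relying on the platform default
    # (often cp1252 on Windows).
    with open(json_file_path, encoding='utf8') as fin:
        for line in fin:
            if line.strip():
                yield json.loads(line)


# Demo with a made-up record that contains non-ASCII characters.
sample_name = 'Caf\u00e9 \u201cLuna\u201d'
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json')
    with open(sample_path, 'w', encoding='utf8') as fout:
        fout.write(json.dumps({'name': sample_name}, ensure_ascii=False) + '\n')
    records = list(read_json_lines(sample_path))

print(records[0]['name'] == sample_name)  # True
```

In the script from the question, the same change would apply to both `open(json_file_path)` calls, since the file is read once to collect column names and once to write the rows.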

2017-03-14 08:56:15

WinZip seems to corrupt it somehow. I worked around this by:

Extracting the tar file with 7-Zip.

Editing the script to force UTF-8 encoding, like this:

with open(json_file_path, encoding='utf8') as fin:
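
One further observation, offered as an assumption rather than something reported in this thread: the traceback shows Python 3.5, while the script uses Python 2 idioms (`dict.iteritems()`, the `unicode` type, and a CSV file opened in `'wb+'` mode), so it may fail at a later step once the encoding error is resolved. The sketch below shows what Python 3 equivalents of the flattening and writing steps could look like. The names `flatten` and `write_rows` are hypothetical, and unlike the original script this version holds all records in memory and sorts the column names.

```python
import collections.abc
import csv
import io


def flatten(record, parent_key=''):
    """Flatten nested mappings into {'a.b': value} form."""
    flat = {}
    for key, value in record.items():  # .items() replaces Python 2's .iteritems()
        column_name = '{0}.{1}'.format(parent_key, key) if parent_key else key
        if isinstance(value, collections.abc.MutableMapping):
            flat.update(flatten(value, column_name))
        else:
            flat[column_name] = value
    return flat


def write_rows(records, fout):
    """Write flattened records as CSV to an open text-mode file object."""
    flat_records = [flatten(record) for record in records]
    column_names = sorted(set().union(*(flat.keys() for flat in flat_records)))
    writer = csv.writer(fout)
    writer.writerow(column_names)
    for flat in flat_records:
        # Missing or None values become empty cells; str() replaces the
        # Python 2 unicode/encode handling.
        writer.writerow(
            ['' if flat.get(name) is None else str(flat[name]) for name in column_names]
        )
    return column_names


# Demo on made-up records, written to an in-memory buffer.
buffer = io.StringIO(newline='')
columns = write_rows([{'a': {'b': 2, 'c': 3}}, {'a': {'b': 5}, 'd': 'x'}], buffer)
print(buffer.getvalue())
```

When writing to a real file in Python 3, the `csv` module expects a text-mode file opened with `newline=''`, for example `open(csv_file_path, 'w', newline='', encoding='utf8')`.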

2017-05-11 11:30:08

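Can you explain in a bit more detail? In particular, can you explain "the problem you have" (what problem?) and "extract in the right way" (what, exactly, is the right way)? – EJoshuaS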

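@EJoshuaS, I got the same Unicode error mentioned above. Then I found that I had not extracted the tar file the right way. I could not extract the tar file using WinZip. When I used 7-Zip, it worked. When you extract the tar file, you should see a number of json files, which are described at [link](https://www.yelp.com/dataset_challenge/dataset). After that, when you open a json file for reading, if you use 'with open(json_file_path, encoding='utf8') as fin:' it will work :)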
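
As an alternative to a GUI archiver, the extraction step can also be done from Python with the standard-library `tarfile` module. This is only a sketch: the helper name `extract_tar` is hypothetical, the demo builds a tiny stand-in archive instead of using the real Yelp download, and whether this avoids the WinZip problem described above has not been verified in this thread.

```python
import io
import os
import tarfile
import tempfile


def extract_tar(archive_path, destination_dir):
    """Extract every member of a tar archive and return the member names."""
    os.makedirs(destination_dir, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        # Only extract archives from a source you trust.
        archive.extractall(destination_dir)
    return names


# Demo: build a small archive containing one UTF-8 json file, then extract it.
payload = '{"name": "caf\u00e9"}\n'.encode('utf8')
with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, 'sample.tar')
    with tarfile.open(archive_path, 'w') as archive:
        info = tarfile.TarInfo(name='sample.json')
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    out_dir = os.path.join(tmp_dir, 'out')
    extracted = extract_tar(archive_path, out_dir)
    with open(os.path.join(out_dir, 'sample.json'), 'rb') as fin:
        round_trip = fin.read()

print(extracted)              # ['sample.json']
print(round_trip == payload)  # True
```

Extraction copies the bytes unchanged, so the json files stay UTF-8 encoded and still need `encoding='utf8'` when opened on Windows.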

Could you edit this into the answer (rather than leaving it only in a comment)? – EJoshuaS
