Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

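I am reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1,106,012 rows and 171 columns. Other, smaller files read without any error and finish successfully, but this 1 GB file fails every time at exactly line 1106011, which is the second-to-last row of the file. I can remove that row manually, but that is not a solution: I have hundreds of other files of the same size and cannot fix every row by hand. Can anyone help me with this?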

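import pandas as pd

def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    # Read one chunk: skip the first `eachRow` lines, then parse `size_of_chunk` rows.
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,  # pandas >= 1.3 uses on_bad_lines='skip' instead
                     sep=',')
                     # engine='python'
                     # quoting=csv.QUOTE_NONE
                     # encoding='utf-8'

    df.columns = header_row
    df = df.drop_duplicates(keep='first')  # drop duplicate rows within the chunk
    df = df.apply(lambda x: x.astype(str).str.lower())  # lowercase every value

    return df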

I then call this function inside a loop, which works fine.

huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
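As a side note on the chunked loop above: calling read_csv repeatedly with skiprows and nrows makes pandas scan the file from the start on every call. A minimal sketch of the same per-chunk processing using the chunksize parameter is shown here. The function name iter_csv_chunks and the in-memory sample data are hypothetical, and the sketch assumes the header row has already been removed from the data. It does not by itself fix the tokenizing error; it only avoids re-reading the file for each chunk.

```python
import io

import pandas as pd


def iter_csv_chunks(source, header_row, size_of_chunk):
    """Yield cleaned DataFrames of at most `size_of_chunk` rows each."""
    reader = pd.read_csv(
        source,
        header=None,              # assumption: the data has no header line
        names=header_row,         # column names supplied by the caller
        chunksize=size_of_chunk,  # returns an iterator instead of one DataFrame
        dtype=str,                # keep every value as text
    )
    for chunk in reader:
        chunk = chunk.drop_duplicates(keep="first")
        yield chunk.apply(lambda col: col.astype(str).str.lower())


# Hypothetical sample standing in for the real 1 GB file.
sample = io.StringIO("A,1\nB,2\nB,2\nC,3\nD,4\n")
chunks = list(iter_csv_chunks(sample, ["name", "value"], size_of_chunk=2))
for chunk in chunks:
    print(chunk)
```

With a real file, the first argument would be the file path instead of the StringIO object. Note that duplicates are still only dropped within each chunk, as in the original function.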

I have read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654, and others, and tried adding read_csv parameters such as

engine='python'

quoting=csv.QUOTE_NONE  # hangs, and even freezes the Python shell; I don't know why

encoding='utf-8'

but none of them worked; it still throws the following error.

Error:

Traceback (most recent call last): 
    File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module> 
    huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H) 
    File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql 
    sep=',') 
    File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f 
    return _read(filepath_or_buffer, kwds) 
    File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read 
    data = parser.read(nrows) 
    File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read 
    ret = self._engine.read(nrows) 
    File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read 
    data = self._reader.read(nrows) 
    File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885) 
    File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884) 
    File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755) 
    File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765) 
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011 
>>>
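For context, "EOF inside string" generally means the C parser opened a quoted field and reached the end of the input before finding the closing quote. The sketch below reproduces the error on a tiny, made-up CSV string and shows one way to look for suspicious lines by counting quote characters. The helper find_unbalanced_quote_lines and the sample data are hypothetical, and whether an unbalanced quote is really the cause in the 1 GB file is an assumption that has not been verified here.

```python
import csv
import io

import pandas as pd


def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Return 1-based numbers of physical lines with an odd count of quotechar."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Made-up data: the last row opens a quote that is never closed.
bad_csv = 'a,b\n1,x\n2,"unterminated\n'

try:
    pd.read_csv(io.StringIO(bad_csv))
except pd.errors.ParserError as exc:
    print("parser error:", exc)

suspects = find_unbalanced_quote_lines(bad_csv.splitlines())
print("lines with an odd number of quotes:", suspects)

# Treating quote characters as ordinary text lets the same data parse.
df = pd.read_csv(io.StringIO(bad_csv), quoting=csv.QUOTE_NONE)
print(df)
```

An open file object can be passed to find_unbalanced_quote_lines in place of the list, so the file is scanned line by line. The check is only a heuristic: a legitimate quoted field that spans several physical lines also produces odd counts on its first and last line.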

Source

2017-10-19 Wcan

Can you show us a valid row and the invalid one (the second-to-last row that you removed)? – Indent

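I can't paste it here because it has 171 columns. It looks like a normal row, but when pandas reads it, it throws the error mentioned above at the second-to-last row of the file. – Wcan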

If you are on Linux, try removing all non-printable characters, then try loading the file again.

tr -dc '[:print:]\n' < file > newfile

Source

2017-10-19 09:03:51 Indent

I am on Windows. – Wcan

Can I still do this? – Wcan

https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python (you can try this solution) – Indent
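Since tr is not available on a stock Windows install, the same filtering can be sketched in Python. The example below is a rough equivalent of the tr command suggested in this thread, assuming an ASCII-only notion of "printable" (codes 32 to 126 plus newline), so it also drops tabs, carriage returns, and any non-ASCII text. The function names and the sample string are hypothetical.

```python
import io


def strip_non_printable(text):
    """Keep newlines and printable ASCII (codes 32-126); drop everything else."""
    return "".join(ch for ch in text if ch == "\n" or 32 <= ord(ch) <= 126)


def clean_stream(src, dst):
    """Copy src to dst line by line, removing non-printable characters."""
    for line in src:
        dst.write(strip_non_printable(line))


# Made-up input containing a NUL byte, a Ctrl-Z (0x1A), and a carriage return.
dirty = io.StringIO("a,b\x00\n1,\x1a2\r\n")
out = io.StringIO()
clean_stream(dirty, out)
print(repr(out.getvalue()))
```

For real files, src and dst would be file objects opened in text mode, for example with an explicit encoding and errors="replace" so that undecodable bytes do not stop the copy. Because the data is processed one line at a time, the 1 GB file does not need to fit in memory.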
