2015-12-02

Encoding problem when reading CSV files with Python

I've hit a roadblock while trying to read CSV files with Python.

UPDATE: If you just want to skip the offending characters or errors, you can open the file like this:

with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file: 
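
To illustrate what errors="ignore" does on read, here is a minimal self-contained sketch. The file name sample.csv and its contents are made up for the example; the point is only that bytes which are not valid UTF-8 are silently dropped instead of raising an error.

```python
import csv
import os
import tempfile

# Hypothetical sample: one CSV row containing a byte (0xE9) that is not valid UTF-8.
raw = b'1,"caf\xe9 au lait"\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # errors="ignore" drops undecodable bytes instead of raising UnicodeDecodeError.
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as data_file:
        rows = list(csv.reader(data_file))

print(rows)
```

Keep in mind that this loses data without any warning. Passing errors="replace" instead substitutes U+FFFD for each undecodable byte, which at least leaves a visible marker.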

This is what I have tried so far.

import csv
import os

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print(row)

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined> 

I have also tried

with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file: 

Error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined> 

Now, if I just print data_file, it reports the encoding as cp1252, but if I try

with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file: 

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined> 

I have also tried the recommended package.

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined> 

The line I am trying to parse is:

2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT @WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None 

Any ideas or help would be appreciated.

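CP1252, according to Google, is a Windows character encoding. What is your environment, and where do the files come from? For example, if you open the CSV file in nano, does it say it is in DOS format? – Ogaday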

I don't understand what you mean by opening the file in nano... I'm on a Windows machine. – user3271518

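Oh, OK. I thought you might be on Unix. I've had trouble parsing DOS-format files on Linux before and thought this might be a similar problem. Nano is a common terminal text editor on Linux systems. – Ogaday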

Answers

I would use csvkit, which automatically detects the appropriate encoding and decoding. For example:

import csvkit 
reader = csvkit.reader(data_file) 
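
If installing a package is not an option, a rough standard-library approximation is to try a short list of candidate encodings in order. This is only a sketch and not how csvkit works internally; the function name read_csv_rows, the candidate list, and the sample file are all assumptions made for illustration.

```python
import csv
import os
import tempfile

def read_csv_rows(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (rows, encoding) using the first candidate encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding, newline="") as data_file:
                return list(csv.reader(data_file)), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    # Only reachable if the candidate list contains no catch-all such as latin-1.
    raise ValueError("none of the candidate encodings could decode " + path)

# Hypothetical sample file written as cp1252, so it is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write("id,text\r\n1,caf\u00e9\r\n".encode("cp1252"))
    rows, used = read_csv_rows(path)

print(used)
```

The order of candidates matters: a strict encoding such as UTF-8 should come first, because single-byte encodings like latin-1 accept any byte sequence and will never fail, even when the result is wrong.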

As worked out in chat, the solution is:

import csv
import os

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                # Drop every non-ASCII character so the console can always print the row.
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)
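
The error in the question is a UnicodeEncodeError, so it is most likely raised when print encodes the text for the Windows console rather than when the file is read, which would explain why stripping non-ASCII characters before printing makes it go away. A less lossy variant, sketched below, replaces only the characters the output encoding cannot represent. The helper name console_safe is made up for this example, and cp437 is used only as an assumed example of a legacy console code page.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# U+2026 (horizontal ellipsis) is the character named in the second error message.
sample = "over 8 feet\u2026"

# Encoding to a code page that lacks the character fails, just like print does.
try:
    sample.encode("cp437")
    failed = False
except UnicodeEncodeError:
    failed = True

safe = console_safe(sample, encoding="cp437")
print(safe)
```

In the loop above this would be used as print([console_safe(i) for i in row]), which keeps accented letters and other characters the console does support.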

Thank you. I am currently unable to install packages in my environment. – user3271518

Have you come across miniconda? It doesn't require admin rights to use. – Ogaday

`for row in reader: data = [i.encode('utf-8') for i in row] print data` – SIslam