2017-02-26 300 views
1

我使用蟒蛇外殼組數據analitycs Uploding到大熊貓一堆Excel文件(25個文件) 在此文件https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 我得到的錯誤。不能找到爲什麼以及如何修復它的原因。Pandas.read_excel KeyError異常閱讀的XLSX文件

import pandas as pd 
import numpy as np 
import os 

os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata") 
path = os.getcwd() 

files = os.listdir(path) 
files_xlsx = [f for f in files if f[-4:] == 'xlsx'] 

for f in files_xlsx: 
    print(f) 
    loading = pd.read_excel(f, heading = 0) 
    df = df.append(loading) 
2016-06-20–2016-06-26.xlsx 
2016-06-27–2016-07-03.xlsx 
2016-07-04–2016-07-10.xlsx 
2016-07-11–2016-07-17.xlsx 
2016-08-01–2016-08-07.xlsx 
2016-08-15–2016-08-21.xlsx 

KeyError         Traceback (most recent call last) 
<ipython-input-23-5737d4d13b9f> in <module>() 
     1 df = pd.DataFrame() 
----> 2 pd.read_excel('2016-08-15–2016-08-21.xlsx') 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, engine, squeeze, **kwds) 
    189 
    190  if not isinstance(io, ExcelFile): 
--> 191   io = ExcelFile(io, engine=engine) 
    192 
    193  return io._parse_excel(

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in __init__(self, io, **kwds) 
    247    self.book = xlrd.open_workbook(file_contents=data) 
    248   elif isinstance(io, compat.string_types): 
--> 249    self.book = xlrd.open_workbook(io) 
    250   else: 
    251    raise ValueError('Must explicitly set engine if not passing in' 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 
    420     formatting_info=formatting_info, 
    421     on_demand=on_demand, 
--> 422     ragged_rows=ragged_rows, 
    423    ) 
    424    return bk 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 
    831   x12sheet = X12Sheet(sheet, logfile, verbosity) 
    832   heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) 
--> 833   x12sheet.process_stream(zflo, heading) 
    834   del zflo 
    835 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in own_process_stream(self, stream, heading) 
    546   for event, elem in ET.iterparse(stream): 
    547    if elem.tag == row_tag: 
--> 548     self_do_row(elem) 
    549     elem.clear() # destroy all child elements (cells) 
    550    elif elem.tag == U_SSML12 + "dimension": 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in do_row(self, row_elem) 
    743      else: 
    744       bad_child_tag(child_tag) 
--> 745     value = error_code_from_text[tvalue] 
    746     self.sheet.put_cell(rowx, colx, XL_CELL_ERROR, value, xf_index) 
    747    elif cell_type == "inlineStr": 

KeyError: None 
+1

我無法重現您的錯誤。 'x = pd.read_excel('D:\\ download \\ 2016-08-152016-08-21.xlsx')' - 對我來說工作得很好(Pandas v 0.19.2)。你的Pandas版本是什麼? – MaxU

+0

嘗試了相同的文件。完美加載,沒有錯誤。 但是,當我加載一堆文件錯誤發生。 我使用木星。 –

+0

爲了能夠幫助你,我們需要一個__reproducible__ excel文件(s)... – MaxU

回答

1

事實上,正如@MaxU指出,Excel文件是格式錯誤但有趣的是,在正確保存爲.xls時會解決問題x文件。可能地,通過簡單地將擴展名改爲.xlsx,試圖從先前的.xls版本升級無效文件。這兩種文件格式不是簡單的文本文件,可以更改擴展名而不會造成危害,但二進制格式非常不同

考慮使用wn32com模塊運行COM接口,以使用Excel的Workbook.SaveAs方法將格式錯誤的文件正確保存到實際的OpenXML工作簿中。注意:此解決方案僅適用於安裝了MS Excel的Windows for Python。

import pandas as pd 
import glob 
import win32com.client as win32 

xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx") 

def save_xlsx(srcfile): 
    try: 
     newfile = srcfile.replace('.xlsx', '_new.xlsx') 
     print('Malformed file saved as {}'.format(newfile)) 
     xlApp = win32.gencache.EnsureDispatch('Excel.Application') 
     wb = xlApp.Workbooks.Open(srcfile) 
     wb.SaveAs(newfile, 51)     

    except Exception as e: 
     print(e)    
    finally: 
     wb.Close(True); wb = None 
     xlApp.Quit; xlApp = None  
    return newfile 

def xl_read():  
    dfs = [] 
    for f in xlsxfiles:   
     try: 
      df = pd.read_excel(f) 
     except Exception as e:    
      df = pd.read_excel(save_xlsx(f)) 

     print('File: {}, Shape: {}'.format(f, df.shape)) 
     dfs.append(df)    
    return pd.concat(dfs) 

print('Final dataframe shape: {}'.format(xl_read().shape)) 

輸出

File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27) 
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27) 
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27) 
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28) 
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28) 
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx 
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27) 
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27) 
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx 
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx 
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx 
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27) 
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27) 
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27) 
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx 
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx 
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx 
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx 
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx 
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx 
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx 
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx 
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx 
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx 
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx 
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27) 
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx 
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx 
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx 
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx 
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx 
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx 
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27) 
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx 
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27) 
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27) 
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx 
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27) 
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx 
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27) 
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27) 
Final dataframe shape: (330257, 30) 

考慮甚至通過pyodbc使用Windows的ACE引擎數據庫引擎的方式來查詢與相應的工作簿(的330257行和30列的最終數據幀)熊貓read_sql由於每個分享相同的名稱,TDSheet

#...same as above 
import pyodbc 

def sql_read():  
    dfs = [] 
    for f in xlsxfiles:     
     try: 
      conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\ 
         'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True) 
      df = pd.read_sql('SELECT * FROM [TDSheet$];', conn) 

     except Exception as e: 
      conn.close() 
      conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\ 
         'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True) 
      df = pd.read_excel('SELECT * FROM [TDSheet$];', conn) 
      conn.close() 

     print('File: {}, Shape: {}'.format(f, df.shape)) 
     dfs.append(df) 
+0

解決了!謝謝。 –

1

它看起來像你的一些Excel文件的是畸形的:

import os 
import glob 
import pandas as pd 

excel_files_mask = r'D:\temp\.data\42468475\*.xlsx' 

files = glob.glob(excel_files_mask) 

def merge_excel_files(files, **kwargs): 
    #return pd.concat([pd.read_excel(f, **kwargs) for f in files], 
    #     ignore_index=True) 
    dfs = [] 
    for f in files: 
     #print('processing: [{}]'.format(f)) 
     try: 
      df = pd.read_excel(f, **kwargs) 
      dfs.append(df) 
      print('parsed: [{}], shape: {}'.format(f, df.shape)) 
     except KeyError: 
      print("ERROR: file [{}] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...".format(f)) 
    return pd.concat(dfs, ignore_index=True) 

df = merge_excel_files(files, header=None, skiprows=1) 
print(df.shape) 

產量:

parsed: [D:\temp\.data\42468475\2016-06-20–2016-06-26.xlsx], shape: (5912, 27) 
parsed: [D:\temp\.data\42468475\2016-06-27–2016-07-03.xlsx], shape: (5362, 27) 
parsed: [D:\temp\.data\42468475\2016-07-04–2016-07-10.xlsx], shape: (5387, 27) 
parsed: [D:\temp\.data\42468475\2016-07-11–2016-07-17.xlsx], shape: (5331, 28) 
parsed: [D:\temp\.data\42468475\2016-08-01–2016-08-07.xlsx], shape: (4965, 28) 
ERROR: file [D:\temp\.data\42468475\2016-08-15–2016-08-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
parsed: [D:\temp\.data\42468475\2016-08-22–2016-08-28.xlsx], shape: (5179, 27) 
parsed: [D:\temp\.data\42468475\2016-08-29–2016-09-04.xlsx], shape: (5855, 27) 
ERROR: file [D:\temp\.data\42468475\2016-09-05–2016-09-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-09-12–2016-09-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-09-19–2016-09-25.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
parsed: [D:\temp\.data\42468475\2016-09-26–2016-10-02.xlsx], shape: (7018, 27) 
parsed: [D:\temp\.data\42468475\2016-09.xlsx], shape: (23874, 27) 
parsed: [D:\temp\.data\42468475\2016-10-03–2016-10-09.xlsx], shape: (6587, 27) 
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-30.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10-31–2016-11-06.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-10.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11-07–2016-11-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11-14–2016-11-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11-21–2016-11-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11-28–2016-12-04.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-12-05–2016-12-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-12-12–2016-12-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-12-16–2016-12-22.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2016-12-26–2016-12-31.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-01-01–2017-01-08.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-01-09–2017-01-15.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-01-23–2017-01-29.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-01-30–2017-02-05.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-19.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
parsed: [D:\temp\.data\42468475\test.xlsx], shape: (5315, 27) 
ERROR: file [D:\temp\.data\42468475\Выгрузка 12-15.12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
ERROR: file [D:\temp\.data\42468475\Выгрузка 21-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... 
parsed: [D:\temp\.data\42468475\Выгрузка 26-29.12.xlsx], shape: (4539, 27) 
(85324, 28)