
My text file contains one table for each database. Can pandas read this file and create a separate DataFrame for each database? (Reading multiple datasets from a single file)

Database: ABC
+--------------------------+---------+------------+
| Tables                   | Columns | Total Rows |
+--------------------------+---------+------------+
| ApplicationUpdateBankLog |      13 |          0 |
| ChangeLogTemp            |      12 |    1678363 |
| Sheet2$                  |      10 |        359 |
| tempAllowApplications    |       1 |          9 |
+--------------------------+---------+------------+
4 rows in set.


Database: XYZ
+-----------------------------------+---------+------------+
| Tables                            | Columns | Total Rows |
+-----------------------------------+---------+------------+
| BKP_QualificationDetails_12082014 |      14 |    7959877 |
| BillNotGeneratedCount             |      11 |       2312 |
| VVshipBenefit                     |      19 |     197356 |
| VVBenefit_Bkup29012016            |      19 |     101318 |
+-----------------------------------+---------+------------+
4 rows in set.

Answer


You can use a dict comprehension to create a dict of DataFrames:

import pandas as pd 
from io import StringIO

temp = """Database: ABC
+--------------------------+---------+------------+
| Tables                   | Columns | Total Rows |
+--------------------------+---------+------------+
| ApplicationUpdateBankLog |      13 |          0 |
| ChangeLogTemp            |      12 |    1678363 |
| Sheet2$                  |      10 |        359 |
| tempAllowApplications    |       1 |          9 |
+--------------------------+---------+------------+
4 rows in set.


Database: XYZ
+-----------------------------------+---------+------------+
| Tables                            | Columns | Total Rows |
+-----------------------------------+---------+------------+
| BKP_QualificationDetails_12082014 |      14 |    7959877 |
| BillNotGeneratedCount             |      11 |       2312 |
| VVshipBenefit                     |      19 |     197356 |
| VVBenefit_Bkup29012016            |      19 |     101318 |
+-----------------------------------+---------+------------+
4 rows in set."""
#after testing, replace 'StringIO(temp)' with the path to your file
df = pd.read_csv(StringIO(temp), sep="|", names=['a', 'Tables', 'Columns', 'Total Rows']) 
#keep only values starting with 'Database' in column a, then forward fill the resulting NaN
df.a = df.a.where(df.a.str.startswith('Database')).ffill() 
#drop rows where the Tables column is NaN (separator, footer and blank lines)
df = df.dropna(subset=['Tables']) 
#strip whitespace from all values, set index for selecting in the dict comprehension
df = df.apply(lambda x: x.str.strip()).set_index('a') 
#convert to numeric columns, replace NaN with 0, cast to int
df['Columns'] = pd.to_numeric(df['Columns'], errors='coerce').fillna(0).astype(int) 
df['Total Rows'] = pd.to_numeric(df['Total Rows'], errors='coerce').fillna(0).astype(int) 
#drop the repeated header rows (value 'Tables')
df = df[df['Tables'] != 'Tables'] 
print (df) 
                                          Tables  Columns  Total Rows
a
Database: ABC           ApplicationUpdateBankLog       13           0
Database: ABC                      ChangeLogTemp       12     1678363
Database: ABC                            Sheet2$       10         359
Database: ABC              tempAllowApplications        1           9
Database: XYZ  BKP_QualificationDetails_12082014       14     7959877
Database: XYZ              BillNotGeneratedCount       11        2312
Database: XYZ                      VVshipBenefit       19      197356
Database: XYZ             VVBenefit_Bkup29012016       19      101318

#select each database in a dict comprehension and reset the index to the default RangeIndex
dfs = {x:df.loc[x].reset_index(drop=True) for x in df.index.unique()} 
print (dfs['Database: ABC']) 
                     Tables  Columns  Total Rows
0  ApplicationUpdateBankLog       13           0
1             ChangeLogTemp       12     1678363
2                   Sheet2$       10         359
3     tempAllowApplications        1           9

print (dfs['Database: XYZ']) 
                              Tables  Columns  Total Rows
0  BKP_QualificationDetails_12082014       14     7959877
1              BillNotGeneratedCount       11        2312
2                       VVshipBenefit       19      197356
3             VVBenefit_Bkup29012016       19      101318
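
If the report layout ever varies (different column names or widths), a small hand-rolled parser is an alternative to the read_csv trick. This is only a minimal sketch: it assumes the same 'Database: ...' headers, pipe-drawn tables with 'Columns' and 'Total Rows' columns, and a trailing 'rows in set.' line; the file name 'report.txt' is just a placeholder.

import pandas as pd

def read_report(path):
    #parse the 'Database: ...' report into a dict of DataFrames (sketch)
    dfs, current, header, rows = {}, None, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith('Database:'):
                #a new block starts - remember its name and reset the buffers
                current, header, rows = line, None, []
            elif line.startswith('|'):
                cells = [c.strip() for c in line.strip('|').split('|')]
                if header is None:
                    header = cells        #first pipe line holds the column names
                else:
                    rows.append(cells)    #remaining pipe lines are data rows
            elif line.endswith('rows in set.') and current is not None:
                #block finished - build the DataFrame and convert the numeric columns
                df = pd.DataFrame(rows, columns=header)
                df[['Columns', 'Total Rows']] = df[['Columns', 'Total Rows']].astype(int)
                dfs[current] = df
    return dfs

#'report.txt' is a placeholder for your actual file name
dfs = read_report('report.txt')
print(dfs['Database: ABC'])

Both approaches give a dict keyed by the 'Database: ...' line, so dfs['Database: ABC'] and dfs['Database: XYZ'] are used the same way.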