2016-09-21 211 views
2

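How to create rows and columns in a .csv file from a .log file

I am trying to use Python to parse a file from MTurk into a .csv file with rows and columns. My data looks like this: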

P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;

Currently I have this, which puts everything into columns of a single row:

import pandas as pd 
from pandas import read_table 
log_file = '3BF51CHDTWYBE3LE8DZRA0R5AFGH0H.log' 
df = read_table(log_file, sep=';|,', header=None, engine='python') 

Like this:

P | 14142 | GREEN | 800 | 9 | R | 14597 | 7 | y | NaN | Correct | P | 15605 | #E5DC22 | 800 | 9 | R | 16108

However, I can't seem to break this into multiple rows so that it looks more like:

P | 14142 | GREEN | 800 | 9 | R | 14597 | 7 | y | NaN | Correct |
| P | 15605 |#E5DC22 | 800 | 9 | R | 16108

That is, all the "P"s would be in one column, all the colors in another, the "R"s in another, and so on.

+0

You should format your code to make the question readable... – IanS

+0

Please read up on how to format code. – wander95

Answers

1

You can use

In [16]: df = pd.read_csv('log.txt', lineterminator=';', sep=':', header=None) 

to read the file (say, 'log.txt'), assuming that records are terminated by ';' and that the separator within a record is ':'.

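Unfortunately, your second column will now contain the commas that you want to split on logically. You can split that column on the commas and concatenate the result to the first column: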

In [17]: pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1) 
Out[17]: 
     0  1  2 3 4  5 
0  P 14142 GREEN 800 9  NaN 
1  R 14597  7 y NaN Correct 
2  P 15605 #E5DC22 800 9  NaN 
3  R 16108  7 f NaN Correct 
4  P 17115 GREEN 100 9  NaN 
5  R 17548  7 y NaN Correct 
6  P 18552 #E5DC22 100 9  NaN 
7  R 18972  7 f NaN Correct 
8  P 19979 GREEN 800 9  NaN 
9  R 20379  7 y NaN Correct 
10  P 21387 #E5DC22 800 9  NaN 
11  R 21733  7 f NaN Correct 
12  P 22740  RED 100 9  NaN 
13  R 23139  7 y NaN False 
14  P 24147  BLUE 100 9  NaN 
15  R 24547  7 f NaN False 
16  P 25555  RED 800 9  NaN 
17  R 26043  7 b NaN Correct 
18  P 27051  BLUE 800 9  NaN 
19 \n\n NaN  NaN NaN NaN  NaN 
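The final row (index 19) in the output above appears to come from newlines left over after the last ';' in the file. A minimal sketch of one way to avoid it, assuming the whole log fits in memory and using a shortened, hypothetical sample string in place of the real file: strip the surrounding whitespace before handing the text to read_csv.

```python
import io

import pandas as pd

# Hypothetical two-record sample with trailing newlines, standing in for the log file.
raw = "P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;\n\n"

# Strip surrounding whitespace so nothing is left after the final ';'.
cleaned = raw.strip()

# Same call as in the answer: ';' ends a record, ':' separates the tag from the rest.
df = pd.read_csv(io.StringIO(cleaned), sep=":", header=None, lineterminator=";")
print(df)
```

For a real file, the text could be read with open(...).read() first and then passed through the same two steps.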
0

Another, faster solution:

import pandas as pd 
import numpy as np 
import io 

temp=u"""P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;""" 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=':', header=None, lineterminator=';') 

print (df) 
    0      1 
0 P  ,14142,GREEN,800,9 
1 R ,14597,7,y,NaN,Correct 
2 P ,15605,#E5DC22,800,9 
3 R ,16108,7,f,NaN,Correct 
4 P  ,17115,GREEN,100,9 
5 R ,17548,7,y,NaN,Correct 
6 P ,18552,#E5DC22,100,9 
7 R ,18972,7,f,NaN,Correct 
8 P  ,19979,GREEN,800,9 
9 R ,20379,7,y,NaN,Correct 
10 P ,21387,#E5DC22,800,9 
11 R ,21733,7,f,NaN,Correct 
12 P  ,22740,RED,100,9 
13 R ,23139,7,y,NaN,False 
14 P  ,24147,BLUE,100,9 
15 R ,24547,7,f,NaN,False 
16 P  ,25555,RED,800,9 
17 R ,26043,7,b,NaN,Correct 
18 P  ,27051,BLUE,800,9 

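First, call set_index on the first column, then remove the leading comma with str.strip and create a DataFrame with str.split. Finally, add 1 to the column names and call reset_index.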

df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True) 
df1.columns = df1.columns + 1 
df1.reset_index(inplace=True) 
print (df1) 
    0  1  2 3 4  5 
0 P 14142 GREEN 800 9  None 
1 R 14597  7 y NaN Correct 
2 P 15605 #E5DC22 800 9  None 
3 R 16108  7 f NaN Correct 
4 P 17115 GREEN 100 9  None 
5 R 17548  7 y NaN Correct 
6 P 18552 #E5DC22 100 9  None 
7 R 18972  7 f NaN Correct 
8 P 19979 GREEN 800 9  None 
9 R 20379  7 y NaN Correct 
10 P 21387 #E5DC22 800 9  None 
11 R 21733  7 f NaN Correct 
12 P 22740  RED 100 9  None 
13 R 23139  7 y NaN False 
14 P 24147  BLUE 100 9  None 
15 R 24547  7 f NaN False 
16 P 25555  RED 800 9  None 
17 R 26043  7 b NaN Correct 
18 P 27051  BLUE 800 9  None 
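The question's desired layout puts each "P" record and the "R" record that follows it on the same row, while the result above keeps them on separate rows. A rough sketch of one way to pair them and produce CSV text, assuming every "R" belongs to the most recent "P" and that a "P" may have no "R" (as with the last record). The names raw, rows, wide and csv_text are illustrative, and the sample is shortened:

```python
import pandas as pd

# Shortened sample in the same format as the log (the last P has no R).
raw = (
    "P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;"
    "P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;"
    "P:,17115,GREEN,100,9;"
)

rows = []       # one list of fields per output row
current = None  # the row being built, starting at a "P" record

for record in raw.split(";"):
    record = record.strip()
    if not record:
        continue  # skip the empty piece after the final ';'
    fields = record.split(",")
    fields[0] = fields[0].rstrip(":")  # "P:" -> "P", "R:" -> "R"
    if fields[0] == "P":
        if current is not None:
            rows.append(current)
        current = fields
    elif current is not None:
        current.extend(fields)  # append the R fields to the open P row

if current is not None:
    rows.append(current)

# Shorter rows are padded with missing values by the DataFrame constructor.
wide = pd.DataFrame(rows)

# With no path, to_csv returns the CSV text; pass a filename to write a file instead.
csv_text = wide.to_csv(index=False, header=False)
print(csv_text)
```

Each output row then holds P, timestamp, color and two further values, followed by R and its fields, which matches the column layout sketched in the question.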

Timings

def jez(df): 
    df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True) 
    df1.columns = df1.columns + 1 
    df1.reset_index(inplace=True) 
    return (df1) 

print (jez(df)) 

In [310]: %timeit (pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1)) 
100 loops, best of 3: 4.85 ms per loop 

In [311]: %timeit (jez(df)) 
1000 loops, best of 3: 1.61 ms per loop 
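The %timeit lines above only work inside IPython. As a sketch, a similar comparison can be run from a plain script with the standard timeit module. The sample data is shortened, the name concat_apply for the first answer's approach is made up here, and the absolute numbers will differ by machine and pandas version:

```python
import io
import timeit

import pandas as pd

# Shortened sample; the real comparison above used the full log.
raw = "P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;"
df = pd.read_csv(io.StringIO(raw), sep=":", header=None, lineterminator=";")


def concat_apply(df):
    # First answer: split into lists, expand with apply(pd.Series), then concat.
    parts = df[1].str.split(",").apply(pd.Series).iloc[:, 1:6]
    return pd.concat([df[[0]], parts], axis=1)


def jez(df):
    # Second answer: strip the comma, split with expand=True, renumber columns.
    df1 = df.set_index(0)[1].str.strip(",").str.split(",", expand=True)
    df1.columns = df1.columns + 1
    df1.reset_index(inplace=True)
    return df1


for fn in (concat_apply, jez):
    # Best of 3 repeats of 20 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: fn(df), number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```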
+0

Please check my faster solution. – jezrael