Parsing data with non-uniform rows in Python

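I have a dataset that I want to parse and analyze. I want to pull out specific columns and then split the data into the rows before and the rows after the non-uniform rows. Here is an example of what my data looks like. Note that the three rows in the middle do not match the format of the other rows: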

1386865618963 1 M subject_avatar 3.636229 1.000000 5.422941 30.200327 0.000000 0.000000 
1386865618965 1 M subject_avatar 3.631835 1.000000 5.415390 30.200327 0.000000 0.000000 
1386865618966 2 M subject_avatar 3.627432 1.000000 5.407826 30.200327 0.000000 0.000000 
1386865618968 1 M subject_avatar 3.625223 1.000000 5.404030 30.200327 0.000000 0.000000 
1386865618970 1 M subject_avatar 3.620788 1.000000 5.396411 30.200327 0.000000 0.000000 
1386865618970 0 D 4345048336 
1386865618970 0 D 4345763672 
1386865618971 0 I BOXGEOM (45.0, 0.0, -45.0, 19.0, 3.5, 19.0) {'callback': <bound method YCEnvironment.dropoff of <navigate.YCEnvironment instance at 0x103065440>>, 'cbargs': (0, {'width': 1.75, 'image': <pyepl.display.Image object at 0x102f9da90>, 'height': 4.75, 'volbitSize': (0.5, 0.71999999999999997), 'name': 'Julia'}, {'width': 0.69999999999999996, 'name': 'Flower Patch', 'realpos': (45.0, 0.0, -45.0), 'image': <pyepl.display.Image object at 0x102fc3f50>, 'realsize': (7.0, 3.5, 7.0), 'type': 'store', 'volbitSize': (0.5, 0.5), 'height': 0.34999999999999998}), 'permiable': True} 4926595152 
1386865618972 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865618992 2 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865618996 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865618998 2 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865619002 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865619005 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000 
1386865619008 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000

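I previously asked a question (Parsing specific columns from a dataset in python) about parsing this data into columns, but the columns only showed the number of items in each column, not the items themselves.

I realize these are two separate problems (splitting into columns, and separating the rows before and after the non-uniform rows), but any help with the parsing would be greatly appreciated!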

2014-01-06 Julia

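What do you mean by "separate"? Do you just want to delete the D and I rows, or do you want each cluster of Ms grouped somehow? – DSM

I want to remove the D and I rows, and have the M rows grouped into those that occur before the D and I rows and those that occur after them. – Julia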

A simple idea:

You could preprocess the raw file to skip all the irrelevant lines, perhaps:

with open('raw.txt', 'r') as infile, open('filtered.txt', 'w') as outfile:
    for line in infile:
        # Keep only the uniform rows; a stricter rule may work better.
        if 'subject_avatar' in line:
            outfile.write(line)

You can then process the clean data in filtered.txt with pandas or any other tool.

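with open('d.txt', 'r') as infile:
    lines = infile.readlines()

def is_skipped(line):
    # The second whitespace-separated field is '0' on the D and I rows.
    fields = line.split()
    return len(fields) > 1 and fields[1] == '0'

# Index of the first row to ignore; everything before it is part 1.
start = next((i for i, line in enumerate(lines) if is_skipped(line)), len(lines))
# Step over the consecutive rows to ignore; everything after them is part 2.
end = start
while end < len(lines) and is_skipped(lines[end]):
    end += 1

with open('filtered_part1.txt', 'w') as outfile:
    outfile.writelines(lines[:start])
with open('filtered_part2.txt', 'w') as outfile:
    outfile.writelines(lines[end:])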

An ugly but workable way to do the separation is given above. Basically, it finds the rows marked with 0 and skips them.

2014-01-06 16:37:31 Ray

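Thanks, this works great! Now, do you know how I can tell apart the data before the ignored rows from the data after them? – Julia

@Julia Glad it works. Do you have just one data file like this, or is the sample above only an illustration? – Ray

@Julia One way I can think of is to go through the raw file line by line and check the second or third column (a specific index into the string). Once you hit the rows to ignore, you know that is the end of the first part and the start of the second. – Ray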
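
The line-by-line check described in the last comment also extends to files with more than one interruption. Here is a minimal sketch, assuming the third whitespace-separated field ('M', 'D', or 'I') marks the row type as in the sample data. `split_m_clusters` is a hypothetical helper name chosen for illustration; it is not part of the original answer.

```python
from itertools import groupby


def split_m_clusters(lines):
    """Group consecutive 'M' rows into clusters, dropping the rows between them.

    Assumes the third whitespace-separated field is the row type
    ('M', 'D', or 'I'), as in the sample data in the question.
    """
    def is_m_row(line):
        fields = line.split()
        return len(fields) > 2 and fields[2] == 'M'

    # groupby yields runs of consecutive lines that share the same key,
    # so each unbroken run of M rows becomes one cluster.
    return [list(run) for is_m, run in groupby(lines, key=is_m_row) if is_m]


sample = [
    "1386865618968 1 M subject_avatar 3.625223 1.000000 5.404030 30.200327 0.000000 0.000000",
    "1386865618970 1 M subject_avatar 3.620788 1.000000 5.396411 30.200327 0.000000 0.000000",
    "1386865618970 0 D 4345048336",
    "1386865618970 0 D 4345763672",
    "1386865618972 1 M subject_avatar 3.621182 1.000000 5.396492 30.200327 0.000000 0.000000",
]
before, after = split_m_clusters(sample)
print(len(before), len(after))
```

With a single interruption the result has two clusters, the rows before and the rows after. Note that a blank line also ends a cluster under this rule, so it may be worth stripping empty lines first.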

If you want to leave out the non-uniform rows, you can simply check the length of each row:

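rows = []
with open('raw.txt', 'r') as infile:
    for line in infile:
        row = line.split()
        # The uniform M rows have exactly 10 whitespace-separated fields.
        if len(row) == 10:
            rows.append(row)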
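
Once the rows are uniform, turning them into columns (the other half of the question) can be done by transposing. This is a rough sketch under the assumption that every kept row has the same number of fields. `rows_to_columns` is a hypothetical helper name, and since the question does not say what each column means, the columns are referred to by position only.

```python
def rows_to_columns(rows):
    """Transpose a list of equal-length rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


rows = [
    "1386865618963 1 M subject_avatar 3.636229 1.000000 5.422941 30.200327 0.000000 0.000000".split(),
    "1386865618965 1 M subject_avatar 3.631835 1.000000 5.415390 30.200327 0.000000 0.000000".split(),
]
columns = rows_to_columns(rows)

# Fields are strings after split(), so convert the ones needed for analysis.
timestamps = [int(value) for value in columns[0]]
fifth_column = [float(value) for value in columns[4]]
print(timestamps)
print(fifth_column)
```

Each entry of `columns` holds the actual items from that column rather than a count of them.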

2014-01-06 16:32:32
