2016-01-23 102 views

Text extraction and line splitting with Python

I have the following code:

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    # if "1\t\"Overall evaluation" in line: 
    # words = line.split("1\t\"Overall evaluation") 
    # print words[0] 
    number = int(line.split(':')[1].strip('"\n')) 
    print number 

This grabs the last int on each line of my data, which looks like this:

299 1 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 4 
Strength or novelty of the idea (2): 3 
Strength or novelty of the idea (3): 3 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 2 
""Open by default"" (2): 3 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 2 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 4 
Triple bottom line impact (1): 4 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 2 
Knowledge and skills of the team (1): 3 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 3 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 3" 
299 2 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 3 
Strength or novelty of the idea (2): 2 
Strength or novelty of the idea (3): 4 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 3 
""Open by default"" (2): 2 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 3 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 3 
Triple bottom line impact (1): 3 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 1 
Knowledge and skills of the team (1): 4 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 4 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 2" 

364 1 "Overall evaluation: 3 
Invite to interview: 3 
... 

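I also need to grab the "record identifier". In the example above it is 299 for the first two instances, and then 364 for the next one.

If I delete the last few lines and use only the commented-out code above, like this: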

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    if "1\t\"Overall evaluation" in line: 
        words = line.split("1\t\"Overall evaluation")
        print words[0]
    # number = int(line.split(':')[1].strip('"\n')) 
    # print number 

it can grab the record identifier.

But I'm having trouble putting the two together.

Ideally, what I want is something like the following:

368 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

and so on for all records.

How can I combine the two script components above to achieve this?

You look like an experienced user, so you should know _that_ is not the way to process data in Python. Instead, I suggest you work with dictionaries.

Looks can be deceiving. What do you mean?

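I mean that the dat.txt file is not laid out in a way that is convenient for you to parse. You should try to get it (from wherever it comes from) properly structured, for example as a dictionary, so that the only thing you need to do is pass the key you want (the record identifier, as you call it).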

Answer

Regular expressions are the ticket. You can do it with two patterns. Something like this:

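import re

with open('./dat.txt') as fin:
    for line in fin:
        # A review starts with "<record id> <reviewer no.> "Overall evaluation".
        # \s+ matches tabs as well as spaces between the fields.
        ma = re.match(r'^(\d+)\s+\d+\s+"Overall evaluation', line)
        if ma:
            print("record identifier %r" % ma.group(1))
            # no continue here: this line also carries the first score
        # A score ends the line, optionally followed by the closing quote.
        ma = re.search(r':\s*(\d+)"?\s*$', line)
        if ma:
            print(ma.group(1))
            continue
        print("unrecognized line: %s" % line)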

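Note: the final print statement isn't part of what you asked for, but whenever I debug regular expressions I always add some kind of catch-all to help spot bad patterns. Once I have my patterns working, I remove the catch-all.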
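
To get from the two patterns to the output format asked for in the question (the record identifier once, then one "=a+b+..." line per review), the scores can be collected in a dictionary keyed by record identifier, as the comments suggest. The following is only a sketch under a few assumptions: Python 3, fields separated by tabs or spaces, and every review starting with an "Overall evaluation" line. The names parse_reviews and format_records are made up for illustration, and the sample is a heavily abridged copy of the data in the question.

```python
import re

# First line of a review: <record id> <reviewer no.> "Overall evaluation: N
HEADER = re.compile(r'^(\d+)\s+(\d+)\s+"Overall evaluation')
# Any score line: "...: N", optionally followed by the closing quote
SCORE = re.compile(r':\s*(\d+)"?\s*$')


def parse_reviews(lines):
    """Return {record_id: [[scores of review 1], [scores of review 2], ...]}."""
    records = {}
    current = None  # score list of the review currently being read
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Start a new review under this record identifier.
            current = []
            records.setdefault(header.group(1), []).append(current)
        score = SCORE.search(line)
        if score and current is not None:
            # The header line carries a score too, so it is not skipped.
            current.append(int(score.group(1)))
    return records


def format_records(records):
    """Render each record id followed by one '=a+b+...' line per review."""
    out = []
    for record_id, reviews in records.items():
        out.append(record_id)
        for scores in reviews:
            out.append("=" + "+".join(str(s) for s in scores))
    return "\n".join(out)


# Abridged sample in the shape of dat.txt (tab-separated is an assumption).
sample = '''299\t1\t"Overall evaluation: 3
Invite to interview: 3
""Open by default"" (1): 2
Appropriateness of the budget to realise the idea: 3"
299\t2\t"Overall evaluation: 3
Appropriateness of the budget to realise the idea: 2"

364\t1\t"Overall evaluation: 3
Invite to interview: 3
'''

print(format_records(parse_reviews(sample.splitlines())))
```

For the real file, pass the open file object instead of the sample, for example parse_reviews(open('./dat.txt')). Lines that match neither pattern (blank lines, for instance) are silently ignored here; while tuning the patterns it may be worth reporting them, as in the catch-all above.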