Text extraction and line splitting with Python

I have the following code:

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    # if "1\t\"Overall evaluation" in line: 
    # words = line.split("1\t\"Overall evaluation") 
    # print words[0] 
    number = int(line.split(':')[1].strip('"\n')) 
    print number

This manages to grab the trailing int from each line of my data, which looks like this:

299 1 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 4 
Strength or novelty of the idea (2): 3 
Strength or novelty of the idea (3): 3 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 2 
""Open by default"" (2): 3 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 2 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 4 
Triple bottom line impact (1): 4 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 2 
Knowledge and skills of the team (1): 3 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 3 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 3" 
299 2 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 3 
Strength or novelty of the idea (2): 2 
Strength or novelty of the idea (3): 4 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 3 
""Open by default"" (2): 2 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 3 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 3 
Triple bottom line impact (1): 3 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 1 
Knowledge and skills of the team (1): 4 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 4 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 2" 

364 1 "Overall evaluation: 3 
Invite to interview: 3 
...

I also need to grab the "record identifier". In the example above it is 299 for the first two instances, and then 364 for the next one.

If I comment out the last few lines and use only the commented-out code above, like this:

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    if "1\t\"Overall evaluation" in line: 
        words = line.split("1\t\"Overall evaluation") 
        print words[0] 
    # number = int(line.split(':')[1].strip('"\n')) 
    # print number

it can grab the record identifier.

But I'm having trouble putting the two together.

Ideally, what I want is something like the following:

368 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2

and so on for all the records.

How can I combine the two script components above to achieve this?

Source

2016-01-23 s.matthew.english

You look like an experienced user who should know that _that_ is not the way to handle data in Python. Instead, I suggest you work with dictionaries.

Looks can be deceiving. What do you mean?

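I mean that the dat.txt file is not laid out in a way that makes it easy for you to parse. You should try to get it (from wherever it comes) properly structured, for example as a dictionary, so that the only thing you need to do is pass the key you want (the record identifier, as you call it).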
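As a rough illustration of what the comment above may have in mind, here is a minimal sketch of a dictionary keyed by record identifier. The nesting (record identifier, then evaluation number, then criterion name) and the name `records` are assumptions made for this example, not something the commenter specified, and each evaluation is cut down to its first two criteria from the sample data for brevity.

```python
# Hypothetical layout (an assumption, not from the original post):
# record identifier -> evaluation number -> {criterion: score}.
# Only the first two criteria of each evaluation are shown.
records = {
    299: {
        1: {"Overall evaluation": 3, "Invite to interview": 3},
        2: {"Overall evaluation": 3, "Invite to interview": 3},
    },
}

# With this shape, a lookup is just a matter of passing the keys you want.
first_evaluation = records[299][1]
total = sum(first_evaluation.values())
print(total)
```

Whether the data can be delivered in this shape depends on where dat.txt comes from; if it cannot, the file still has to be parsed line by line.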

Regular expressions are the ticket. You can do it with two patterns. Something like this:

import re 

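with open('./dat.txt') as fin: 
    for line in fin: 
        # Header line: record identifier, evaluation number, then the first score.
        ma = re.match(r'^(\d+)\s+\d+\s+"Overall evaluation: (\d+)', line) 
        if ma: 
            print("record identifier %r" % ma.group(1)) 
            print(ma.group(2)) 
            continue 
        # Any other score line; the last line of a block ends with a closing quote.
        ma = re.search(r': (\d+)"?\s*$', line) 
        if ma: 
            print(ma.group(1)) 
            continue 
        print("unrecognized line: %s" % line)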

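Note: the final print statement is not part of what you asked for, but whenever I debug regular expressions I always add some kind of catch-all to help me spot bad patterns. Once my patterns work, I remove the catch-all.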
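To get closer to the output format requested in the question (each record identifier once, followed by one `=a+b+...` line per evaluation), the two patterns can feed a small accumulator. This is only a sketch under some assumptions: the function name `format_records` is made up for this example, the columns are taken to be separated by whitespace (tabs or spaces), and every evaluation is assumed to start with an "Overall evaluation" line.

```python
import re

# Header line: record identifier, evaluation number, then the first score.
HEADER = re.compile(r'^(\d+)\s+\d+\s+"Overall evaluation: (\d+)')
# Any other score line; the last line of a block ends with a closing quote.
SCORE = re.compile(r': (\d+)"?\s*$')


def format_records(lines):
    """Return the record identifiers and one '=a+b+...' line per evaluation."""
    out = []
    last_id = None
    scores = []
    for line in lines:
        header = HEADER.match(line)
        if header:
            # A new evaluation starts: flush the scores of the previous one.
            if scores:
                out.append("=" + "+".join(scores))
            record_id = header.group(1)
            if record_id != last_id:
                out.append(record_id)
                last_id = record_id
            scores = [header.group(2)]
            continue
        score = SCORE.search(line)
        if score:
            scores.append(score.group(1))
    # Flush the final evaluation.
    if scores:
        out.append("=" + "+".join(scores))
    return out


# Small demonstration on a few lines shaped like the sample data.
demo = [
    '299\t1\t"Overall evaluation: 3 \n',
    'Invite to interview: 3 \n',
    'Appropriateness of the budget to realise the idea: 3" \n',
]
for row in format_records(demo):
    print(row)
```

For the real file, something like `with open('./dat.txt') as fin: print("\n".join(format_records(fin)))` should work, since an open file can be iterated line by line. Lines that match neither pattern, such as blank lines, are silently skipped here; keep a catch-all print while debugging if you want to see them.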

Source

2016-01-23 17:44:52 user590028
