2017-10-13

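How to correctly read double newlines in Python

This is a continuation of my earlier post here, where I was struggling to parse a RIS file. I have since merged some code into a new parser that reads a single record correctly. Unfortunately, the code stops after the first record, and I don't know how to distinguish the end of the file from the double newline that separates individual records. Any ideas?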

The input file is shown here:

Record #1 of 306 
ID: CN-01160769 
AU: Uedo N 
AU: Yao K 
AU: Muto M 
AU: Ishikawa H 
TI: Development of an E-learning system. 
SO: United European Gastroenterology Journal 
YR: 2015 
VL: 3 
NO: 5 SUPPL. 1 
PG: A490 
XR: EMBASE 72267184 
PT: Journal: Conference Abstract 
DOI: 10.1177/2050640615601623 
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/769/CN-01160769/frame.html 


Record #2 of 306 
ID: CN-01070265 
AU: Krogh LQ 
AU: Bjornshave K 
AU: Vestergaard LD 
AU: Sharma MB 
AU: Rasmussen SE 
AU: Nielsen HV 
AU: Thim T 
AU: Lofgren B 
TI: E-learning in pediatric basic life support: A randomized controlled non-inferiority study. 
SO: Resuscitation 
YR: 2015 
VL: 90 
PG: 7-12 
XR: EMBASE 2015935529 
PT: Journal: Article 
DOI: 10.1016/j.resuscitation.2015.01.030 
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/265/CN-01070265/frame.html 


Record #3 of 306 
ID: CN-00982835 
AU: Worm BS 
AU: Jensen K 
TI: Does peer learning or higher levels of e-learning improve learning abilities? 
SO: Medical education online 
YR: 2013 
VL: 18 
NO: 1 
PG: 21877 
PM: PUBMED 28166018 
XR: EMBASE 24229729 
PT: Journal Article; Randomized Controlled Trial 
DOI: 10.3402/meo.v18i0.21877 
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/835/CN-00982835/frame.html 

And the code is pasted below:

import re


# Function to process single record
def read_record(infile):
    line = infile.readline()
    line = line.strip()

    if not line:
        # End of file
        return None

    if not line.startswith("Record"):
        raise TypeError("Not a proper file: %r" % line)

    # Read tags and fields
    tags = []
    fields = []
    while True:
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
        match = prog.match(line)
        tag = match.groups()[0]
        field = match.groups()[1]
        tags.append(tag)
        fields.append(field)

    return [tags, fields]


# Function to loop through records
def read_records(input_file):
    records = []
    while True:
        record = read_record(input_file)
        if record is None:
            break
        records.append(record)
    return records


infile = open("test.txt")

for record in read_records(infile):
    print(record)

`not line` definitely does not mean end of file. – ForceBru


The `if not line:` check must be done before the `.strip()`. – jasonharper


@jasonharper That doesn't work. – Andrej
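
To make the comments above concrete: `readline()` returns the empty string only at end of file, while a blank line comes back as "\n". Moving the check before `.strip()` fixes end-of-file detection, but the second blank line of the separator then reaches the `startswith("Record")` test and raises `TypeError`, which may be why that change alone did not help. Here is a minimal sketch that skips blank lines before looking at the content. The helper name `next_nonblank_line` and the in-memory sample are assumptions made for illustration; they are not part of the original code.

```python
import io


def next_nonblank_line(infile):
    """Return the next non-blank line (stripped), or None at end of file."""
    while True:
        raw = infile.readline()
        if raw == "":
            # readline() returns "" only when the file is exhausted
            return None
        if raw.strip():
            # a real line; blank lines ("\n") fall through and are skipped
            return raw.strip()


# Hypothetical two-record sample using the double blank-line separator
sample = io.StringIO("Record #1 of 2\nID: A\n\n\nRecord #2 of 2\nID: B\n")
lines = []
while True:
    line = next_nonblank_line(sample)
    if line is None:
        break
    lines.append(line)
print(lines)
```

With a helper like this, `read_record` could ask for the next non-blank line and treat `None` as the only end-of-file signal.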

Answers


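Look at how to iterate over a file line by line using `for line in infile:`. There is no need to test for end of file by comparing against '', because the for loop iteration does that for you: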

for line in infile:
    # remove trailing newlines, and truncate lines that
    # are all-whitespace down to just ''
    line = line.rstrip()

    if line:
        # there is something on this line
        pass
    else:
        # this is a blank line - but it is definitely NOT the end-of-file
        pass
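
As one possible way to build on that loop, the sketch below groups lines into records with a generator. It assumes every record starts with a line beginning with "Record #", as in the sample input. The function name `iter_records` and the in-memory sample are hypothetical and used only for illustration.

```python
import io


def iter_records(lines):
    """Yield each record as a list of its non-blank lines."""
    record = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # blank lines are separators, not end-of-file
        if line.startswith("Record #") and record:
            yield record  # a new header closes the previous record
            record = []
        record.append(line)
    if record:
        yield record  # the loop ending is the end-of-file signal


# Hypothetical sample with the double blank-line separator
sample = io.StringIO(
    "Record #1 of 2\nID: A\nAU: X\n\n\nRecord #2 of 2\nID: B\n"
)
records = list(iter_records(sample))
print(len(records))
```

Because records are closed by the next header or by the end of the loop, the number of blank lines between them does not matter, and the last record is not lost when the file has no trailing blank lines.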

As suggested by @PaulMcG, here is a solution that iterates over the file line by line.

import re

records = []
current_record = None
count_newlines = 0
prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")

# encoding="utf-8-sig" drops a leading byte order mark (BOM), if present;
# this assumes the file is UTF-8 encoded
with open("test.ris", encoding="utf-8-sig") as infile:
    for line in infile:
        line = line.rstrip()
        if line:
            if line.startswith("Record"):
                # header line: start a new record
                count_newlines = 0
                current_record = {}
                continue
            match = prog.match(line)
            if match is None or current_record is None:
                continue  # not a "TAG: value" line inside a record
            tag, field = match.groups()
            if tag == "AU":
                # a record can list several authors, so collect them
                current_record.setdefault(tag, []).append(field)
            else:
                current_record[tag] = field
        else:
            count_newlines += 1
            # the second consecutive blank line closes the current record
            if count_newlines == 2 and current_record is not None:
                records.append(current_record)
                current_record = None

# the last record may not be followed by two blank lines
if current_record is not None:
    records.append(current_record)

print("# of records:", len(records))
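
The tag pattern used above can be checked in isolation. This short sketch wraps it in a helper, `parse_field`, which is a hypothetical name introduced here for illustration. It returns a `(tag, field)` tuple, or `None` for lines that are not tagged fields, such as record headers and blank lines.

```python
import re

# Same pattern as in the answer: a 2- or 3-character tag, then ": ", then the value
prog = re.compile(r"^([A-Z][A-Z0-9][A-Z]?): (.*)")


def parse_field(line):
    """Return (tag, field) for a "TAG: value" line, or None otherwise."""
    match = prog.match(line)
    if match is None:
        return None
    return match.groups()


for text in ["AU: Uedo N", "DOI: 10.1177/2050640615601623", "Record #1 of 306", ""]:
    print(repr(text), parse_field(text))
```

Checking for `None` before calling `.groups()` avoids an `AttributeError` on any line that does not fit the pattern.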