Extracting information from large structured text files

I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern "No.999999999 dd/mm/yyyy ZZZ". Here is some sample data.

No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI :16404287000155
Procurador: MARCELLO DO NASCIMENTO

No.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI :34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.
Procurador: WALDEMAR RODRIGUES PEDRA

No.900148764 11/01/2007 LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: Marcia Ferreira Gomes
*Escritório: Marcas Marcantes e Patentes Ltda
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca Considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197

I wrote some code that parses it accordingly. Is there anything I could improve, for readability or performance? Here is what I have so far:

import re, pprint

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    regexp = {
        re.compile(r'No.([\d]{9}) ([\d]{2}/[\d]{2}/[\d]{4}) (.*)'): lambda self: self._processo,
        re.compile(r'Tit.(.*)'): lambda self: self._titular,
        re.compile(r'Procurador: (.*)'): lambda self: self._procurador,
        re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'): lambda self: self._documento,
        re.compile(r'Apres.: (.*) ; Nat.: (.*)'): lambda self: self._apresentacao,
        re.compile(r'Marca: (.*)'): lambda self: self._marca,
        re.compile(r'Clas.Prod/Serv: (.*)'): lambda self: self._classe,
        re.compile(r'\*(.*)'): lambda self: self._complemento,
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def _processo(self, matches):
        self.processo, self.data, self.despacho = matches.groups()

    def _titular(self, matches):
        self.titular = matches.group(1)

    def _procurador(self, matches):
        self.procurador = matches.group(1)

    def _documento(self, matches):
        self.documento = matches.group(1)

    def _apresentacao(self, matches):
        self.apresentacao, self.natureza = matches.groups()

    def _marca(self, matches):
        self.marca = matches.group(1)

    def _classe(self, matches):
        self.classe = matches.group(1)

    def _complemento(self, matches):
        self.complemento.append(matches.group(1))

    def read(self, line):
        for pattern in Despacho.regexp:
            m = pattern.match(line)
            if m:
                Despacho.regexp[pattern](self)(m)


def process(rpi):
    """
    read data and process each group
    """
    rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)


arquivo = open('rm1972.txt') # file to process
for desp in process(arquivo):
    pprint.pprint(desp.__dict__)
    print('--------------')

2009-01-26 Luiz Damim

This is pretty good. Below are some suggestions; let me know if you like them:

import re
import pprint
import sys

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    # used a dict with the keys instead of functions.
    regexp = {
        ('processo',
         'data',
         'despacho'): re.compile(r'No.([\d]{9}) ([\d]{2}/[\d]{2}/[\d]{4}) (.*)'),
        ('titular',): re.compile(r'Tit.(.*)'),
        ('procurador',): re.compile(r'Procurador: (.*)'),
        ('documento',): re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'),
        ('apresentacao',
         'natureza'): re.compile(r'Apres.: (.*) ; Nat.: (.*)'),
        ('marca',): re.compile(r'Marca: (.*)'),
        ('classe',): re.compile(r'Clas.Prod/Serv: (.*)'),
        ('complemento',): re.compile(r'\*(.*)'),
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def read(self, line):
        # dict.items() replaces the Python 2 iteritems()
        for attrs, pattern in Despacho.regexp.items():
            m = pattern.match(line)
            if m:
                for groupn, attr in enumerate(attrs):
                    # special case complemento:
                    if attr == 'complemento':
                        self.complemento.append(m.group(groupn + 1))
                    else:
                        # set the attribute on the object
                        setattr(self, attr, m.group(groupn + 1))

    def __repr__(self):
        # defines object printed representation
        d = {}
        for attrs in self.regexp:
            for attr in attrs:
                d[attr] = getattr(self, attr, None)
        return pprint.pformat(d)

def process(rpi):
    """
    read data and process each group
    """
    # Useless line, since you're doing a for anyway
    # rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)

def main():
    arquivo = open('rm1972.txt') # file to process
    for desp in process(arquivo):
        print(desp) # can print directly here.
        print('-' * 20)
    return 0

if __name__ == '__main__':
    main()

2009-01-27 00:55:51 nosklo

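I like the way you defined the regular expressions. It is easier to read and maintain, since I don't need to define a bunch of functions to store the values. – 2009-01-27 10:11:05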

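It would be easier to help if you had a specific concern. Performance will depend on the efficiency of the particular regex engine you are using. 100K lines in a single file doesn't sound that big, but again that depends on your environment.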

I use Expresso in my .NET development to test expressions for accuracy and performance. A Google search turned up Kodos, a GUI tool for authoring Python regular expressions.
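
Whether 100K lines is "big" for a given environment is easy to measure rather than guess. The following is a minimal sketch, not a benchmark of the real file: the `SAMPLE` lines, the reduced `PATTERNS` list and the helper names `count_matches` and `time_parse` are all assumptions made up for illustration.

```python
import re
import time

# Hypothetical stand-ins for the real data: a few line shapes from the sample.
SAMPLE = [
    "No.813829461 16/09/1987 270",
    "Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)",
    "Procurador: MARCELLO DO NASCIMENTO",
    "",
]

# A reduced set of patterns (illustrative, with the literal dots escaped).
PATTERNS = [
    re.compile(r'No\.(\d{9}) (\d{2}/\d{2}/\d{4}) (.*)'),
    re.compile(r'Tit\.(.*)'),
    re.compile(r'Procurador: (.*)'),
]


def count_matches(lines):
    """Try every pattern on every line and count the hits."""
    hits = 0
    for line in lines:
        for pattern in PATTERNS:
            if pattern.match(line):
                hits += 1
    return hits


def time_parse(n_lines=100000):
    """Return (hits, elapsed seconds) for a synthetic input of n_lines lines."""
    lines = [SAMPLE[i % len(SAMPLE)] for i in range(n_lines)]
    start = time.perf_counter()
    hits = count_matches(lines)
    return hits, time.perf_counter() - start


if __name__ == '__main__':
    hits, elapsed = time_parse()
    print('%d matches in %.3f s' % (hits, elapsed))
```

Timing the real patterns against the real file is the only number that matters; the synthetic input here only shows the shape of the measurement.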

2009-01-27 00:04:04

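It's not like I'm trying to do premature optimization here. This is my first concrete implementation in Python (coming from a PHP background), and I just want to know if I'm doing it right. :) – 2009-01-27 10:06:52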

It looks pretty good overall, but why do you have this line:

rpi = (line for line in rpi)

You can already iterate over the file object without this intermediate step.
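
A small sketch of why the wrapper adds nothing. It assumes an in-memory `io.StringIO` as a stand-in for the real file, so it runs without `rm1972.txt`; the variable names are illustrative.

```python
import io

# io.StringIO stands in for open('rm1972.txt'); both are iterated line by line.
fake_file = io.StringIO("No.813829461 16/09/1987 270\nTit.EXAMPLE\n\n")

# Iterating the file object directly.
lines_direct = [line for line in fake_file]

# Iterating through the extra generator expression from the question.
fake_file.seek(0)
lines_wrapped = [line for line in (line for line in fake_file)]

print(lines_direct == lines_wrapped)
```

Both loops see exactly the same lines, so the generator expression only adds one more layer of iteration.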

2009-01-27 00:40:43 Kiv

You're right, that line is completely useless. I forgot that opening a file already returns a generator. Thanks. – 2009-01-27 10:07:23

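I wouldn't use regular expressions here. If you know that your lines will start with fixed strings, why not check for those strings and write the logic around them?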

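records = []     # completed records
record = {}      # fields of the record being read
curr_key = None  # last key seen, used for continuation lines

for line in open(filename):
    if line.startswith('No.'):
        curr_key = 'No'
        record['No'] = line[3:]
    # ....
    # ...
    elif line.strip() == '':
        # store the record and clear the dict
        records.append(record)
        record = {}
    else:
        # append the line to the last key seen; this is when a field
        # overflows onto the next line
        record[curr_key] = record[curr_key] + "\n" + line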

Consider the above code as just pseudocode.
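
A runnable take on the pseudocode above, as a minimal sketch. The `PREFIXES` table, its key names and the `parse_records` helper are assumptions for illustration; only four prefixes are listed, so in this sketch any other line (for example one starting with `*`) is treated as a continuation of the previous field.

```python
import io

# Prefix -> key. The prefixes come from the sample data; the key names are
# illustrative choices, not part of the original answer.
PREFIXES = [
    ('No.', 'processo'),
    ('Tit.', 'titular'),
    ('Procurador:', 'procurador'),
    ('Marca:', 'marca'),
]


def parse_records(lines):
    """Yield one dict per blank-line-separated group."""
    record, curr_key = {}, None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            # Blank line: the current record (if any) is complete.
            if record:
                yield record
            record, curr_key = {}, None
            continue
        for prefix, key in PREFIXES:
            if line.startswith(prefix):
                curr_key = key
                record[key] = line[len(prefix):].strip()
                break
        else:
            # No known prefix: continuation of the last field seen.
            if curr_key is not None:
                record[curr_key] += '\n' + line
    if record:
        # The input may not end with a blank line.
        yield record


sample = io.StringIO(
    "No.813829461 16/09/1987 270\n"
    "Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)\n"
    "Procurador: MARCELLO DO NASCIMENTO\n"
    "\n"
    "No.900148764 11/01/2007 LD3\n"
    "Tit.TIARA BOLSAS E CALÇADOS LTDA\n"
)
records = list(parse_records(sample))
```

Using `str.startswith` with a table of prefixes keeps the branches out of the loop body, so supporting another field only means adding one tuple.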

2009-01-27 19:29:55

Another version, with a single combined regular expression:

#!/usr/bin/env python3

import re
import pprint
import sys

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    # one combined pattern; each alternative captures into named groups
    regexp = re.compile(
        r'No.(?P<processo>[\d]{9}) (?P<data>[\d]{2}/[\d]{2}/[\d]{4}) (?P<despacho>.*)'
        r'|Tit.(?P<titular>.*)'
        r'|Procurador: (?P<procurador>.*)'
        r'|C.N.P.J./C.I.C./N INPI :(?P<documento>.*)'
        r'|Apres.: (?P<apresentacao>.*) ; Nat.: (?P<natureza>.*)'
        r'|Marca: (?P<marca>.*)'
        r'|Clas.Prod/Serv: (?P<classe>.*)'
        r'|\*(?P<complemento>.*)')

    simplefields = ('processo', 'data', 'despacho', 'titular', 'procurador',
                    'documento', 'apresentacao', 'natureza', 'marca', 'classe')

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single
        registry
        """
        self.__dict__ = dict.fromkeys(self.simplefields)
        self.complemento = []

    def parse(self, line):
        m = self.regexp.match(line)
        if m:
            # keep only the groups that actually captured something
            gd = dict((k, v) for k, v in m.groupdict().items() if v)
            if 'complemento' in gd:
                self.complemento.append(gd['complemento'])
            else:
                self.__dict__.update(gd)

    def __repr__(self):
        # defines object printed representation
        return pprint.pformat(self.__dict__)

def process(rpi):
    """
    read data and process each group
    """
    d = None

    for line in rpi:
        if line.startswith('No.'):
            if d:
                yield d
            d = Despacho()
        if d is not None: # ignore anything before the first 'No.' line
            d.parse(line)
    if d is not None: # an empty file yields nothing
        yield d

def main():
    arquivo = open('rm1972.txt') # file to process
    for desp in process(arquivo):
        print(desp) # can print directly here.
        print('-' * 20)

if __name__ == '__main__':
    main()

2009-01-27 21:42:46 akaihola
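
To see what the combined pattern hands back, here is a minimal sketch using a reduced pattern. `COMBINED` and `matched_fields` are hypothetical names, the literal dots are escaped, and unmatched groups are filtered with `is not None` rather than the truthiness test used above, which is an assumption about the desired handling of empty values.

```python
import re

# A reduced version of the combined pattern (three alternatives only).
COMBINED = re.compile(
    r'No\.(?P<processo>\d{9}) (?P<data>\d{2}/\d{2}/\d{4}) (?P<despacho>.*)'
    r'|Tit\.(?P<titular>.*)'
    r'|\*(?P<complemento>.*)')


def matched_fields(line):
    """Return only the named groups that took part in the match."""
    m = COMBINED.match(line)
    if m is None:
        return {}
    # groupdict() lists every named group; groups from the alternatives
    # that did not match are None.
    return {k: v for k, v in m.groupdict().items() if v is not None}


print(matched_fields('No.813829461 16/09/1987 270'))
print(matched_fields('Tit.TIARA BOLSAS E CALÇADOS LTDA'))
print(matched_fields('Procurador: MARCELLO DO NASCIMENTO'))
```

Because only one alternative can match a given line, the returned dict holds just the fields for that line type, and a line with an unknown prefix gives an empty dict.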
