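Extracting information from large structured text files

I need to read some large files (from 50k to 100k lines) that are structured as groups separated by empty lines. Each group starts with the same pattern, "No.999999999 dd/mm/yyyy ZZZ". Here is some sample data.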
No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI :16404287000155
Procurador: MARCELLO DO NASCIMENTO

No.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI :34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.
Procurador: WALDEMAR RODRIGUES PEDRA

No.900148764 11/01/2007 LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: MARCIA FERREIRA GOMES
*Escritório: MARCAS MARCANTES E PATENTES LTDA
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca Considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197
I wrote some code that parses it accordingly. Is there anything I can improve, for readability or performance? Here is what I have so far:
import re
import pprint


class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    # Maps each compiled pattern to a function that returns the bound handler.
    # Literal dots are escaped so they no longer match arbitrary characters.
    regexp = {
        re.compile(r'No\.(\d{9}) (\d{2}/\d{2}/\d{4}) (.*)'): lambda self: self._processo,
        re.compile(r'Tit\.(.*)'): lambda self: self._titular,
        re.compile(r'Procurador: (.*)'): lambda self: self._procurador,
        # 'N.INPI' accepts both 'N INPI' and 'NºINPI', which appear in the sample data
        re.compile(r'C\.N\.P\.J\./C\.I\.C\./N.INPI :(.*)'): lambda self: self._documento,
        re.compile(r'Apres\.: (.*) ; Nat\.: (.*)'): lambda self: self._apresentacao,
        re.compile(r'Marca: (.*)'): lambda self: self._marca,
        re.compile(r'Clas\.Prod/Serv: (.*)'): lambda self: self._classe,
        re.compile(r'\*(.*)'): lambda self: self._complemento,
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def _processo(self, matches):
        self.processo, self.data, self.despacho = matches.groups()

    def _titular(self, matches):
        self.titular = matches.group(1)

    def _procurador(self, matches):
        self.procurador = matches.group(1)

    def _documento(self, matches):
        self.documento = matches.group(1)

    def _apresentacao(self, matches):
        self.apresentacao, self.natureza = matches.groups()

    def _marca(self, matches):
        self.marca = matches.group(1)

    def _classe(self, matches):
        self.classe = matches.group(1)

    def _complemento(self, matches):
        self.complemento.append(matches.group(1))

    def read(self, line):
        for pattern in Despacho.regexp:
            m = pattern.match(line)
            if m:
                Despacho.regexp[pattern](self)(m)
                break  # each line matches at most one pattern


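def process(rpi):
    """
    read data and process each group
    """
    d = None  # group being built; None while outside a group
    for line in rpi:
        if line.startswith('No.'):
            if d is not None:  # previous group was not closed by an empty line
                yield d
            d = Despacho()
        if d is None:
            continue  # ignore anything before the first 'No.' header
        if not line.strip():  # empty line - end of block
            yield d
            d = None
        else:
            d.read(line)
    if d is not None:  # last group when the file does not end with an empty line
        yield d


with open('rm1972.txt') as arquivo:  # file to process
    for desp in process(arquivo):
        pprint.pprint(desp.__dict__)
        print('--------------')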
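
The grouping step in process() could also be written with itertools.groupby from the standard library. What follows is only a minimal sketch, assuming that groups really are always separated by at least one empty line, as described in the question. The name iter_groups and the shortened sample are made up for illustration.

```python
from itertools import groupby


def iter_groups(lines):
    """Yield each blank-line-separated group as a list of stripped lines.

    iter_groups is a hypothetical helper name, not part of the original code.
    """
    # groupby splits the stream into alternating runs of blank and non-blank lines.
    for has_text, run in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield [line.strip() for line in run]


# Shortened sample in the same layout as the data in the question.
sample = """\
No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)

No.815326777 28/12/1989 351
Marca: TRIO TROPICAL
""".splitlines()

groups = list(iter_groups(sample))
for group in groups:
    print(group)
```

Because groupby consumes its input lazily, the same function should work on an open file object without loading 100k lines into memory. The trade-off is that two groups with no empty line between them would be merged into one.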
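I like the way you defined the regular expressions. It is easier to read and maintain, because I don't need to define a bunch of functions to store the values. – 2009-01-27 10:11:05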
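
The code this comment refers to is not shown here. One way to get the effect it describes (no per-field storing functions) is to use named groups and let the group names act as field names. This is a hedged sketch and not a reconstruction of that answer: PATTERNS, REPEATABLE and parse_group are hypothetical names, and only a few of the fields from the question are covered.

```python
import re

# Hypothetical pattern table: every named group becomes a field of the record.
PATTERNS = [
    re.compile(r'No\.(?P<processo>\d{9}) (?P<data>\d{2}/\d{2}/\d{4}) (?P<despacho>.*)'),
    re.compile(r'Tit\.(?P<titular>.*)'),
    re.compile(r'Procurador: (?P<procurador>.*)'),
    re.compile(r'Marca: (?P<marca>.*)'),
    re.compile(r'\*(?P<complemento>.*)'),
]

# Fields that may occur several times in one group are collected in lists.
REPEATABLE = {'complemento'}


def parse_group(lines):
    """Build a plain dict from the lines of one group."""
    record = {name: [] for name in REPEATABLE}
    for line in lines:
        for pattern in PATTERNS:
            m = pattern.match(line)
            if m:
                for name, value in m.groupdict().items():
                    if name in REPEATABLE:
                        record[name].append(value)
                    else:
                        record[name] = value
                break  # a line matches at most one pattern
    return record


record = parse_group([
    'No.900148764 11/01/2007 LD3',
    'Tit.TIARA BOLSAS E CALÇADOS LTDA',
    '*Escritório: MARCAS MARCANTES E PATENTES LTDA',
])
print(record)
```

Adding a field then means adding one pattern to the table. The cost is that the record is a plain dict rather than an object with attributes.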