Extract multiple patterns and save them to a pandas DataFrame [Python]

My text file looks like this:

Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 
follows here <br/>Description: Text 2 follows <br/> blah blah 
blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> 
blah blah blah Description: Text 4 follows <br/> blah blah 
blah Cause: Cause Text 4 follows<br/>

I want all the descriptions and causes in a structured pandas DataFrame for NLP:

Description    Cause 
Text 1 follows  Cause Text 1 follows here 
Text 2 follows  Cause Text 2 follows here 
Text 3 follows  
Text 4 follows  Cause Text 4 follows here

What I have done so far:

re.findall(r'Description:(.*?)<br/>',textfile) 
re.findall(r'Cause:(.*?)<br/>',textfile)

However, this does not let me keep each Description lined up with its Cause when I try to build a larger DataFrame!
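To make the mismatch concrete, here is a small sketch that runs the two patterns above on the sample input. The variable `text` is a stand-in for `textfile`, with the sample joined onto one line; that layout is an assumption about the real file.

```python
import re

# Stand-in for the question's `textfile` (sample input joined onto one line).
text = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 "
    "follows here <br/>Description: Text 2 follows <br/> blah blah "
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> "
    "blah blah blah Description: Text 4 follows <br/> blah blah "
    "blah Cause: Cause Text 4 follows<br/>"
)

descriptions = re.findall(r"Description:(.*?)<br/>", text)
causes = re.findall(r"Cause:(.*?)<br/>", text)

# The two lists differ in length because Text 3 has no Cause.
print(len(descriptions), len(causes))

# zip() pairs items by position, so the missing cause shifts later ones up.
pairs = [(d.strip(), c.strip()) for d, c in zip(descriptions, causes)]
for pair in pairs:
    print(pair)
```

Pairing by position attaches the cause for Text 4 to Text 3, which is why the two lists cannot simply be placed side by side.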

Thanks for any input or guidance on how to do this. I'm very new to Python!

Source

2017-02-16 0Ajax0

Try this regex: https://regex101.com/r/bRIOev/1

This is what I came up with.

r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?"

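This expression matches a Description together with an optional Cause in a single match, so each description stays "zipped" to its own cause. A description with no cause gets an empty string instead of taking the next description's cause.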
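For readers who find the one-line pattern dense, the same expression can be written with `re.VERBOSE` and a comment per part. This is only an annotated sketch; the names `PATTERN` and `sample` are illustrative and not part of the original answer.

```python
import re

# Same expression as above, split into commented parts.
PATTERN = re.compile(
    r"""
    Description:(.*?)<br/>          # group 1: description text up to the next <br/>
    (?:(?!Cause)(?!Description).)*  # skip filler, stopping before either keyword
    (?:Cause:(.*?)<br/>)?           # group 2, optional: cause text if one follows
    """,
    re.VERBOSE,
)

# Hypothetical two-record input: the first record has no Cause.
sample = "Description: A <br/> filler Description: B <br/> filler Cause: C <br/>"

matches = PATTERN.findall(sample)
print(matches)
```

The middle part advances one character at a time and refuses to step onto the start of `Cause` or `Description`, so a match can never swallow the next record's cause.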

import re
import pandas
data = re.findall(r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?", textfile)  # one (description, cause) tuple per match
df = pandas.DataFrame(data, columns=("Description", "Cause"))
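As an end-to-end check, the sketch below runs the same pattern on the sample from the question, with line breaks as shown there. Two things are my own additions rather than part of the answer: `re.DOTALL`, so that `.` also matches the newlines inside a record, and the `tidy` helper, which collapses surrounding and internal whitespace in the captured text.

```python
import re

import pandas as pd

# Sample input from the question, including its line breaks.
text = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 \n"
    "follows here <br/>Description: Text 2 follows <br/> blah blah \n"
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> \n"
    "blah blah blah Description: Text 4 follows <br/> blah blah \n"
    "blah Cause: Cause Text 4 follows<br/>"
)

# re.DOTALL lets "." cross newlines (an assumption about the file layout).
PATTERN = re.compile(
    r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?",
    re.DOTALL,
)


def tidy(s):
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return " ".join(s.split())


# findall returns one (description, cause) tuple per match; cause is ""
# when the optional group did not take part in the match.
rows = [(tidy(d), tidy(c)) for d, c in PATTERN.findall(text)]
df = pd.DataFrame(rows, columns=["Description", "Cause"])
print(df)
```

Without `re.DOTALL`, a record whose text wraps onto a new line would lose its cause, because `.` stops at the line break. If the file really is a single line, the flag makes no difference.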

Source

2017-02-16 07:34:05

Perfect :) Thank you! – 0Ajax0
