2017-02-16 51 views

Extract multiple patterns and save them to a pandas DataFrame [Python]

My text file looks like this:

Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 
follows here <br/>Description: Text 2 follows <br/> blah blah 
blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> 
blah blah blah Description: Text 4 follows <br/> blah blah 
blah Cause: Cause Text 4 follows<br/> 

I want to collect all the Descriptions and Causes into a pandas DataFrame, in a structured format suitable for NLP:

Description    Cause 
Text 1 follows  Cause Text 1 follows here 
Text 2 follows  Cause Text 2 follows here 
Text 3 follows  
Text 4 follows  Cause Text 4 follows here 

What I have done so far:

re.findall(r'Description:(.*?)<br/>',textfile) 
re.findall(r'Cause:(.*?)<br/>',textfile) 

However, this does not let me line up (pad) the Descriptions and Causes when I try to build a larger DataFrame!
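To make the mismatch concrete, here is a minimal sketch. It assumes the sample above is one long line (the line breaks being display wrapping only), and the variable name `sample` is hypothetical. The two separate `re.findall` calls return lists of different lengths, so zipping them pairs Text 3 with the cause that belongs to Text 4.

```python
import re

# Assumption: the sample text from the question, joined into a single line.
sample = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 "
    "follows here <br/>Description: Text 2 follows <br/> blah blah "
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> "
    "blah blah blah Description: Text 4 follows <br/> blah blah "
    "blah Cause: Cause Text 4 follows<br/> "
)

# Two independent searches: nothing ties a cause to its description.
descriptions = [d.strip() for d in re.findall(r"Description:(.*?)<br/>", sample)]
causes = [c.strip() for c in re.findall(r"Cause:(.*?)<br/>", sample)]

# The lists have different lengths because Text 3 has no Cause.
print(len(descriptions), len(causes))

# zip() silently truncates and shifts: the third pair is wrong.
misaligned = list(zip(descriptions, causes))
print(misaligned[2])
```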

Thanks for any input or guidance on how to do this. I am very new to Python!


Try [this regex](https://regex101.com/r/bRIOev/1) –

Answer


Here is what I came up with.

r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?" 

If you use this expression, which matches a Description together with an optional Cause, each description stays correctly "zipped" with its own cause.
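For readers new to regular expressions, the same pattern can be written with `re.VERBOSE` so that each part carries a comment. This is only an annotated sketch of the expression above; the names `PAIR_RE` and `demo` are hypothetical, and the short `demo` string is made up for illustration.

```python
import re

PAIR_RE = re.compile(
    r"""
    Description:(.*?)<br/>          # group 1: description text, up to the next <br/>
    (?:(?!Cause)(?!Description).)*  # skip filler one character at a time, stopping
                                    # before the next "Cause" or "Description"
    (?:Cause:(.*?)<br/>)?           # group 2 (optional): cause text, if a Cause
                                    # appears before the next Description
    """,
    re.VERBOSE,
)

# Made-up input: the first description has no cause, the second has one.
demo = "Description: A <br/> filler Description: B <br/> x Cause: C<br/>"

# findall returns one (description, cause) tuple per Description;
# a missing cause shows up as an empty string.
pairs = PAIR_RE.findall(demo)
print(pairs)
```

Because the filler part refuses to step over the word "Description", a cause can only be attached to the description immediately before it.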

import re
import pandas
data = re.findall(r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?", textfile)
df = pandas.DataFrame(data, columns=("Description", "Cause"))
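The captured groups still include surrounding spaces, and if the file really contains the line breaks shown in the question, `.` will not match across them by default. The sketch below is one possible way to handle both, under the assumption that the text is wrapped as displayed: it adds `re.DOTALL`, collapses whitespace, and turns a missing cause into `None`. The names `PATTERN`, `extract_rows` and `sample` are hypothetical.

```python
import re

# Same expression as above; re.DOTALL (an addition, not part of the original
# answer) lets "." match newlines so wrapped entries are still captured.
PATTERN = re.compile(
    r"Description:(.*?)<br/>(?:(?!Cause)(?!Description).)*(?:Cause:(.*?)<br/>)?",
    re.DOTALL,
)


def extract_rows(text):
    """Return a list of (description, cause) tuples; cause is None if absent."""
    rows = []
    for description, cause in PATTERN.findall(text):
        # Collapse runs of whitespace (including newlines) into single spaces.
        description = " ".join(description.split())
        cause = " ".join(cause.split())
        rows.append((description, cause or None))
    return rows


# Assumption: the sample from the question, with its line breaks kept.
sample = (
    "Description: Text 1 follows <br/> blah blah blah Cause: Cause Text 1 \n"
    "follows here <br/>Description: Text 2 follows <br/> blah blah \n"
    "blah Cause: Cause Text 2 follows here<br/>Description: Text 3 follows <br/> \n"
    "blah blah blah Description: Text 4 follows <br/> blah blah \n"
    "blah Cause: Cause Text 4 follows<br/> \n"
)

rows = extract_rows(sample)
for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pandas.DataFrame(rows, columns=["Description", "Cause"])` in the same way as `data` above.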

Perfect :) Thank you! – 0Ajax0