Extracting information from text files with regex and/or Python

I'm working with a large number of files (about 4 GB worth), each containing anywhere between 1 and 100 entries in the following format (an entry is everything between two *** lines):

*** 
Type:status 
Origin: @z_rose yes 
Text: yes 
URL: 
ID: 95482459084427264 
Time: Mon Jul 25 08:16:06 CDT 2011 
RetCount: 0 
Favorite: false 
MentionedEntities: 20776334 
Hashtags: 
*** 
*** 
Type:status 
Origin: @aaronesilvers text 
Text: text 
URL: 
ID: 95481610861953024 
Time: Mon Jul 25 08:12:44 CDT 2011 
RetCount: 0 
Favorite: false 
MentionedEntities: 2226621 
Hashtags: 
*** 
*** 
Type:status 
Origin: @z_rose text 
Text: text and stuff 
URL: 
ID: 95480980026040320 
Time: Mon Jul 25 08:10:14 CDT 2011 
RetCount: 0 
Favorite: false 
MentionedEntities: 20776334 
Hashtags: 
***

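Now I'd like to import these entries into pandas for mass analysis, but obviously I first have to convert them into a format pandas can handle. So what I want to write is a script that converts the above into a .csv that looks like this (User is the file title):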

User Type Origin    Text URL ID    Time       RetCount Favorite MentionedEntities Hashtags 
4012987 status @z_rose yes   yes Null 95482459084427264 Mon Jul 25 08:16:06 CDT 2011 0   false 20776334   Null 
4012987 status @aaronsilvers text text Null 95481610861953024 Mon Jul 25 08:12:44 CDT 2011 0   false 2226621   Null

(The formatting isn't perfect, but hopefully you get the idea.)

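I already have some code that works on the assumption that the information regularly comes in chunks of 12 lines, but unfortunately some files contain several blank lines inside some fields. What I basically want to do is: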

fields[] =['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags'] 
starPair = 0; 
User = filename; 
read(file) 
#Determine if the current entry has ended 
if(stringRead=="***"){ 
    if(starPair == 0) 
     starPair++; 
    if(starPair == 1){ 
     row=row++; 
     starPair = 0; 
    } 
} 
#if string read matches column field 
if(stringRead == fields[]) 
    while(strRead != fields[]) #until next field has been found 
     #extract all characters into correct column field

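However, the problem is that some fields can themselves contain the words in fields[]. I could check for a \n character first, which would greatly reduce the number of faulty entries, but it would not eliminate them.

Can anyone point me in the right direction?

Thanks in advance!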

2017-05-31 user3394131

Where does User come from? – depperm

Oh, my bad: User is extracted from the text file name (all text files are named by user ID). – user3394131

Maybe just try splitting on "***" and then splitting each result on newlines? Join them into one string and print that to a text file. – Eswemenasja
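The split-then-split idea from the comment above could be sketched roughly as follows. This is only an illustration, not the commenter's code: the helper name `split_entries` and the shortened sample are made up here, and it assumes that "***" never appears inside a field value.

```python
def split_entries(text):
    """Split raw file text on '***' and return each entry as a list of non-empty lines."""
    entries = []
    for chunk in text.split("***"):
        # Drop blank lines and surrounding whitespace inside each chunk
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if lines:  # skip the empty chunks between a closing and an opening marker
            entries.append(lines)
    return entries


# Shortened sample in the question's format (hypothetical, fewer fields)
sample = """***
Type:status
Origin: @z_rose yes
Text: yes
***
***
Type:status
Origin: @aaronesilvers text
Text: text
***
"""

for entry in split_entries(sample):
    print(" | ".join(entry))
```

Each entry comes back as a list of lines, so a blank line inside a field is simply dropped; a value that really continues on the next line would still need to be glued back onto its field.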

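Your code/pseudocode doesn't look like Python, but since you used the python tag, this is how I would do it. First, read the file into a string, then loop over each field and build a regex that finds the value following it, push the results into a 2D list, and finally write that 2D list out as a CSV. Also, your CSV looks more like a TSV (tab-separated rather than comma-separated).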

import re
import csv

filename = '4012987'
User = filename

# read your file into a string
with open(filename, 'r') as myfile:
    data = myfile.read()

fields = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]

# for each field use regex to get the entry
for n, field in enumerate(fields):
    matches = re.findall(field + r':\s?([^\n]*)\n+', data)
    # this should run only the first time to fill your 2d list with the right amount of lists
    while len(csvTemplate) <= len(matches):
        csvTemplate.append([None] * (len(fields) + 1))  # Null isn't a python reserved word
    for e, m in enumerate(matches):
        if m != '':
            csvTemplate[e + 1][n + 1] = m.strip()
# set the User column
for i in range(1, len(csvTemplate)):
    csvTemplate[i][0] = User
# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
# note: "wb" is the Python 2 mode; on Python 3 open the file with "w" and newline="" instead
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(csvTemplate)

2017-05-31 15:13:00 depperm

My laptop battery isn't working properly; I hope to be able to test this over the weekend! Thanks for the answer in any case! – user3394131

Hi, my laptop is fixed now, sorry for the late reply. I had to change "wb" to "w" because it wouldn't run otherwise. Thanks! – user3394131

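Just wanted to follow up: it seems to work perfectly. I actually ended up with almost 20 GB of data, and every sample I tested came out perfect. Thank you very much! – user3394131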
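On the "wb" versus "w" point raised in the comments: in Python 3 the csv module writes text, so the file should be opened in text mode, and the csv documentation recommends passing newline="" so that no extra blank lines appear on Windows. The sketch below shows this together with the tab-separated variant mentioned in the answer. The helper name `write_rows`, the sample rows and the file name "output.tsv" are illustrative assumptions.

```python
import csv
import io

# Hypothetical rows: a header plus one record, trimmed to three columns
rows = [["User", "Type", "Text"], ["4012987", "status", "yes"]]


def write_rows(handle, rows, delimiter=","):
    """Write rows to an already opened text-mode handle."""
    writer = csv.writer(handle, delimiter=delimiter)
    writer.writerows(rows)


# In-memory demonstration so the sketch runs without touching the disk
buf = io.StringIO(newline="")
write_rows(buf, rows, delimiter="\t")
print(buf.getvalue())

# On disk (Python 3), the equivalent call would look like:
# with open("output.tsv", "w", newline="") as f:
#     write_rows(f, rows, delimiter="\t")
```

A tab delimiter is convenient here because the Time and Text values contain spaces, and tweet text may well contain commas.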

You could use a combination of regular expressions and a dict comprehension:

import regex as re, pandas as pd

# one match per entry: everything between a pair of *** lines
rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
# one match per "key: value" line inside an entry
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

# your_string_here holds the full text of one input file
result = ({m.group('key'): m.group('value')
           for m in rx_entry.finditer(part.group(0))}
          for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)

Which yields

Favorite Hashtags     ID MentionedEntities    Origin \ 
0 false   95482459084427264   20776334   @z_rose yes 
1 false   95481610861953024   2226621 @aaronesilvers text 
2 false   95480980026040320   20776334   @z_rose text 

    RetCount   Text       Time Type URL 
0  0    yes Mon Jul 25 08:16:06 CDT 2011 status  
1  0   text Mon Jul 25 08:12:44 CDT 2011 status  
2  0 text and stuff Mon Jul 25 08:10:14 CDT 2011 status

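Explanation:

Divide the string into separate parts, each surrounded by *** on both sides
Look for a key-value pair on each line
Put all pairs into a dictionary

We end up with a generator of dictionaries, which we then feed into pandas.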

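Hints:

The code has not been tested with large amounts of data, and certainly not with 4 GB. Also, you need the newer regex module for the expression to work.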
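Since memory use on multi-gigabyte input is the open concern here, one alternative is to read the files line by line with the standard library only, so that no file has to be held in memory as a single string. The following is a rough sketch under stated assumptions, not part of the answer above: the name `iter_entries` is made up, the field list is taken from the question, and the fields are assumed to always appear in that order. A continuation line that happens to start with a later field name would still be misread, so this reduces the faulty entries the question mentions but does not rule them out.

```python
import io

FIELDS = ["Type", "Origin", "Text", "URL", "ID", "Time", "RetCount",
          "Favorite", "MentionedEntities", "Hashtags"]
ORDER = {name: i for i, name in enumerate(FIELDS)}


def iter_entries(lines):
    """Lazily yield one dict per entry from an iterable of lines."""
    entry, current = None, None
    for raw in lines:
        line = raw.strip()
        if line == "***":
            if entry is None:      # opening marker
                entry = {}
            else:                  # closing marker
                if entry:
                    yield entry
                entry = None
            current = None
            continue
        if entry is None:          # ignore anything outside an entry
            continue
        key, sep, value = line.partition(":")
        # Accept a new field only if it is known and comes later in the assumed order
        if sep and key in ORDER and (current is None or ORDER[key] > ORDER[current]):
            current = key
            entry[key] = value.strip()
        elif current is not None and line:
            # Otherwise treat the line as a continuation of the current field
            entry[current] = (entry[current] + " " + line).strip()


# Hypothetical sample with a blank line inside the Text field
sample = io.StringIO("""***
Type:status
Origin: @z_rose yes
Text: text and

stuff
URL:
ID: 95480980026040320
***
""")

rows = list(iter_entries(sample))
print(rows)
```

Because an open file object is itself an iterable of lines, something like `pd.DataFrame(iter_entries(f))` inside a `with open(...) as f:` block should work per file, with the User column added from the file name afterwards.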

2017-05-31 17:06:29 Jan

My laptop battery isn't working properly; I hope to be able to test this over the weekend! Thanks for the answer in any case! – user3394131

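I had to order a new battery and my laptop is finally working again, sorry for the late reply. However, I ran into the following error:
A:\Programmas\Anaconda\lib\sre_parse.py in _parse(source, state)
    760     break
    761 if char not in FLAGS:
--> 762     raise source.error("unknown flag", len(char))
    763 verbose = state.flags & SRE_FLAG_VERBOSE
    764 continue
I can't seem to figure out how to fix it. – user3394131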
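A possible reading of this error, offered as a guess: the traceback comes from sre_parse.py, which belongs to the standard library's re module, so the pattern was probably compiled with re and not with the third-party regex module. Scoped inline flags such as (?s:...) were only added to the standard re module in Python 3.6, and older versions reject them with "unknown flag". A sketch that avoids the scoped flag by passing re.DOTALL to the whole entry pattern, and so runs on the standard library alone, might look like this. The shortened sample text is hypothetical, and tolerating trailing spaces or tabs after *** is an addition that is not in the answer's pattern.

```python
import re

# DOTALL lets "." span newlines for the whole pattern, replacing the scoped (?s:...)
rx_parts = re.compile(r'^\*\*\*[ \t]*$.*?^\*\*\*[ \t]*$', re.MULTILINE | re.DOTALL)
# No DOTALL here, so a value stops at the end of its own line
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

# Shortened sample in the question's format (hypothetical, fewer fields)
text = """***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
***
"""

records = [
    {m.group('key'): m.group('value').strip()
     for m in rx_entry.finditer(part.group(0))}
    for part in rx_parts.finditer(text)
]

for record in records:
    print(record)
```

As in the original pattern, a field with nothing after the colon (URL in this sample) produces no key at all, so it would show up as a missing value once the records are loaded into a DataFrame.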
