2017-08-14 65 views

Reading a text file into a DataFrame in the desired format

I have a txt file that looks like this:

Alabama[edit] 
    Auburn (Auburn University, Edward Via College of Osteopathic Medicine) 
    Birmingham (University of Alabama at Birmingham, Birmingham School of 
    Alaska[edit] 
    Anchorage[21] (University of Alaska Anchorage) 
    Fairbanks (University of Alaska Fairbanks)[16] 

I want to read the txt file into a DataFrame that looks like this:

state  county 
Alabama Auburn 
Alabama Birmingham 
Alaska Anchorage 
Alaska Fairbanks 

What I have so far:

import re
import pandas as pd

# compile the patterns once, outside the loop
state_pattern = re.compile(r'\[edit\]')
county_pattern = re.compile(r'\(')  # a literal "(" must be escaped

df_university_towns = pd.DataFrame(columns=['State', 'RegionName'])

with open('university_towns.txt', 'r') as university_towns:
    # loop over each line of the file object and
    # determine whether each line is a state or a county:
    # if the line has [edit], it's a state
    for line in university_towns:
        state_pattern_m = state_pattern.search(line)
        county_pattern_m = county_pattern.search(line)
        if state_pattern_m:
            # extract everything before [edit]
            end_position = state_pattern_m.start()
            state_name = line[0:end_position].strip()
            print(state_name)
        if county_pattern_m:
            # extract everything before (
            county_name = line[0:county_pattern_m.start()].strip()
            print(county_name)

This code only gives me this:

State County 
Alabama Auburn 
     Birmingham 
. 
. 
. 

Answers


This should do it:

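key = None 

# t is the open file object (or any iterable of lines)
for line in t: 
    if '[edit]' in line: 
        # state header: remember the name for the lines that follow
        key = line.replace('[edit]', '').strip() 
        continue 
    if key: 
        # Use a regex to extract what you need 
        print(key, line.strip().split(' ')[0]) 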

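I don't know exactly what your data looks like, so adjust the regex to strip the [] from the header line (I am guessing it is a header), and possibly use a regex in place of the '[edit]' in line check.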
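
To get from the printed pairs to the DataFrame asked for in the question, the same loop can collect rows instead of printing them. The following is only a rough sketch, not the asker's or the answerer's exact code: the function name parse_university_towns and the FOOTNOTE pattern are made up for illustration, and it assumes every state line contains [edit], that the town name is whatever precedes the first "(", and that markers such as [21] are footnotes to drop.

```python
import re

import pandas as pd

# Hypothetical helper pattern: matches footnote markers such as [21] or [16]
FOOTNOTE = re.compile(r'\[\d+\]')


def parse_university_towns(lines):
    """Return a DataFrame with one (State, RegionName) row per town line."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if '[edit]' in line:
            # State header: remember it for the town lines that follow
            state = line.replace('[edit]', '').strip()
            continue
        if state is not None:
            # Town line: keep the text before the first "(" and drop footnotes
            town = FOOTNOTE.sub('', line.split('(')[0]).strip()
            rows.append((state, town))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])


# Sample lines taken from the question (complete lines only)
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University, Edward Via College of Osteopathic Medicine)',
    'Alaska[edit]',
    'Anchorage[21] (University of Alaska Anchorage)',
    'Fairbanks (University of Alaska Fairbanks)[16]',
]
df = parse_university_towns(sample)
print(df)
```

For the real file, pass the open file object instead of the sample list, for example inside a with open('university_towns.txt') as f: block. Collecting tuples in a list and building the DataFrame once at the end is usually simpler and faster than appending to a DataFrame row by row.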