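Python - How to split text input into separate elements

The input will be inconsistent about line breaks, so I can't use the newline as a delimiter. The text will arrive in the following format:

IDNumber FirstName LastName Score Letter Location

IDNumber: 9 digits

Score: 0-100

Letter: A or B

Location: can be anything from an abbreviated state name to a fully spelled-out city and state. This is optional.

Example: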

123456789 John Doe 90 A New York City 987654321 
Jane Doe 70 B CAL 432167895 John 

Cena 60 B FL 473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

The elements are:

123456789 John Doe 90 A New York City 
987654321 Jane Doe 70 B CAL 
432167895 John Cena 60 B FL 
473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

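I need to access each element individually for each person. So, for the John Cena object, I need to be able to access ID: 432167895, first name: John, last name: Cena, letter (A or B): B. I don't really need the location, but it will be part of the input.

Edit: I should mention that I'm not allowed to import any modules, such as regular expressions.

Source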

2017-04-19 Jackson Blankenship

If the input is a string, I would start by [splitting the string on whitespace](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python).
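As a rough illustration of that suggestion: calling `str.split()` with no arguments splits on any run of whitespace, so stray newlines in the input disappear without importing anything. The sample text is a shortened version of the example in the question.

```python
# Shortened sample with an inconsistent line break, as in the question
text = "123456789 John Doe 90 A New York City 987654321 \nJane Doe 70 B CAL"

# With no argument, split() treats spaces, tabs and newlines alike
tokens = text.split()
print(tokens)
```

The remaining work is then deciding which tokens belong to which record, since the location can span zero or more tokens.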

There may be a more elegant way to do this, but here is one idea based on the example input string.

text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on any run of whitespace (spaces, tabs, newlines)
tokens = text.split()

# Collect the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    record = {}
    # The first five tokens of a record are fixed
    record['id'] = tokens[i]
    record['fname'] = tokens[i + 1]
    record['lname'] = tokens[i + 2]
    record['score'] = tokens[i + 3]
    record['letter'] = tokens[i + 4]
    i = i + 5
    # The optional location runs until the next all-digit token (the next ID)
    # or the end of the input
    location = []
    while i < end and not tokens[i].isdigit():
        location.append(tokens[i])
        i = i + 1

    # An empty string means no location was given
    record['location'] = " ".join(location)
    data['output'].append(record)

print(data)

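Source

2017-04-19 21:54:41

This works perfectly unless no location is specified! Do you know how to fix that? The location element is optional.

I made an update that simply puts an empty string in the location when nothing is there.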
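To show the optional location and the per-person access the question asks for, here is a minimal sketch that uses no imports. The names `parse_records` and `by_id` and the dictionary keys are illustrative, and the sketch assumes that only an ID is ever a token of exactly nine digits.

```python
def parse_records(text):
    """Split whitespace-separated text into one dict per person (no imports)."""
    tokens = text.split()
    fields = ("id", "fname", "lname", "score", "letter")
    records = []
    i = 0
    # Stop if fewer than five tokens remain (truncated tail)
    while i + 5 <= len(tokens):
        record = dict(zip(fields, tokens[i:i + 5]))
        i += 5
        # Assumption: a nine-digit token always starts the next record
        location = []
        while i < len(tokens) and not (len(tokens[i]) == 9 and tokens[i].isdigit()):
            location.append(tokens[i])
            i += 1
        record["location"] = " ".join(location)  # '' when no location is given
        records.append(record)
    return records


sample = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

people = parse_records(sample)
# Index by ID so one person can be looked up directly
by_id = {person["id"]: person for person in people}
cena = by_id["432167895"]
print(cena["fname"], cena["lname"], cena["letter"])  # John Cena B
print(repr(by_id["473829105"]["location"]))          # ''
```

Keying on the ID rather than the name avoids collisions between people who share a name.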

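You can use a regular expression that requires each record to start with a 9-digit number, groups words together where necessary, and skips the location:

import re
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

The result is: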

[('123456789', 'John', 'Doe', '90', 'A'), 
('987654321', 'Jane', 'Doe', '70', 'B'), 
('432167895', 'John', 'Cena', '60', 'B'), 
('473829105', 'Donald', 'Trump', '70', 'E'), 
('098743215', 'Bernie', 'Sanders', '92', 'A')]
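One possible follow-up, sketched on the assumption that `data` holds the example text from the question: pair each tuple with field names so that a person's fields can be read by key. The field names and the `by_id` dictionary are illustrative and not part of the answer.

```python
import re

data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"
fields = ("id", "fname", "lname", "score", "letter")

# Turn each 5-tuple into a dict and index the dicts by ID
by_id = {row[0]: dict(zip(fields, row)) for row in re.findall(pattern, data)}
print(by_id["432167895"])
```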

Source

2017-04-19 21:17:14 trincot

Since splitting on whitespace does not help identify the location, I would go straight to a regular expression:

import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = re.findall(search_string, input_string)

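This produces:

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
('432167895', 'John', 'Cena', '60', 'B', 'FL')]

Only three of the five records match: the pattern accepts only A or B as the letter and requires a location followed by at least one non-word character, so the record with letter E and the final record at the very end of the input are skipped.

Explanation of the groups in the regular expression:

ID: 9 digits (followed by at least one whitespace)
First and last name: 2 separate groups of letters (each followed by at least one whitespace)
Score: one, two or three digits (followed by at least one whitespace)
Letter: A or B (followed by at least one whitespace)
Location: a group of characters (followed by at least one whitespace)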
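The group-by-group description can also be written into the pattern itself. This sketch restates the same regular expression with `re.VERBOSE`, which ignores whitespace outside character classes and allows `#` comments; it is meant as a readability aid and should behave the same as the one-line version.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Same groups as the one-line pattern; \W+ is the separator after each group
search_string = re.compile(r"""
    ([0-9]{9})    \W+   # ID: exactly nine digits
    ([a-zA-Z ]+)  \W+   # first name
    ([a-zA-Z ]+)  \W+   # last name
    ([0-9]{1,3})  \W+   # score: one to three digits
    ([AB])        \W+   # letter: A or B only
    ([a-zA-Z ]+)  \W+   # location: letters and spaces
""", re.VERBOSE)

person_list = search_string.findall(input_string)
for person in person_list:
    print(person)
```

Because the location group and its trailing separator are mandatory here, records without a location, or at the very end of the input, are not matched.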

Source

2017-04-19 21:20:40

Since you know the ID number comes at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID numbers:

# Assuming your file is read in as a string s:
import re
# Split on whitespace that is immediately followed by a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s)

# record_locator will end up holding your records as: {'<full name>' -> {'ID'-><ID value>, 'FirstName'-><FirstName value>, 'LastName'-><LastName value>, 'Letter'-><LetterValue>}, 'full name 2'->{...} ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no argument also absorbs newlines inside a record
    values = record.split()[:5]

    # Discard the int after the name eg. 90 in the first record
    del values[3]

    # Create a new entry for the full name. This will overwrite entries with the same name so you might want to use a unique id instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then access the information:

print(record_locator['John Doe']['ID']) # 123456789

Source

2017-04-19 21:26:00 sgrg

I think trying to split on the 9-digit number is probably the best option.

import re 

with open('data.txt') as f: 
    data = f.read() 
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data) 
    results = list(filter(None, results)) 
    print(results)

This gives me these results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
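Each chunk still contains irregular whitespace, so a further step is needed to reach the individual fields. A minimal sketch, assuming `results` is the list printed above; the dictionary keys are illustrative:

```python
# Chunks as printed above
results = ['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']

people = []
for chunk in results:
    parts = chunk.split()  # collapses spaces and newlines within the chunk
    people.append({
        "id": parts[0],
        "fname": parts[1],
        "lname": parts[2],
        "score": int(parts[3]),
        "letter": parts[4],
        "location": " ".join(parts[5:]),  # empty string when absent
    })

print(people[2])
```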

Source

2017-04-19 21:31:32 davidejones
