How to parse a CSV with commas between parentheses and missing values

I am trying to use pyparsing to parse a CSV with:

Commas between parentheses (or brackets, etc.): "a(1,2),b" should return the list ["a(1,2)","b"]
Missing values: "a,b,,c," should return the list ['a','b','','c','']

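I have a working solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics. That is, I feel it should go somewhere else, for example in the optional arguments of delimitedList, but in my trial and error this was the only place where it worked and made sense. It could go in any of the possible atomics, so I chose the first.

Also, I don't fully understand what originalTextFor does, but if I remove it, the parser stops working.

Working example: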

import pyparsing as pp 

# Function that parses a line of columns separated by commas and returns a list of the columns 
def fromLineToRow(line): 
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[",closer="]") # matches "a[1,2]" 
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(",closer=")")  # matches "a(1,2)" 
    # In the following line: 
    # * The "^" means "choose the longest option" 
    # * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values 
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col)))^pp.originalTextFor(pp.OneOrMore(sqbrackets_col)) 

    grammar = pp.delimitedList(atomic) 

    row = grammar.parseString(line).asList() 
    return row 

# Sample input kept inline so the example is self-contained
file_str = """YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""

for line in file_str.splitlines(): 
    row = fromLineToRow(line) 
    print(row)

This prints:

['YEAR', 'a(2,3)', 'b[3,4]'] 
['1960', '2.8', '3'] 
['1961', '4', ''] 
['1962', '', '1'] 
['1963', '1.27', '3']

Is this the right way to do it? Is there a "cleaner" way to use the Optional than putting it inside the first atomic?

Source

2017-05-31 Alechan

Working from the inside out, I get this:

import pyparsing as pp

# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")

# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[", closer="]") | pp.nestedExpr(opener="(", closer=")")

# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))

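originalTextFor takes the original input text between the leading and trailing boundaries of the matched expression and returns it as a single string. If you leave it out, you get all the sub-expressions as a nested list of strings, like ['a',['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out any whitespace (or use ' '.join, which has the opposite problem of potentially introducing whitespace).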
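To make the difference concrete, here is a minimal sketch that parses the same field with and without originalTextFor. The names `field`, `nested` and `flat` are only illustrative, and the sketch assumes a pyparsing version that still accepts the camelCase names used in this thread.

```python
import pyparsing as pp

non_grouped = pp.Word(pp.printables, excludeChars="[](),")
grouped = pp.nestedExpr(opener="[", closer="]") | pp.nestedExpr(opener="(", closer=")")
field = pp.OneOrMore(non_grouped | grouped)

# Without originalTextFor: nestedExpr groups its contents into a sub-list
nested = field.parseString("a(2,3)").asList()
print(nested)  # ['a', ['2,3']]

# With originalTextFor: the matched slice of the input comes back as one string
flat = pp.originalTextFor(field).parseString("a(2,3)").asList()
print(flat)  # ['a(2,3)']
```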

To make the elements of the list optional, just say so in the definition of the delimited list:

grammar = pp.delimitedList(pp.Optional(atomic, default=''))

Be sure to add the default value, otherwise the empty slots will simply be dropped.

With these changes I get:

['YEAR', 'a(2,3)', 'b[3,4]'] 
['1960', '2.8', '3'] 
['1961', '4', ''] 
['1962', '', '1'] 
['1963', '1.27', '3']
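For reference, the fragments of this answer can be assembled into one runnable script. This is a sketch under the same assumptions as the thread (pyparsing with the camelCase API); the function name `from_line_to_row` is a renamed stand-in for the question's `fromLineToRow`, and the grammar is built once at module level rather than on every call.

```python
import pyparsing as pp

# Characters outside any brackets; ',' is excluded so it can act as delimiter
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# Bracketed groups; commas inside them are consumed by nestedExpr
grouped = pp.nestedExpr(opener="[", closer="]") | pp.nestedExpr(opener="(", closer=")")
# One field, returned as the original text it matched
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# The Optional lives on the list element, with '' as placeholder for a missing value
grammar = pp.delimitedList(pp.Optional(atomic, default=""))


def from_line_to_row(line):
    """Parse one comma-separated line into a list of column strings."""
    return grammar.parseString(line).asList()


for sample in ["a(1,2),b", "a,b,,c,", "1962,,1"]:
    print(from_line_to_row(sample))
```

With the two example lines from the question, this yields ['a(1,2)', 'b'] and ['a', 'b', '', 'c', ''].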

Source

2017-05-31 19:08:08 PaulMcG

For parse-time conversion of numeric values, change 'atomic' to: 'atomic = pp.pyparsing_common.number | pp.originalTextFor(...etc.)'. – PaulMcG

What about using a regular expression with re, for example:
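A minimal sketch of what that comment describes, assuming the `non_grouped` and `grouped` definitions from the answer above; the names `typed_atomic` and `typed_grammar` are made up for this illustration. One caveat that follows from trying the number first: a field that merely starts with digits, such as "2a", would match only its numeric prefix, so passing parseAll=True to parseString may be worth considering to surface such cases as errors.

```python
import pyparsing as pp

non_grouped = pp.Word(pp.printables, excludeChars="[](),")
grouped = pp.nestedExpr(opener="[", closer="]") | pp.nestedExpr(opener="(", closer=")")

# Try a number first (converted to int or float at parse time),
# then fall back to the raw text of the field
typed_atomic = pp.pyparsing_common.number | pp.originalTextFor(
    pp.OneOrMore(non_grouped | grouped)
)
typed_grammar = pp.delimitedList(pp.Optional(typed_atomic, default=""))

print(typed_grammar.parseString("YEAR,a(2,3),b[3,4]").asList())
print(typed_grammar.parseString("1960,2.8,3").asList())  # [1960, 2.8, 3]
print(typed_grammar.parseString("1961,4,").asList())     # [1961, 4, '']
```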

>>> import re
>>> line1 = 'a(1,2),b'
>>> line2 = 'a,b,,c,'
>>> re.split(r',\s*(?![^()]*\))', line1)
['a(1,2)', 'b']
>>> re.split(r',\s*(?![^()]*\))', line2)
['a', 'b', '', 'c', '']

Source

2017-05-31 16:08:51 haifzhan

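Parsing line1 should give ["a(1,2)", "b"] instead of ['a(1', '2)', 'b'] (commas inside parentheses should not be delimiters) – Alechan

@Alechan please see my update – haifzhan

Yes, this was my first approach before trying pyparsing, but once I started adding square brackets or any other kind of nested expression, the regex became more and more obscure – Alechan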
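As a point of comparison for the concern raised in that last comment, here is a sketch of a plain-Python splitter that tracks bracket depth instead of using a regex. It is not from the thread: the function name `split_top_level` is hypothetical, and it assumes brackets are balanced (it does not validate them).

```python
def split_top_level(line, delim=","):
    """Split on delimiters that are not inside (...) or [...] groups."""
    closers = {"(": ")", "[": "]"}
    stack = []      # closing brackets we are still waiting for
    fields = []
    current = []
    for ch in line:
        if ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        if ch == delim and not stack:
            # top-level delimiter: close the current field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_top_level("a(1,2),b"))            # ['a(1,2)', 'b']
print(split_top_level("a,b,,c,"))             # ['a', 'b', '', 'c', '']
print(split_top_level("f(a,[b,(c,d)],e),x"))  # ['f(a,[b,(c,d)],e)', 'x']
```

Supporting another bracket type only needs a new entry in `closers`, which is the part that tends to get awkward in a single regex.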

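import re

with open('44289614.csv') as f:
    for line in map(str.strip, f):
        # split on commas, unless the next bracket character after the
        # comma is a closing ')' or ']' (i.e. the comma is inside a group)
        l = re.split(r',\s*(?![^()\[\]]*[)\]])', line)
        print(len(l), l)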

Output:

3 ['YEAR', 'a(2,3)', 'b[3,4]'] 
3 ['1960', '2.8', '3'] 
3 ['1961', '4', ''] 
3 ['1962', '', '1'] 
3 ['1963', '1.27', '3']
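A lookahead split of this kind can be easier to read when written with re.VERBOSE. The sketch below uses the sample rows inline instead of a file; `SPLIT_RE` and `split_row` are illustrative names, not part of the original answer. Note that the lookahead only inspects the text up to the next bracket character, so it copes with one level of brackets; a nested field such as f(a,(b,c)) would still be split inside the outer parentheses.

```python
import re

SPLIT_RE = re.compile(
    r"""
    ,\s*                # a comma and any whitespace after it
    (?!                 # ...as long as it is NOT followed by
        [^()\[\]]*      #    a run of non-bracket characters
        [)\]]           #    and then a closing bracket
    )
    """,
    re.VERBOSE,
)


def split_row(line):
    """Split one line on commas that are outside (...) and [...] groups."""
    return SPLIT_RE.split(line.strip())


rows = ["YEAR,a(2,3),b[3,4]", "1961,4,", "1962,,1"]
for row in rows:
    fields = split_row(row)
    print(len(fields), fields)
```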

Modified from this answer.

I also like this answer, which suggests slightly modifying the input and using the csv module's quotechar.

Source

2017-05-31 16:59:47
