2017-09-26 85 views
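Using awk to parse a delimited file whose text contains the delimiter

I am trying to process a messy 4GB txt file in a mixed csv format. The data has about 38 columns delimited by '","' (example below). It was exported with commas as the field separator, but there are also commas inline within the data, which makes it hard to import into most platforms. I believe I can fix the data with awk/sed/cat, since every column is enclosed in quotes; I just can't work out how to do it.

Between any two sets of quotes, I want all commas replaced with periods or something similar. The section that contains the commas is in the middle of my columns, not the last field in the data set. I have tried pulling out the section with commas using awk, replacing them with sed, and then pasting it back into the file with cat.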

The actual data is sensitive and cannot be shared, but the example below is similar.

Sample data:

"identifier","Status","Name","City","Application","Job","Details","column 39" 
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1" 
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi, beer, or ham","spare2" 
"verde","pending","Duncan","Columbus","65478","CEO","Late for work, on the fifth","spare3" 

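The desired result focuses on changing the inline commas, with the modified data either put back in place or appended at the end, after "column 39":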

"identifier","Status","Name","City","Application","Job","Details","column 39" 
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1" 
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi. beer. or ham","spare2" 
"verde","pending","Duncan","Columbus","65478","CEO","Late for work. on the fifth","spare3" 

Any suggestions are much appreciated!

Please post some of the troublesome data along with the expected output. We don't like making up test cases ourselves.

@JamesBrown Apologies for the lack of content; I have added a few example lines.

Answers

You can remove the inner commas with sed like this:

$ f1=$'"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39"' 

$ echo "$f1" |sed -r 's/([^"]),([^"])/\1\2/g' 
"column 1","Column 2","Name","Address","Application","Job","Comments about items that also have commas inline","column 39" 

Or you can replace the inner commas with a placeholder that can later be restored to commas:

$ f2=$(echo "$f1" |sed -r 's/([^"]),([^"])/\1-x2c-\2/g');echo "$f2"
"column 1","Column 2","Name","Address","Application","Job","Comments-x2c- about-x2c- items that also have-x2c- commas-x2c- inline","column 39"
#or use sed -r 's/([^"]),([^"])/\1.\2/g' to replace inner commas with dots 

$ echo "$f2" |sed 's/-x2c-/,/g' 
"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39" 

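One caveat with the single-pass substitutions above: sed matches cannot overlap, so when only one character sits between two inner commas (as in "a,b,c") the second comma is skipped. Below is a minimal sketch of a workaround, assuming every field is double-quoted. Commas directly next to a quote are still left untouched. The function name fix_inner_commas and the sample line are made up for illustration, and -E is used instead of -r because both GNU and BSD sed accept it.

```shell
# Hypothetical helper: turn commas inside quoted fields into periods.
# The substitution is repeated until nothing matches, so back-to-back
# inner commas such as a,b,c are all caught.
fix_inner_commas() {
  sed -E -e ':again' -e 's/([^"]),([^"])/\1.\2/g' -e 't again'
}

# Made-up sample line with single characters between the inner commas.
line='"id","a,b,c","end"'

# Single pass: the comma after b is missed.
printf '%s\n' "$line" | sed -E 's/([^"]),([^"])/\1.\2/g'
# prints "id","a.b,c","end"

# Looping version: both inner commas are replaced.
printf '%s\n' "$line" | fix_inner_commas
# prints "id","a.b.c","end"
```

The loop costs extra passes only on lines that actually contain inner commas, so it should not slow down a large file noticeably.
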
Or you can use GNU awk to parse fields based on "," rather than on a bare comma (FPAT is a gawk extension):

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $1}' 
"column 1" 

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $7}' 
"Comments, about, items that also have, commas, inline" 

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' -vOFS="," '{print $1,$7}' 
"column 1","Comments, about, items that also have, commas, inline" 

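A portable alternative, sketched under the assumption that every field is double-quoted and that the three-character sequence "," never occurs inside a field: treat "," itself as the field separator, so that any comma left inside a field must be an inner one. This needs no gawk extensions, so it should also run with the stock awk on OSX. The function name dot_inner_commas and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: read CSV lines on stdin, write fixed lines on stdout.
# FS and OFS are both the literal sequence ","  so splitting and rejoining
# leaves the real delimiters exactly as they were.
dot_inner_commas() {
  awk 'BEGIN { FS = OFS = "\",\"" }
       {
         # Any comma still present inside a field is an inline comma.
         for (i = 1; i <= NF; i++) gsub(/,/, ".", $i)
         print
       }'
}

# Example with one line from the question's sample data.
printf '%s\n' '"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi, beer, or ham","spare2"' |
  dot_inner_commas
```

For the real file this would be run as dot_inner_commas < messy.txt > fixed.txt. awk reads one line at a time, so the 4GB input does not have to fit in memory.
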
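Thanks George, the second example worked perfectly and I was able to understand it. I did need to switch to Linux to run it; I neglected to mention that I was running on OSX.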