Append fields from one file to another based on an ID

Basically, I need a script that solves this problem in a very short time. I have two files:

$ head -n 6 fcu.tsv

NM576455  0.324009324  0.578896174  2577 
NM539570  0.204545455  0.607877092  2247 
NM337132  0.288973384  0.673636364  792 
NM374379  0.308300395  0.42   762 
NM373443  0.263043478  0.547132867  1383 
NM371839  0.298210736  0.492857143  1512

$ head -n 6 mart.tsv

NM539570 ILMN_2199362 15  58.52 protein_coding 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_2 
NM576455 ILMN_2195138 1  65.74 protein_coding protein binding molecular_function  SAM_2 
NM576455 ILMN_1709067 1  65.74 protein_coding nucleus cellular_component  SAM_2 
NM576455 ILMN_1709067 1  65.74 protein_coding protein binding molecular_function  SAM_2 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_type1

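For each NM ID, the second, third and fourth fields of fcu.tsv need to be appended to the matching lines of mart.tsv, and this has to run quickly. The expected result: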

$ head out.tsv

NM539570 ILMN_2199362 15  58.52 protein_coding 0.204545455  0.607877092  2247 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_2 0.324009324 0.578896174  2577 
NM576455 ILMN_2195138 1  65.74 protein_coding protein binding molecular_function  SAM_2 0.324009324 0.578896174  2577 
NM576455 ILMN_1709067 1  65.74 protein_coding nucleus cellular_component  SAM_2 0.324009324 0.578896174  2577 
NM576455 ILMN_1709067 1  65.74 protein_coding protein binding molecular_function  SAM_2 0.324009324 0.578896174  2577 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_type1 0.324009324 0.578896174  2577

This is what I did in MATLAB (I would prefer an answer that fixes this code and makes it faster, rather than one that writes a new script):

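fr1 = fopen('fcu.tsv', 'r');
fr2 = fopen('mart.tsv', 'r');
fw  = fopen('out.tsv', 'w');

while ~feof(fr1)
    line = fgetl(fr1);
    scan = textscan(line, '%s%f%f%d');

    % mart.tsv is rescanned from the top for every fcu.tsv row,
    % which is what makes this approach slow on large files.
    frewind(fr2);

    while ~feof(fr2)
        line2 = fgetl(fr2);
        scan2 = textscan(line2, '%s%s%s%f%s%s%s%s');

        % Compare the IDs with strcmp: == on char arrays compares element
        % by element and raises an error when the lengths differ.
        if strcmp(scan{1}{1}, scan2{1}{1})
            % %.9g keeps the nine significant digits of the input;
            % plain %f would print only six decimals.
            fprintf(fw, '%s\t%.9g\t%.9g\t%d\n', line2, scan{2}, scan{3}, scan{4});
        end
    end
end

fclose(fr1);
fclose(fr2);
fclose(fw);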

Any help is appreciated.


2012-07-15 Loca Toney

This is a command-line-centric solution that should work on any system with coreutils; apologies if it does not fit your situation.

If mart.tsv is properly padded so that every line has the same number of fields, for example:

NM539570 ILMN_2199362 15  58.52 protein_coding NA  NA     NA      NA 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component NA      SAM_2 
NM576455 ILMN_2195138 1  65.74 protein_coding protein binding   molecular_function  SAM_2 
NM576455 ILMN_1709067 1  65.74 protein_coding nucleus cellular_component NA      SAM_2 
NM576455 ILMN_1709067 1  65.74 protein_coding protein binding   molecular_function  SAM_2 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component NA      SAM_type1

then the solution can be a simple join (see info join):

$ join <(sort mart.tsv) <(sort fcu.tsv) | column -t 
NM539570 ILMN_2199362 15 58.52 protein_coding NA  NA     NA     NA   0.204545455 0.607877092 2247 
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component NA     SAM_2  0.324009324 0.578896174 2577 
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding    molecular_function SAM_2  0.324009324 0.578896174 2577 
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA     SAM_2  0.324009324 0.578896174 2577 
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA     SAM_type1 0.324009324 0.578896174 2577 
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding    molecular_function SAM_2  0.324009324 0.578896174 2577

column comes from the bsdmainutils package.


2012-07-15 10:12:08 Thor

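Everything works, but I have two questions: 1) How did you pad mart.tsv (adding NA to every empty field)? 2) Why do you use column? The result looks odd, with many spaces between the fields; I ran it without column and the result looks better. – 2012-07-15 10:42:57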

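1) I did it manually; how you go about it depends on where the data comes from. 2) It is only there for formatting; if the result is fed into another program, it is redundant. – Thor 2012-07-15 11:34:31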
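For a file too large to pad by hand, the padding could be scripted. The following is only a minimal sketch in Python, and it rests on assumptions the thread does not confirm: that mart.tsv is genuinely tab-separated, that a missing value shows up as an empty string between two tabs, and that the target number of columns is known in advance. The name `pad_row`, the width of 8 and the two sample rows are made up for illustration.

```python
def pad_row(line, width, filler="NA"):
    """Fill empty tab-separated fields with `filler` and pad the row to `width` columns."""
    fields = line.rstrip("\n").split("\t")
    # An empty (or whitespace-only) field becomes the filler value.
    fields = [field.strip() or filler for field in fields]
    # Rows that are too short get extra filler columns at the end.
    fields += [filler] * (width - len(fields))
    return "\t".join(fields)


if __name__ == "__main__":
    # Illustrative rows only; the real column layout of mart.tsv is an assumption.
    rows = [
        "NM539570\tILMN_2199362\t15\t58.52\tprotein_coding",
        "NM576455\tILMN_2195138\t1\t65.74\tprotein_coding\tnucleus\t\tSAM_2",
    ]
    for row in rows:
        print(pad_row(row, 8))
```

Rows that already have `width` or more columns are left at their length; only empty fields inside them are filled.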

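One way using awk. While FNR == NR, awk is reading the first input file given on the command line (fcu.tsv); each line is stored in a hash, with the first field as the key and the remaining fields, joined with \t, as the value. Once FNR < NR, awk is reading mart.tsv: if the first field matches a key in the hash, the stored value is appended to the end of the line; otherwise the original line is printed unchanged.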

Contents of script.awk:

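BEGIN {
    OFS = "\t"
}

# First file (fcu.tsv): store fields 2..NF, tab-joined, under the ID in $1.
FNR == NR {
    for (i = 2; i <= NF; i++) {
        line = (line ? line OFS : "") $i
    }
    fcu[ $1 ] = line
    line = ""
    next
}

# Second file (mart.tsv): append the stored fields when the ID is known.
FNR < NR {
    if ($1 in fcu) {
        print $0 OFS fcu[ $1 ]
    }
    else {
        print $0
    }
}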

Run it like:

awk -f script.awk fcu.tsv mart.tsv

With the following output:

NM539570 ILMN_2199362 15  58.52 protein_coding 0.204545455  0.607877092  2247 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_2 0.324009324  0.578896174  2577 
NM576455 ILMN_2195138 1  65.74 protein_coding protein binding molecular_function  SAM_2 0.324009324  0.578896174  2577 
NM576455 ILMN_1709067 1  65.74 protein_coding nucleus cellular_component  SAM_2 0.324009324  0.578896174  2577 
NM576455 ILMN_1709067 1  65.74 protein_coding protein binding molecular_function  SAM_2 0.324009324  0.578896174  2577 
NM576455 ILMN_2195138 1  65.74 protein_coding nucleus cellular_component  SAM_type1  0.324009324  0.578896174  2577
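The same lookup-table idea can be written in other languages. As a rough illustration, here is a minimal Python sketch of it, not a drop-in replacement for the awk script. The function names `build_lookup` and `append_fields` are invented for this example, the ID `NM000000` is a made-up row showing the no-match case, and fields are assumed to be separated by whitespace, as awk assumes by default.

```python
def build_lookup(fcu_lines):
    """Map each ID (first field) to the remaining fields of its row, tab-joined."""
    lookup = {}
    for line in fcu_lines:
        fields = line.split()
        if fields:  # skip blank lines
            lookup[fields[0]] = "\t".join(fields[1:])
    return lookup


def append_fields(lookup, mart_lines):
    """Yield every mart line, with the looked-up fields appended when its ID is known."""
    for line in mart_lines:
        line = line.rstrip("\n")
        fields = line.split()
        extra = lookup.get(fields[0]) if fields else None
        yield line if extra is None else line + "\t" + extra


if __name__ == "__main__":
    fcu = [
        "NM576455\t0.324009324\t0.578896174\t2577\n",
        "NM539570\t0.204545455\t0.607877092\t2247\n",
    ]
    mart = [
        "NM539570\tILMN_2199362\t15\t58.52\tprotein_coding\n",
        "NM000000\tILMN_0000000\t1\t50.00\tprotein_coding\n",  # hypothetical ID with no match
    ]
    for row in append_fields(build_lookup(fcu), mart):
        print(row)
```

Both functions accept any iterable of lines, so open file objects for fcu.tsv and mart.tsv can be passed in directly. Each file is read once, which avoids rescanning mart.tsv for every ID.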


2012-07-15 10:33:22 Birei
