2017-08-30 33 views
0

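Merge certain columns from multiple files based on a specific column, without dropping duplicate names

I need to merge the values of column 2 from several files, keyed on column 7 of each file. Based on Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote the following code: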

awk 'FNR==1 { ++numFiles} 
!seen[$7]++ { keys[++numKeys] = $7 } 
{ a[$7,numFiles] = $2 } 
END { 
for (keyNr=1; keyNr<=numKeys; keyNr++) { 
    key = keys[keyNr] 
    printf "%s", key 
    for (fileNr=1;fileNr<=numFiles;fileNr++) { 
     printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA") 
    } 
    print "" 
} } ' file1.txt file2.txt file3.txt > combined.txt 

Input file 1:

+-------+-----------------+----------+-------------+----------+-------------+-------------+ 
| ID | adj.P.Val_file1 | P.Value |  t  | B  | logFC | Gene.symbol | 
+-------+-----------------+----------+-------------+----------+-------------+-------------+ 
| 36879 | 1.66E-09  | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA   | 
| 33623 | 1.66E-09  | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA   | 
| 23271 | 2.70E-09  | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB   | 
| 67 | 2.70E-09  | 2.40E-13 | -11.829044 | 19.892 | -3.06680932 | BB   | 
| 33207 | 1.21E-08  | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC   | 
| 24581 | 1.81E-08  | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC   | 
| 32009 | 3.25E-08  | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC   | 
+-------+-----------------+----------+-------------+----------+-------------+-------------+      

Input file 2:

+-------+-----------------+----------+------------+-----------+------------+--------------+ 
| ID | adj.P.Val_file2 | P.Value |  t  |  B  | logFC | Gene.symbol | 
+-------+-----------------+----------+------------+-----------+------------+--------------+ 
| 40000 | 5.43E-13  | 1.21E-17 | 17.003819 | 29.155646 | 2.4805744 | FGH   | 
| 32388 | 1.15E-11  | 5.12E-16 | 14.920047 | 25.829874 | 2.2497567 | FGH   | 
| 33623 | 6.08E-11  | 4.43E-15 | -13.8115 | 23.870549 | -2.8161587 | ASD   | 
| 25002 | 6.08E-11  | 5.40E-15 | 13.713018 | 23.689571 | 2.2164681 | ASD   | 
| 33207 | 2.03E-10  | 2.29E-14 | -13.009752 | 22.36291 | -2.8787392 | ASD   | 
| 13018 | 2.03E-10  | 2.71E-14 | 12.929201 | 22.207038 | 3.0181585 | ASD   | 
| 5539 | 2.24E-10  | 3.48E-14 | 12.810902 | 21.976634 | 3.0849706 | ASD   | 
+-------+-----------------+----------+------------+-----------+------------+--------------+ 

Desired output:

+-------------+-----------------+-----------------+ 
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 | 
+-------------+-----------------+-----------------+ 
| AA   | 1.66E-09  | NA    | 
| AA   | 1.66E-09  | NA    | 
| BB   | 2.70E-09  | NA    | 
| BB   | 2.70E-09  | NA    | 
| CC   | 1.21E-08  | NA    | 
| CC   | 1.81E-08  | NA    | 
| CC   | 3.25E-08  | NA    | 
| FGH   | NA    | 5.43E-13  | 
| FGH   | NA    | 1.15E-11  | 
| ASD   | NA    | 6.08E-11  | 
| ASD   | NA    | 6.08E-11  | 
| ASD   | NA    | 2.03E-10  | 
| ASD   | NA    | 2.03E-10  | 
| ASD   | NA    | 2.24E-10  | 
+-------------+-----------------+-----------------+ 

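The problem is that column 7 contains duplicate names, and the code only takes the first occurrence of each name; I want a result for every occurrence of a duplicated name. I tried removing each line of the code in turn to understand what it does, but could not come up with a solution.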

+0

Please post sample input and the desired output so that the question is useful to readers.

+0

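I hope the example above helps. The columns in the files are tab-separated, but when I tried to separate them here by pressing the Tab key, it opened the tag dialog, so please go by the example above. The first and second files have the same column headers, namely: ID \t, adj.P.Val_file1 \t, P.Value \t, t \t, B \t, logFC \t, Gene.symbol, whereas the desired output file should contain only: Gene.symbol \t, adj.P.Val_file1 \t, adj.P.Val_file2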

+0

Not sure where rows 8 and 9 of your expected output come from?

Answer

0

Finally figured out the answer myself!

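I just had to remove the condition !seen[$7]++ from my code, because with it in place only the first occurrence of any duplicated name in column 7 (or, more generally, the nth column) is considered.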
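One thing to watch when the guard is removed: a[$7,numFiles] still holds a single value per name and file, so every row for a repeated name may end up showing the last value read for that name. A minimal sketch of one way around this is to store one record per input row instead of one per name. The file names sample1.txt and sample2.txt and the array names rowKey, rowFile, rowVal, hdrKey and hdrVal are hypothetical, and the sketch assumes tab-separated input with a header line in every file. The sample rows are taken from the tables in the question.

```shell
# Build two small tab-separated samples (hypothetical file names).
printf 'ID\tadj.P.Val_file1\tP.Value\tt\tB\tlogFC\tGene.symbol\n' > sample1.txt
printf '36879\t1.66E-09\t7.02E-14\t-12.3836337\t21.00111\t-2.60060826\tAA\n' >> sample1.txt
printf '33207\t1.21E-08\t1.35E-12\t-11.0793461\t18.32425\t-2.65246816\tCC\n' >> sample1.txt
printf '24581\t1.81E-08\t2.41E-12\t-10.8325542\t17.79052\t-1.87937753\tCC\n' >> sample1.txt

printf 'ID\tadj.P.Val_file2\tP.Value\tt\tB\tlogFC\tGene.symbol\n' > sample2.txt
printf '40000\t5.43E-13\t1.21E-17\t17.003819\t29.155646\t2.4805744\tFGH\n' >> sample2.txt
printf '33623\t6.08E-11\t4.43E-15\t-13.8115\t23.870549\t-2.8161587\tASD\n' >> sample2.txt

awk -F'\t' -v OFS='\t' '
FNR == 1 {                      # first line of each file is its header
    ++numFiles
    hdrKey = $7                 # name of the key column
    hdrVal[numFiles] = $2       # name of the value column in this file
    next
}
{                               # one record per data row, duplicates included
    ++numRows
    rowKey[numRows]  = $7
    rowFile[numRows] = numFiles
    rowVal[numRows]  = $2
}
END {
    # Header row: key column name, then one value column name per file.
    printf "%s", hdrKey
    for (f = 1; f <= numFiles; f++)
        printf "%s%s", OFS, hdrVal[f]
    print ""
    # Data rows: the value goes in the column of its own file, NA elsewhere.
    for (r = 1; r <= numRows; r++) {
        printf "%s", rowKey[r]
        for (f = 1; f <= numFiles; f++)
            printf "%s%s", OFS, (f == rowFile[r] ? rowVal[r] : "NA")
        print ""
    }
}' sample1.txt sample2.txt > combined.txt

cat combined.txt
```

With this layout the two CC rows keep their own values (1.21E-08 and 1.81E-08) rather than sharing one, which matches the shape of the desired output shown in the question. It does not attempt to line up a name that appears in more than one file on a single row.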
