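Merging certain columns from multiple files based on a specific column, without removing duplicate names
I have to merge the values of column 2 from several files, keyed on column 7 of each file. Following Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote this code: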
awk 'FNR==1 { ++numFiles }                 # a new input file starts
!seen[$7]++ { keys[++numKeys] = $7 }       # record each distinct column-7 value once, in first-seen order
{ a[$7,numFiles] = $2 }                    # column 2 per (key, file); a repeated key overwrites the earlier value
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1; fileNr<=numFiles; fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA")
        }
        print ""
    }
}' file1.txt file2.txt file3.txt > combined.txt
Input file 1:
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| ID | adj.P.Val_file1 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| 36879 | 1.66E-09 | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA |
| 33623 | 1.66E-09 | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA |
| 23271 | 2.70E-09 | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB |
| 67 | 2.70E-09 | 2.40E-13 | -11.829044 | 19.892 | -3.06680932 | BB |
| 33207 | 1.21E-08 | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC |
| 24581 | 1.81E-08 | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC |
| 32009 | 3.25E-08 | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
Input file 2:
+-------+-----------------+----------+------------+-----------+------------+--------------+
| ID | adj.P.Val_file2 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+------------+-----------+------------+--------------+
| 40000 | 5.43E-13 | 1.21E-17 | 17.003819 | 29.155646 | 2.4805744 | FGH |
| 32388 | 1.15E-11 | 5.12E-16 | 14.920047 | 25.829874 | 2.2497567 | FGH |
| 33623 | 6.08E-11 | 4.43E-15 | -13.8115 | 23.870549 | -2.8161587 | ASD |
| 25002 | 6.08E-11 | 5.40E-15 | 13.713018 | 23.689571 | 2.2164681 | ASD |
| 33207 | 2.03E-10 | 2.29E-14 | -13.009752 | 22.36291 | -2.8787392 | ASD |
| 13018 | 2.03E-10 | 2.71E-14 | 12.929201 | 22.207038 | 3.0181585 | ASD |
| 5539 | 2.24E-10 | 3.48E-14 | 12.810902 | 21.976634 | 3.0849706 | ASD |
+-------+-----------------+----------+------------+-----------+------------+--------------+
Desired output:
+-------------+-----------------+-----------------+
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 |
+-------------+-----------------+-----------------+
| AA | 1.66E-09 | NA |
| AA | 1.66E-09 | NA |
| BB | 2.70E-09 | NA |
| BB | 2.70E-09 | NA |
| CC | 1.21E-08 | NA |
| CC | 1.81E-08 | NA |
| CC | 3.25E-08 | NA |
| FGH | NA | 5.43E-13 |
| FGH | NA | 1.15E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.24E-10 |
+-------------+-----------------+-----------------+
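The problem is that column 7 contains duplicate names, and the code only takes the first occurrence of a given name; I want a result row for every occurrence of a duplicated name. I tried removing each line of the code in turn to understand what it does, but could not come up with a solution.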
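Please post sample input and expected output so that the question is useful to readers.
I hope the example above helps. The columns in the files are tab-separated, but when I tried to separate them here by pressing the Tab key, it opened the tag dialog, so please go by the example above. The first and second files have the same column headers, i.e.: ID \t, adj.P.Val_file1 \t, P.Value \t, t \t, B \t, logFC \t, Gene.symbol, while the desired output file should have only: Gene.symbol \t, adj.P.Val_file1 \t, adj.P.Val_file2
Not sure where rows 8 and 9 of your expected output come from?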
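One possible approach, offered as a sketch rather than a definitive answer: the desired output above has one row per input row, so instead of indexing by the name in column 7 (which collapses duplicates), each row can be stored under its own running row number together with the number of the file it came from. The sketch assumes the real files are plain tab-delimited text with a single header line (no drawn table borders) and that rows sharing a name are never merged, within or across files. The array names `hdr`, `key`, `val` and `file`, and the few sample rows written with `printf`, are illustrative only.

```shell
# Hypothetical sample data: a few rows copied from the question, tab-delimited.
printf 'ID\tadj.P.Val_file1\tP.Value\tt\tB\tlogFC\tGene.symbol\n'                > file1.txt
printf '36879\t1.66E-09\t7.02E-14\t-12.3836337\t21.00111\t-2.60060826\tAA\n'    >> file1.txt
printf '33623\t1.66E-09\t7.39E-14\t-12.3599517\t20.95461\t-2.53106808\tAA\n'    >> file1.txt
printf '23271\t2.70E-09\t2.30E-13\t-11.8478184\t19.93024\t-2.15050984\tBB\n'    >> file1.txt
printf 'ID\tadj.P.Val_file2\tP.Value\tt\tB\tlogFC\tGene.symbol\n'                > file2.txt
printf '40000\t5.43E-13\t1.21E-17\t17.003819\t29.155646\t2.4805744\tFGH\n'      >> file2.txt
printf '33623\t6.08E-11\t4.43E-15\t-13.8115\t23.870549\t-2.8161587\tASD\n'      >> file2.txt

awk -F'\t' -v OFS='\t' '
FNR == 1 {                       # header line of each input file
    ++numFiles
    hdr[numFiles] = $2           # column-2 header, e.g. adj.P.Val_file1
    keyName = $7                 # column-7 header, e.g. Gene.symbol
    next
}
{                                # every data row gets its own slot, so duplicates survive
    ++numRows
    key[numRows]  = $7
    val[numRows]  = $2
    file[numRows] = numFiles
}
END {
    printf "%s", keyName         # output header: key name plus one column per file
    for (f = 1; f <= numFiles; f++) {
        printf "%s%s", OFS, hdr[f]
    }
    print ""
    for (r = 1; r <= numRows; r++) {
        printf "%s", key[r]
        for (f = 1; f <= numFiles; f++) {
            # the value goes in the column of its own file, NA in the others
            printf "%s%s", OFS, (file[r] == f ? val[r] : "NA")
        }
        print ""
    }
}' file1.txt file2.txt > combined.txt
```

Rows are buffered until `END` because the number of output columns is only known once every file has been read. If rows with the same name were instead meant to be paired up across files, a different key (for example name plus occurrence count) would be needed; the desired output above does not suggest that.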