2017-07-26 57 views
-2

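The awk below runs and produces the current output. It compares the lines of file1 with those of file2 and reports each line as Match, Missing in file1, or Missing in file2. I am trying to add a condition that also extracts the text or value after each of the tags AF=, FR=, HRUN=, LEN=, and TYPE=, up to the ; (semicolon), but I have not been able to. A tag may not always be followed by text, but it always ends with ;. The decimal in $6 should also be shown with 3 significant digits for easier reading. I think it is close, but there are a few things I am not sure how to do. Thank you :).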

file1

chr1 43814978 COSM27286 G A 86.92679999999999 PASS  
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS; 
chr1 43814981 COSM27287 G A 86.83350000000002 PASS  
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS; 
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS   
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS; 

file2

chr1 43814979 COSM27286 G A 86.92679999999999 PASS  
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS; 
chr1 43814981 COSM27287 G A 86.83350000000002 PASS  
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS; 
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS   
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS; 

Desired output

Match: 
chr1 43814981 COSM27287 G A 86.8 PASS  
AF=0;FR=.;HRUN=1;LEN=1;TYPE=snp 
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS   
AF=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp 
Missing in file1: 
chr1 43814979 COSM27286 G A 86.9 PASS  
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp 
Missing in file2: 
chr1 43814978 COSM27286 G A 86.9 PASS  
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp 

awk

awk 'FNR==1 { next } 
FNR == NR { file1[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 } 
FNR != NR { file2[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 } 
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k] 
     print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k] 
     print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k] 
}' file1 file2 > output 

Current output

Match: 
chr1 43814981 COSM27287 G A 86.83350000000002 PASS 
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS 
Missing in File1: 
chr1 43814979 COSM27286 G A 86.92679999999999 PASS 
Missing in File2: 
chr1 43814978 COSM27286 G A 86.92679999999999 PASS 
+1

Could you please explain why, in line 2 of Match, only (COSM29008; COSM43212) appears when logically it should be (COSM29008; COSM43212; COSM19193; COSM27289; COSM28487)? – RavinderSingh13

+2

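You seem to have asked enough questions about 'awk' to at least make a decent attempt at solving the problem yourself. But you keep asking for free code? – Inian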

+0

Sorry, that was a typo on my part; your code is correct. Sorry, and thank you :). – Chris

Answer

1

Try:

awk 'FNR==NR{
      # First file: remember each record, keyed on fields 1, 2 and 7.
      a[$1,$2,$7]=$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
      next
      }
    (($1,$2,$7) in a){
      # Second file, key also present in the first file: collect as a match.
      val_match=val_match?val_match ORS a[$1,$2,$7]:a[$1,$2,$7];
      delete a[$1,$2,$7];
      next
      }
    {
      # Second file, key absent from the first file.
      val_mismatch_in_file1=val_mismatch_in_file1?val_mismatch_in_file1 ORS $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7:$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
    }
END{
    # Whatever is left in a[] was never seen in the second file.
    for(i in a){
      val_missing_in_file2=val_missing_in_file2?val_missing_in_file2 ORS a[i]:a[i]
    }
    print "Match:" ORS val_match ORS "Missing in File1:" ORS val_mismatch_in_file1 ORS "Missing in File2:" ORS val_missing_in_file2
    }
    ' Input_file1 Input_file2

The output will be as follows.

Match: 
chr1 43814981 COSM27287 G A 86.83350000000002 PASS 
chr1 43815008 COSM29008;COSM43212;COSM19193;COSM27289;COSM28487 TGG AAA,AAG,AGG,CGG,GCG 70.3099 PASS 
Missing in File1: 
chr1 43814979 COSM27286 G A 86.92679999999999 PASS 
Missing in File2: 
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
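
The output above does not yet include the tag extraction or the shortened $6 that the question asks for. The following is a minimal sketch of one way to add both, not a definitive solution. It assumes that each record is a single line whose eighth whitespace-separated field is the ;-separated tag string (the listings in the question show it wrapped onto a second line), that "3 significant digits" can be approximated with the printf format %.3g, and that the files have no header line (the FNR==1 skip from the question is left out). The function names pick, key and fmt and the temporary sample files are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: one record per line, tag string as field 8.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr1 43814978 COSM27286 G A 86.92679999999999 PASS AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
EOF
cat > "$tmp/file2" <<'EOF'
chr1 43814979 COSM27286 G A 86.92679999999999 PASS AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
EOF

result=$(awk '
# Return only the wanted tags from a ";"-separated tag string, in a fixed order.
function pick(info,    want, nw, parts, n, i, j, out) {
    nw = split("AF FR HRUN LEN TYPE", want, " ")
    n  = split(info, parts, ";")
    out = ""
    for (j = 1; j <= nw; j++)
        for (i = 1; i <= n; i++)
            if (index(parts[i], want[j] "=") == 1)   # tag must start the item, so FAO= is not taken for AF=
                out = (out == "" ? "" : out ";") parts[i]
    return out
}
# Comparison key: the first seven fields, as in the question.
function key() { return $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 SUBSEP $6 SUBSEP $7 }
# Printed record: $6 shortened with %.3g, selected tags on a second line.
function fmt() {
    return $1 " " $2 " " $3 " " $4 " " $5 " " sprintf("%.3g", $6) " " $7 "\n" pick($8)
}
FNR == NR { k = key(); rec1[k] = fmt(); ord1[++n1] = k; next }   # first file
          { k = key(); rec2[k] = fmt(); ord2[++n2] = k }         # second file
END {
    # ord1/ord2 keep input order, which "for (k in array)" does not guarantee.
    print "Match:"
    for (i = 1; i <= n1; i++) if (ord1[i] in rec2) print rec1[ord1[i]]
    print "Missing in file1:"
    for (i = 1; i <= n2; i++) if (!(ord2[i] in rec1)) print rec2[ord2[i]]
    print "Missing in file2:"
    for (i = 1; i <= n1; i++) if (!(ord1[i] in rec2)) print rec1[ord1[i]]
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because %.3g is applied to every record, 70.3099 is printed as 70.3 here, whereas the desired output in the question leaves that value unchanged; if only some values should be shortened, that rule would need to be stated separately. The comparison still uses the full, unrounded $6, so two records that differ only beyond the third significant digit are not reported as a match.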