2017-02-13 123 views

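AWK/SED: match patterns between files and print everything between the matches

I am trying to combine the topics of this and this question: for every string/line in File2 (each string occurs only once), find the match in File1, print the entire matching line from File1, and also print the lines between that match and the next header (i.e. the sequence belonging to that header).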

File1

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail) 
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA 
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG 
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA 
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU 
>GAXI01000526.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail) 
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAU 
UAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAGUGAAAC 
>GAXI01005455.1.1233 Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Chryseobacterium;Tetrodontophora bielanensis (giant springtail) 
CUUUCGAAAGGAAGAUUAAUACCCCAUAACAUA 
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail) 
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG 
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU 

File2

>GAXI01000525.151.1950 
>GAXI01006199.29.1525 

What I have so far:

awk 'FNR==NR{a[$0];next} $1 in a' file2 file1 > output 

This gives:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail) 
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail) 

What I want is this:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail) 
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA 
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG 
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA 
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU 
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail) 
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG 
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU 

The original file contains thousands of lines, so the fastest solution would be appreciated, whether it is awk, sed, or anything else...


IMHO, scientists can get faster and better results with tools designed for their work. Tools such as https://metacpan.org/release/BioPerl and/or https://metacpan.org/release/FAST can surely achieve this goal more efficiently... – jm666


Sure, although I don't run this kind of query every day. –

Answers

1

@jO: try:

awk 'FNR==NR{A[$1];next} ($0 ~ /^>/){Q=""} ($1 in A){Q=1} Q{print}' file2 file1 

EDIT: Adding an explanation of the solution here.

awk '
# True only while file2 (the first file) is being read: FNR restarts at 1
# for each input file, whereas NR keeps counting across all files.
FNR==NR {
    A[$1]          # store the first field of each file2 line as a key of array A
    next           # skip the remaining rules for file2 lines
}
# The rules below therefore run on file1 only.
/^>/      { Q="" } # a line starting with > begins a new record, so clear the flag Q
($1 in A) { Q=1 }  # the first field of this line is a key of A, so set the flag
Q                  # while Q is set, print the current line (default action)
' file2 file1
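
To see the flag logic at work on a small input, here is a minimal sketch. The file names ids.txt and seqs.fa, their contents, and the variable names want and keep are made up for illustration, and mktemp -d is assumed to be available.

```shell
#!/bin/sh
# Hypothetical sample data, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>id1 first record' 'AAAA' 'CCCC' \
              '>id2 second record' 'GGGG' \
              '>id3 third record' 'UUUU' > "$tmp/seqs.fa"
printf '%s\n' '>id1' '>id3' > "$tmp/ids.txt"

# Same flag technique as above: start printing at a wanted header and
# stop at the next header that is not wanted.
result=$(awk 'FNR==NR { want[$1]; next }      # first file: remember the IDs
              /^>/    { keep = ($1 in want) } # header: set or clear the flag
              keep' "$tmp/ids.txt" "$tmp/seqs.fa")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the id1 and id3 records are printed, each header followed by its sequence lines; the id2 record is skipped because its header clears the flag.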
1

You can use awk:

awk 'FNR==NR{d[$1]; next}/^>/{f=0}$1 in d{f=1}f' file2 file1 

Which gives:

 
>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail) 
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA 
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG 
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA 
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU 
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail) 
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG 
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU 
1

This might work for you (GNU sed):

sed 's:.*:/^&/bb:' file2 | sed -e ':a' -f - -e 'd;:b;n;/^>/ba;bb' file1 

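Convert file2 into a set of matches to be printed from file1; everything that does not match is deleted.

This uses two invocations of sed. The first turns file2 into the regexps to match; the second wraps them in a framework that prints lines from a match up to the start of the next record or the end of the file.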
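
To make the first step concrete, the sketch below prints the script that the first sed invocation generates. The two IDs are hypothetical stand-ins for the contents of file2.

```shell
#!/bin/sh
# Hypothetical ID list standing in for file2.
script=$(printf '%s\n' '>id1' '>id3' | sed 's:.*:/^&/bb:')

# Each generated line is an address followed by "b b", a branch to label b.
printf '%s\n' "$script"
# prints:
# /^>id1/bb
# /^>id3/bb
```

The second sed reads these lines via -f - and places them between the :a label and the d command, so a header that matches none of them is deleted. Note that the IDs are used as regular expressions, so a "." in an ID matches any character rather than only a literal dot.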