2015-09-27

Insert a line after a pattern match

I have a file that looks like this:

Scaffold2 GeneWise  mRNA 3038 6649 
Scaffold2 GeneWise  CDS  3038 3480 
Scaffold2 GeneWise  CDS  4175 4291 
Scaffold3 GeneWise  mRNA 2824 15173 
Scaffold3 GeneWise  CDS  2824 3302 
Scaffold3 GeneWise  CDS  4143 4344 

I would like the output to look like this:

Scaffold2 GeneWise  mRNA 3038 6649 
Scaffold2 GeneWise  CDS  3038 **3480** 
Scaffold2 GeneWise  1st_intron  **3480 4175** 
Scaffold2 GeneWise  CDS  **4175** 4291 
Scaffold3 GeneWise  mRNA 2824 15173 
Scaffold3 GeneWise  CDS  2824 **3302** 
Scaffold3 GeneWise  1st_intron  **3302 4143** 
Scaffold3 GeneWise  CDS  **4143** 4344 

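It should work as follows: if column 3 is 'mRNA', take column 5 of the next line and column 4 of the line after that, then insert a new line between those two lines containing these two values (shown in bold), with '1st_intron' in the third column.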

I have never dealt with a problem like this before, so any hints would be much appreciated.

Answers

2

You can use this simple awk:

awk '$3=="mRNA"{p=1; print; next} 
    p{s=$1 FS $2 FS "1st_intron" FS $5; print; p=0; next} 
    s{print s, $4; s=""} 1' file | column -t 

Output:

Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344

column -t is used only to format the output.
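
If the result is meant to be fed to another tool rather than read on screen, tab-separated output may be more convenient than piping through column -t. The following is only a sketch of that idea, assuming whitespace-separated input laid out as in the question; the function name add_first_intron and the variable names are made up for illustration and are not part of the answer above.

```shell
# Hypothetical helper (name is an assumption, not from the answer):
# same flag-based logic as above, but every line is emitted tab-separated.
add_first_intron() {
  awk -v OFS='\t' '
    { $1 = $1 }                                   # rebuild $0 so it uses OFS
    $3 == "mRNA" { p = 1; print; next }           # mRNA line: first CDS comes next
    p    { end = $5; have = 1; p = 0; print; next }   # first CDS: remember its end
    have { print $1, $2, "1st_intron", end, $4; have = 0 }  # second CDS: intron goes first
    { print }                                     # then the current line itself
  ' "$@"
}
```

Called as add_first_intron file, it writes the same records as the pipeline above, with tabs between fields instead of aligned padding.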

0

A Perl solution.

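$intron is 0 when there is nothing to do. When an mRNA line is processed it is set to 1, so that on the next line $left can record the intron's left boundary (column 5 of that line) and $intron is set to 2. On the line after that, the intron line is printed and $intron is reset to 0.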

#!/usr/bin/perl 
use warnings; 
use strict; 

my $intron = 0; 
my ($left, $right); 
while (<>) { 
    my @items = split; 

    if (1 == $intron) { 
        $left = $items[4]; 
        $intron = 2; 

    } elsif (2 == $intron) { 
        print join "\t", @items[0, 1], '1st_intron', $left, $items[3]; 
        print "\n"; 
        $intron = 0; 
    } 

    $intron = 1 if 'mRNA' eq $items[2]; 
    print; 
} 
0

awk has a handy lookahead facility, the getline function:

awk '$3=="mRNA"{print;getline;c5=$5;print;getline;print $1," ",$2,"  1st_intron",c5,$4;print}' 

Test:

Scaffold2 GeneWise  mRNA 3038 6649 
Scaffold2 GeneWise  CDS  3038 3480 
Scaffold2 GeneWise  1st_intron 3480 4175 
Scaffold2 GeneWise  CDS  4175 4291 
Scaffold3 GeneWise  mRNA 2824 15173 
Scaffold3 GeneWise  CDS  2824 3302 
Scaffold3 GeneWise  1st_intron 3302 4143 
Scaffold3 GeneWise  CDS  4143 4344 
+0

This will duplicate lines in some situations and has various other negative side effects. See http://awk.info/?tip/getline
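
One way the duplication can arise: at end of input, getline returns 0 and leaves $0 untouched, so an unconditional print after it emits the previous line a second time. If getline is used anyway, a minimal sketch of a guarded version might look like the following. It checks the return value of every getline call and still assumes that each mRNA line is followed by its two CDS lines, as in the question; the function name is hypothetical.

```shell
# Hypothetical helper: getline-based variant that checks each getline call.
add_first_intron_getline() {
  awk '
    $3 == "mRNA" {
      print
      if ((getline) <= 0) exit      # input ended right after the mRNA line
      end = $5; print               # first CDS: remember its end
      if ((getline) <= 0) exit      # no second CDS, so no intron to report
      print $1, $2, "1st_intron", end, $4
      print
      next
    }
    { print }                       # pass every other line through unchanged
  ' "$@"
}
```

With input that stops after the first CDS line, this prints the two lines it has seen and exits, rather than printing a bogus intron line and repeating the CDS line.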

1
$ cat tst.awk 
p1 == "mRNA" { x=$5 } 
p2 == "mRNA" { print $1, $2, "1st_intron", x, $4 } 
{ print; p2=p1; p1=$3 } 

$ awk -f tst.awk file | column -t 
Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344