2014-09-19 54 views
-1

我寫了一個PERL程序,該程序需要一個Excel工作表(通過將.xls擴展名改爲.txt來轉換爲文本文件)以及一個用於輸入的序列文件。 Excel工作表包含序列文件中某個區域的起始點和結束點(以及匹配區域任一側的70個側翼值),這些文件需要剪切並提取到第三個輸出文件中。有300個值。程序讀入每次需要切割的序列的起始點和結束點,但它重複告訴我,如果顯然不是輸入文件的長度,則該值超出了該長度。我只是不能似乎得到這個固定Perl程序錯誤

這是程序

use strict; 
use warnings; 

my $blast; 
my $i; 
my $idline; 
my $sequence; 
print "Enter Your BLAST result file name:\t"; 
chomp($blast = <STDIN>); # BLAST result file name 
print "\n"; 

my $database; 
print "Enter Your Gene list file name:\t"; 
chomp($database = <STDIN>); # sequence file 
print "\n"; 

open IN, "$blast" or die "Can not open file $blast: $!"; 

my @ids  =(); 
my @seq_start =(); 
my @seq_end =(); 

while (<IN>) { 

    #spliting the result file based on each tab 
    my @feilds = split("\t", $_); 
    push(@ids, $feilds[0]); #copying the name of sequence 
     #coping the 6th tab value of the result which is the start point of from where a value should be cut. 
    push(@seq_start, $feilds[6]); 
    #coping the 7th tab value of the result file which is the end point of a value should be cut. 
    push(@seq_end, $feilds[7]); 
} 
close IN; 

open OUT, ">Result.fasta" or die "Can not open file $database: $!"; 

for ($i = 0; $i <= $#ids; $i++) { 

    ($sequence) = &block($ids[$i]); 

    ($idline, $sequence) = split("\n", $sequence); 

    #extracting the sequence from the start point to the end point 
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1; 

    my $Nucleotides = substr($sequence, $seq_start[$i], $seqlen); #storing the extracted substring into $sequence 

    $Nucleotides =~ s/(.{1,60})/$1\n/gs; 

    print OUT "$idline\n"; 
    print OUT "$Nucleotides\n"; 
} 
print "\nExtraction Completed..."; 

sub block { 
    #block for id storage which is the first tab in the Blast output file. 
    my $id1 = shift; 
    print "$id1\n"; 
    my $start =(); 

    open IN3, "$database" or die "Can not open file $database: $!"; 

    my $blockseq = ""; 
    while (<IN3>) { 

     if (($_ =~ /^>/) && ($start)) { 

      last; 
     } 

     if (($_ !~ /^>/) && ($start)) { 

      chomp; 
      $blockseq .= $_; 
     } 

     if (/^>$id1/) { 

      my $start = $. - 1; 
      my $blockseq .= $_; 
     } 
    } 
    close IN3; 

    return ($blockseq); 
} 

BLAST結果文件:http://www.fileswap.com/dl/Ws7ehftejp/

序列文件:http://www.fileswap.com/dl/lPwuGh2oKM/

錯誤

SUBSTR之外字符串在Nucleotide_Extractor.pl第39行。

0在 Nucleotide_Extractor.pl線在Nucleotide_Extractor.pl線44 41.

使用未初始化值$核苷酸的級聯(。)或串

使用未初始化值$核苷酸的置換(一個或多個///)。

任何幫助是非常讚賞和查詢總是被邀請

+0

什麼是phytophthora文件?沒有它,我無法處理塊功能。你看起來像substr(「Hello」,45,4)那樣帶有字符串長度以外的起始索引的子字符串。由於它不返回$核苷酸也未初始化。我建議你檢查substr的索引。 – xtreak 2014-09-19 05:42:51

+0

@Wordzilla這是我在問題中提供的鏈接所使用的序列文件名。我已經將兩個輸入文件上傳到fileswap並提供了鏈接。請下載這兩個文件並進行處理。該序列屬於名爲Phytophthora的生物體。我現在改了文件名。謝謝 – 2014-09-19 05:56:01

+0

您還應該在腳本中使用strict,並使用'my'聲明所有變量 - 即'my $ sequence = ...'。 – 2014-09-19 07:04:22

回答

2

有幾個問題與現有的代碼,我結束了重寫劇本,而修復這些錯誤。您的實施效率不高,因爲它會打開,讀取並關閉Excel工作表中每個ID的序列文件。更好的方法是從序列文件中讀取和存儲數據,或者如果內存有限,請遍歷序列文件中的每個條目並從Excel文件中選取相應的數據。你也可以更好地使用散列,而不是數組;哈希將數據存儲在鍵 - 值對中,因此查找您要查找的內容更加容易。我也一直使用引用,因爲它們可以很容易地將數據傳入和傳出子例程。

如果您不熟悉perl數據結構,請查看perlfaq4perldscperlreftut有關於使用引用的信息。

現有代碼的主要問題是從fasta文件獲取序列的子例程沒有返回任何內容。在你的代碼中放置大量的調試語句以確保它正在做你認爲正在做的事情是一個好主意。我留在我的調試聲明中,但將它們評論出來。我也大量地評論了我改變的代碼。

#!/usr/bin/perl 
use strict; 
use warnings; 
# enables 'say', which prints out your text and adds a carriage return 
use feature ':5.10'; 
# a very useful module for dumping out data structures 
use Data::Dumper; 

#my $blast = 'infesmall.txt'; 
print "Enter Your BLAST result file name:\t"; 
chomp($blast = <STDIN>);  # BLAST result file name 
print "\n"; 

#my $database = 'infe.fasta'; 
print "Enter Your Gene list file name:\t"; 
chomp($database = <STDIN>); # sequence file 
print "\n"; 

open IN,"$blast" or die "Can not open file $blast: $!"; 

# instead of using three arrays, let's use a hash reference! 
# for each ID, we want to store the start and the end point. To do that, 
# we'll use a hash of hashes. The start and end information will be in one 
# hash reference: 
# { start => $fields[6], end => $fields[7] } 
# and we will use that hashref as the value in another hash, where the key is 
# the ID, $fields[0]. This means we can access the start or end data using 
# code like this: 
# $info->{$id}{start} 
# $info->{$id}{end} 
my $info; 

while(<IN>){ 
    #splitting the result file based on each tab 
    my @fields = split("\t",$_); 
    # add the data to our $info hashref with the ID as the key: 
    $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] }; 
} 
close IN; 

#say "info: " . Dumper($info); 

# now read the sequence info from the fasta file 
my $sequence = read_sequences($database); 
#say "data from read_sequences:\n" . Dumper($sequence); 

my $out = 'result.fasta'; 
open(OUT, ">" . $out) or die "Can not open file $out: $!"; 

foreach my $id (keys %$info) { 

    # check whether the sequence exists 
    if ($sequence->{$id}) { 
     #extracting the sequence from the start point to the end point 
     my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1; 

     #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end}; 

     #storing the extracted substring into $sequence 
     my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen); 
     $nucleotides =~ s/(.{1,60})/$1\n/gs; 
     #say "nucleotides: $nucleotides"; 
     print OUT $sequence->{$id}{header} . "\n"; 
     print OUT "$nucleotides\n"; 
    } 
} 
print "\nExtraction Completed..."; 

sub read_sequences { 
    # fasta file 
    my $fasta_file = shift; 

    open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!"; 

    # initialise two variables. We will store our sequence data in $fasta 
    # and use $id to track the current sequence ID 
    # the $fasta hash will look like this: 
    # $fasta = { 
    # 'gi|7212472|ref|NC_002387.2' => { 
    #  header => '>gi|7212472|ref|NC_002387.2| Phytophthora...', 
    #  seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...', 
    # } 
    #} 

    my ($fasta, $id); 

    while(<IN3>){ 
     chomp; 
     if (/^>/) { 
      if (/^>(\S+) /){ 
       # the header line with the sequence info. 
       $id = $1; 
       # save the data to the $fasta hash, keyed by seq ID 
       # we're going to build up an entry as we go along 
       # set the header to the current line 
       $fasta->{ $id }{ header } = $_; 
      } 
      else { 
       # no ID found! Erk. Emit an error and undef $id. 
       warn "Formatting error: $_"; 
       undef $id; 
      } 
     } 
     ## ensure we're getting sequence lines... 
     elsif (/^[ATGC]/) { 
      # if $id is not defined, there's something weird going on, so 
      # don't save the sequence. In a correctly-formatted file, this 
      # should not be an issue. 
      if ($id) { 
       # if $id is set, add the line to the sequence. 
       $fasta->{ $id }{ seq } .= $_; 
      } 
     } 
    } 
    close IN3; 
    return $fasta; 
} 
+0

讚賞...謝謝你的時間和麻煩。 – 2014-09-20 04:12:43