2016-08-16 100 views
2

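Perl: return the highest percentage match for a string

I have a DNA sequence, for example ATCGATCG. I also have a database of DNA sequences formatted as follows: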

>Name of sequence1 
SEQUENCEONEEXAMPLEGATCGATC 
>Name of sequence2 
SEQUENCETWOEXAMPLEGATCGATC 

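(So odd-numbered lines contain a name and even-numbered lines contain a sequence.) Currently, I look for a perfect match between my sequence and the sequences in the database as follows (assume all variables are declared):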

my $name;
my $seq;
my $returnval = "The sequence does not match any in database";
open (my $database, "<", $db1) or die "Can't find db1";
until (eof $database) {
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    if (
        index($seq, $entry) != -1
        || index($entry, $seq) != -1
    ) {
        $returnval = "The sequence matches: " . $name;
        last;
    }
}
close $database;

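Is there a way to return the name of the database sequence with the highest percentage match, along with the percentage match between the entry and that sequence?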

+1

How big is the database? – Zaid

+0

Not sure whether [`String::Approx`](https://metacpan.org/pod/String::Approx) would help you. – Zaid

+1

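You could break up your strings and compare them character by character, though that is fiddly. See, for example, what is done in [this post](http://stackoverflow.com/questions/9106978/perl-partial-match). Better yet, find a module; for example, [`Text::Fuzzy`](http://search.cpan.org/~bkb/Text-Fuzzy-0.24/lib/Text/Fuzzy.pod) should do it. – zdim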
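
The character-by-character comparison suggested in this comment could be sketched roughly as follows. This is a minimal illustration, not the code from the linked post: `percent_identity` is a hypothetical helper that slides the shorter string along the longer one without gaps and reports the best fraction of matching positions. It therefore ignores insertions and deletions, which a module such as `Text::Fuzzy` or `String::Approx` would account for.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): best ungapped match of the
# shorter string against every window of the longer one, expressed as a
# percentage of the shorter string's length.
sub percent_identity {
    my ($x, $y) = @_;
    my ($short, $long) = length($x) <= length($y) ? ($x, $y) : ($y, $x);
    my $len = length $short;
    return 0 if $len == 0;

    my $best = 0;
    for my $offset (0 .. length($long) - $len) {
        my $window = substr $long, $offset, $len;
        # XOR of two equal-length strings gives "\0" wherever the characters agree;
        # tr/\0// counts those positions.
        my $matches = (($short ^ $window) =~ tr/\0//);
        $best = $matches if $matches > $best;
    }
    return 100 * $best / $len;
}

printf "%.1f%%\n", percent_identity("ATCGATCG", "GGATCGTTCGAA");   # prints 87.5%
```

Here the best window is `ATCGTTCG`, which agrees with the query in 7 of 8 positions. The cost grows with the product of the two lengths, so this is only practical for short sequences or small databases.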

Answers

3

String::Similarity returns the similarity between two strings as a value between 0 and 1, where 0 means completely dissimilar and 1 means identical.

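use String::Similarity;   # provides similarity()

# $db1 (the path to the database file) is assumed to be declared elsewhere.
my $entry = "AGGUUG";
my $returnval = "The sequence does not match any in database";
my $highestsim = 0;
my $highestname;

open (my $database, "<", $db1) or die "Can't open db1: $!";
until (eof $database) {
    chomp (my $name = <$database>);
    chomp (my $seq = <$database>);
    # The third argument is a lower limit: scores below it are not computed exactly.
    my $currsim = similarity $entry, $seq, $highestsim;
    if ($currsim > $highestsim) {
        $highestsim = $currsim;
        $highestname = $name;
    }
}
close $database;

# Only build the message if some sequence scored above 0.
if (defined $highestname) {
    $highestname =~ s/^>//;      # drop the leading '>' of the header line
    $highestname =~ s/\s+$//;    # drop trailing whitespace
    $returnval = sprintf "This sequence matches %s the best with %.1f%% similarity",
        $highestname, $highestsim * 100;
}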
+1

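If you pass `$highestsim` as the third argument to `similarity`, you should see a performance improvement: the comparison stops as soon as the similarity drops below the given limit. –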
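
The cutoff idea in this comment can be illustrated without the module. The sketch below is not how `String::Similarity` is implemented. `identity_with_limit` is a hypothetical helper that compares two equal-length strings position by position and gives up as soon as the remaining positions can no longer reach the limit. Returning 0 for an abandoned comparison is also an assumption made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of matching positions between two
# equal-length strings, abandoning the scan once $limit is unreachable.
sub identity_with_limit {
    my ($x, $y, $limit) = @_;
    my $len = length $x;
    die "strings must have equal length" unless $len == length $y;
    return 0 if $len == 0;

    my $matches = 0;
    for my $i (0 .. $len - 1) {
        $matches++ if substr($x, $i, 1) eq substr($y, $i, 1);
        my $remaining = $len - 1 - $i;
        # Even if every remaining position matched, the limit could not be reached.
        return 0 if ($matches + $remaining) / $len < $limit;
    }
    return $matches / $len;
}

# Feed the best score so far back in as the limit, as the comment suggests.
my $best = 0;
for my $candidate (qw(AGGTTC TTTTTT AGGUUG)) {
    my $score = identity_with_limit("AGGUUG", $candidate, $best);
    $best = $score if $score > $best;
}
print "best score: $best\n";   # prints "best score: 1"
```

With the running best used as the limit, the scan of `TTTTTT` is abandoned after four positions, because at most two of the six positions could still match.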

+0

Makes sense. I'll add it. –