2014-10-06 43 views
1

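Searching for the maximum number of matches in a regexp in Perl

I am trying to search some arrays for the best match. For example, given the @source and @search lists: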

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Trent Reznor","Barack Hussein Obama II","Barack Hussein II"); #note that the second item is wrong and should be discarded! 
my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II","Barack Hussein Obama II","No match here"); 

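I want to associate each item of the @search list with the best-matching item of the @source list. I figured this could be done with a search pattern made of several ORs (alternations), but I am stuck. See my example below: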

#!/usr/local/bin/perl 
use strict; 
use warnings; 

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Trent Reznor","Barack Hussein Obama II","Barack Hussein II"); 
my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II","Barack Hussein Obama II","No match here"); 

print "twonames\t\talternativesearch\n"; 
foreach my $s (@search){ 
    #gets first two names 
    (my $twonames=$s)=~s/^(\w+ \w+).*$/$1/; 

    #gets all other names, if they exist 
    (my $others=$s)=~s/^(\w+ \w+)//; 
    if ($others){ 
     #deletes initial space 
     (my $alternativesearch=$others)=~s/^\s//; 
     $alternativesearch=~s/\s/\|/g; 
     print "$twonames\t\t$alternativesearch\n"; 
    } 
    else { 
     print "$twonames\t\tNO OTHER NAMES PRESENT\n"; 
    } 
} 
    #prints 
    # twonames    alternativesearch 
    # John Ronald    Reuel|T 
    # Trent Reznor   NO OTHER NAMES PRESENT 
    # Barack Hussein   II 
    # Barack Hussein   Obama|II 
    # No match    here 

From this search I would like to get an association between each @search item and the @source item that yields the best match. Something like:

# search     source 
# John Ronald Reuel T  John Ronald Reuel Tolkien 
# Trent Reznor    Trent Reznor 
# Barack Hussein Obama II Barack Hussein Obama II 
# No match here    

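Note that in the Obama case it matches the entire string, in the first line it matches the first two words plus something else, and in the last case it finds nothing. How would you go about finding the best match? Thanks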

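Edit: This is cross-posted on PerlMonks. Edit 2: Although I used people's names in my example, my real case does not involve people's names, in case that matters.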

+5

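Have you looked at CPAN's word-matching / fuzzy-matching modules? There are already a number of them there. Unless you have a specific definition of "best match" (e.g. whole-word matches are good but partial-word matches are not; smallest Levenshtein distance between the strings), you can save a lot of time by using an existing module. – 2014-10-06 11:07:35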

+0

Awesome! I'll look for them! Thanks! – Sosi 2014-10-06 11:36:38

+0

Cross-posted: http://www.perlmonks.org/?node_id=1102921 – choroba 2014-10-06 12:14:21
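To make the "smallest Levenshtein distance" definition from the first comment concrete, here is a minimal pure-Perl sketch. It is not taken from any CPAN module; `levenshtein` and `closest` are hypothetical helper names chosen for illustration, and all three edit operations are assumed to cost 1.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance by dynamic programming, keeping only the previous row.
# Assumption: insert, delete and substitute each cost 1.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,          # deletion
                           $cur[$j - 1] + 1,       # insertion
                           $prev[$j - 1] + $cost); # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: returns the candidate with the smallest distance
# to $needle, together with that distance.
sub closest {
    my ($needle, @candidates) = @_;
    my ($best, $best_d);
    for my $c (@candidates) {
        my $d = levenshtein($needle, $c);
        ($best, $best_d) = ($c, $d) if !defined $best_d || $d < $best_d;
    }
    return ($best, $best_d);
}
```

Note that `closest` always returns some candidate, so an entry such as "No match here" would still be paired with its nearest string; a distance cutoff suited to the data would be needed to report "no match".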

Answers

1

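If your definition of 'best match' is simply length, as it appears to be from your example, then you can sort your @source array by descending length; as soon as one of them matches, you can skip the rest. Labels are handy for controlling the two loops: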

#!/usr/local/bin/perl 
use strict; 
use warnings; 

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Trent Reznor","Barack Hussein Obama II","Barack Hussein II"); 
my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II","Barack Hussein Obama II","No match here"); 

# print header line 
printf "%-32s\t%-32s\n", 'search', 'source'; 

search: 
for my $se (@search) { 

    source: 
    for my $so (reverse sort {length($a) <=> length($b)} @source) { 

     if ($so =~ /^\Q$se\E/) { 
      printf "%-32s\t%-32s\n", $se, $so; 
      next search; 
     } 
    } 
    print "No match here\n"; 
} 

Output:

$ perl myscript.pl 
search         source 
John Ronald Reuel T      John Ronald Reuel Tolkien 
Trent Reznor       Trent Reznor 
Barack Hussein II      Barack Hussein II 
Barack Hussein Obama II     Barack Hussein Obama II 
No match here 
$ 

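You don't really need a regex or pattern matching to do this, and it would be no faster. There is only one regex above; it is case-sensitive and anchored to the start of the string, so we can use substr instead. This is arguably easier to read and maintain, and it is definitely more efficient (it was twice as fast in my benchmarks):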

 if (substr($so, 0, length($se)) eq $se) {
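For reference, a sketch of how the substr test might fit into a complete script. The names `match_substr` and `match_regex` are hypothetical helpers introduced here, not part of the answer's code, and sorting once outside the loop is an added assumption rather than what the original script does. The optional `bench` argument uses the core Benchmark module to compare the two approaches; timings will vary by machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @source = ("John Ronald Reuel Tolkien", "John Ronald S Tolkien", "Trent Reznor",
              "Barack Hussein Obama II", "Barack Hussein II");
my @search = ("John Ronald Reuel T", "Trent Reznor", "Barack Hussein II",
              "Barack Hussein Obama II", "No match here");

# Sort once, longest first, so the first hit is the longest match.
my @by_length = sort { length($b) <=> length($a) } @source;

# Prefix test with substr: returns the longest source string that
# starts with $se, or nothing (undef in scalar context) if none does.
sub match_substr {
    my ($se) = @_;
    for my $so (@by_length) {
        return $so if substr($so, 0, length($se)) eq $se;
    }
    return;
}

# The same test written with the anchored, quoted regex.
sub match_regex {
    my ($se) = @_;
    for my $so (@by_length) {
        return $so if $so =~ /^\Q$se\E/;
    }
    return;
}

printf "%-32s\t%-32s\n", 'search', 'source';
for my $se (@search) {
    my $hit = match_substr($se);
    printf "%-32s\t%-32s\n", $se, defined $hit ? $hit : '';
}

# Optional timing comparison: run as `perl myscript.pl bench`.
if (@ARGV && $ARGV[0] eq 'bench') {
    cmpthese(-1, {
        regex  => sub { match_regex($_)  for @search },
        substr => sub { match_substr($_) for @search },
    });
}
```

Returning the match instead of printing inside the loop removes the need for loop labels, at the cost of one extra subroutine call per search term.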