2014-08-28 119 views
3

試圖弄清楚最快的方式串過濾的多頭排列,通過尋找成員的子串,下一個方法:查找在列表的成員用Perl

  • $str =~ /\.xml/ - 找到」。 XML 「的地方串
  • $str =~ /\.xml$/中 - 找到名爲」 .xml 「的字符串
  • substr($str,-4) eq ".xml"結束 - 是最後4個字符」 .XML「?
  • rindex($str, ".xml") - 發生任何「.xml」
  • (length($str) - rindex($str,".xml")) == 4 - 是最後4個字符「.xml」嗎?

我嘗試了上述所有與while/if/push和內grep的,下一個代碼(從註釋與想法修訂版)

use 5.016; 
use warnings; 
use Benchmark qw(:all); 

my $nmax = 5_000_000; 
my @list = map { sprintf "a%s.%s", int(rand(100000000)), (int(rand(2))%2?"txt":"xml") } 1..$nmax; 

cmpthese(10, { 
     'whl_match' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if($x =~ /\.xml/ )}; }, 
     'whl_matchend' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if($x =~ /\.xml$/)}; }, 
     'whl_matchendz'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if($x =~ /\.xml\z/)}; }, 
     'whl_substr' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if(substr($x,-4) eq ".xml")}; }, 
     'whl_rindex' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if(rindex($x,".xml") >= 0)}; }, 
     'whl_lenrindex'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};}, 

     'for_match' => sub { my @xml; for my $x (@list) { push(@xml, $x) if($x =~ /\.xml/ )}; }, 
     'for_matchend' => sub { my @xml; for my $x (@list) { push(@xml, $x) if($x =~ /\.xml$/)}; }, 
     'for_matchendz'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if($x =~ /\.xml\z/)}; }, 
     'for_substr' => sub { my @xml; for my $x (@list) { push(@xml, $x) if(substr($x,-4) eq ".xml")}; }, 
     'for_rindex' => sub { my @xml; for my $x (@list) { push(@xml, $x) if(rindex($x,".xml") >= 0)}; }, 
     'for_lenrindex'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};}, 

     'grp_match' => sub { my @xml = grep { /\.xml/ } @list; }, 
     'grp_matchend' => sub { my @xml = grep { /\.xml$/ } @list; }, 
     'grp_matchendz'=> sub { my @xml = grep { /\.xml\z/ } @list; }, 
     'grp_substr' => sub { my @xml = grep { substr($_,-4) eq ".xml" } @list; }, 
     'grp_rindex' => sub { my @xml = grep { rindex($_,".xml") >= 0 } @list; }, 
     'grp_lenrindex'=> sub { my @xml = grep { (length($_) - rindex($_,".xml")) == 4 } @list; }, 
}); 

我的鯉魚筆記本電腦上的結果。

   s/iter whl_matchend whl_matchendz grp_matchendz grp_matchend whl_lenrindex whl_match whl_substr grp_match whl_rindex for_matchend for_matchendz for_lenrindex for_match grp_lenrindex for_substr for_rindex grp_substr grp_rindex 
whl_matchend 4.48   --   -0%   -10%   -12%   -17%  -21%  -24%  -25%  -32%   -47%   -47%   -67%  -70%   -70%  -73%  -73%  -77%  -78% 
whl_matchendz 4.46   0%   --   -9%   -11%   -17%  -21%  -23%  -25%  -32%   -46%   -46%   -67%  -69%   -70%  -73%  -73%  -76%  -78% 
grp_matchendz 4.05   11%   10%   --   -2%   -9%  -13%  -15%  -17%  -25%   -41%   -41%   -63%  -66%   -67%  -70%  -70%  -74%  -76% 
grp_matchend 3.96   13%   13%   2%   --   -6%  -11%  -14%  -15%  -24%   -40%   -40%   -62%  -66%   -66%  -70%  -70%  -73%  -75% 
whl_lenrindex 3.70   21%   21%   9%   7%   --  -5%  -8%  -9%  -18%   -35%   -35%   -60%  -63%   -64%  -67%  -67%  -72%  -73% 
whl_match  3.53   27%   27%   15%   12%   5%  --  -3%  -5%  -14%   -32%   -32%   -58%  -61%   -62%  -66%  -66%  -70%  -72% 
whl_substr  3.42   31%   30%   18%   16%   8%  3%   --  -2%  -12%   -30%   -30%   -57%  -60%   -61%  -65%  -65%  -69%  -71% 
grp_match  3.36   33%   33%   20%   18%   10%  5%   2%  --  -10%   -29%   -29%   -56%  -59%   -60%  -64%  -64%  -69%  -71% 
whl_rindex  3.02   48%   48%   34%   31%   22%  17%  13%  11%   --   -21%   -21%   -51%  -55%   -56%  -60%  -60%  -65%  -67% 
for_matchend 2.40   87%   86%   69%   65%   55%  47%  43%  40%  26%   --   -0%   -38%  -43%   -44%  -50%  -50%  -56%  -59% 
for_matchendz 2.39   87%   87%   69%   65%   55%  47%  43%  40%  26%   0%   --   -38%  -43%   -44%  -50%  -50%  -56%  -59% 
for_lenrindex 1.49   201%   200%   172%   166%   149%  137%  130%  126%  103%   61%   61%   --  -8%   -10%  -19%  -19%  -29%  -33% 
for_match  1.36   229%   227%   197%   191%   172%  159%  151%  146%  122%   76%   76%   9%  --   -2%  -11%  -12%  -23%  -27% 
grp_lenrindex 1.33   237%   236%   204%   198%   178%  165%  157%  153%  127%   80%   80%   12%  2%   --  -9%  -9%  -21%  -26% 
for_substr  1.21   271%   270%   235%   228%   207%  192%  184%  178%  150%   98%   98%   23%  13%   10%   --  -0%  -13%  -18% 
for_rindex  1.20   272%   271%   236%   229%   208%  193%  184%  179%  151%   99%   99%   23%  13%   10%   0%   --  -13%  -18% 
grp_substr  1.05   326%   325%   285%   277%   252%  235%  226%  220%  188%   128%   128%   41%  30%   27%  15%  15%   --  -6% 
grp_rindex  0.990   352%   351%   309%   300%   274%  256%  246%  239%  205%   142%   142%   50%  38%   34%  22%  22%   6%   -- 

我重複了很多次測試,總是得到了上面的順序。


問題1:

正如我所料,grep更快的相似while/if/push,但讓我吃驚的未來:

比較:

   s/iter 
whl_matchend 4.54 
grp_matchend 3.98 

grep稍快while/if/push相似。

爲什麼例如在下一:

whl_substr  3.23 
grp_substr  1.05 

grep3倍的速度作爲while/if/push。那麼,爲什麼grep在執行substr時比while/if/push快近3倍,如果使用/regex-match/時爲何不快(僅爲14%)?同樣,這可以看作是任何「字符串操作」。

隨着換句話說,grep {/regex/}只有輕微的速度遞增while/if/push以上但grep {substr}巨大的速度增加。 爲什麼


問題2

另一個(至少對我來說)驚喜的是未來:爲什麼$str =~ /\.xml/是更快$str =~ /\.xml$/?我期待,不是指定的$將加快rexex,因爲不需要整個字符串搜索 - 但這是一個錯誤的假設,與下一太測試:

use 5.016; 
use warnings; 
use Benchmark qw(:all); 

my $str = "a38877283.xml"; 
cmpthese(10, { 
    'match'  => sub { $str =~ /\.xml/ for (1..5_000_000) }, 
    'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) }, 
    'matchendz' => sub { $str =~ /\.xml\z/ for (1..5_000_000) }, #updated the \z 
}); 

perl 5, version 20, subversion 0 (v5.20.0) built for darwin-2level(perlbrew )

  s/iter matchend matchendz  match 
matchend 2.32  --  -1%  -64% 
matchendz 2.30  1%  --  -63% 
match  0.844  175%  173%  -- 

隨着:perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level(默認OS X)

   Rate matchendz  match matchend 
matchendz 0.405/s  --  -69%  -70% 
match  1.29/s  218%  --  -5% 
matchend 1.36/s  235%  5%  -- 

perl的舊的更快。 ;)

操作系統:

Darwin jabko.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun 3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64 

最後一個問題

  • 仍然沒有測試預編譯的正則表達式qr的效果 - 任何其他的想法有什麼可以最快的過濾器?
+0

你上'rindex'測試應該是'> = 0'。值得一試'sub {my @ xml;對於我的$ x(@list){push @xml,$ x if $ x =〜/\.xml/}}等等。 ' – Borodin 2014-08-28 10:52:29

+0

thanx,要添加並將更新與結果的問題。 – jm666 2014-08-28 10:54:33

+0

@Borodin是的,'for'肯定比'while'快得多。更新了結果。 – jm666 2014-08-28 11:26:46

回答

2

關於最後一個問題,嘗試\z代替$\z匹配字符串的結束,同時也$尋找可選尾隨換行符(perldoc perlre)。

use Benchmark qw(:all); 
my $str = "a38877283.xml"; 
cmpthese(10, { 
    'match' => sub { $str =~ /\.xml/ for (1..5_000_000) }, 
    'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) }, 
    'matchend2' => sub { $str =~ /\.xml\z/ for (1..5_000_000) }, 
}); 

輸出

   Rate matchend2 matchend  match 
matchend2 0.473/s  --  -58%  -59% 
matchend 1.14/s  140%  --  -1% 
match  1.15/s  143%  1%  -- 
+1

@ jm666順便說一句,我想你的'while'會比較慢,因爲它會調用'each'。 – 2014-08-28 10:42:36

+0

好的,可能是的,但是爲什麼'while'與'grep'在做'regex'和'substr'時有區別呢?所有'while'使用相同(無效)的'each'。 – jm666 2014-08-28 10:45:48

+0

@ jm666它使用的是perl 5.10.1 Hm,在同一臺機器上使用5.20給了'match'很大的優勢。 :) – 2014-08-28 11:43:34