什麼是比較perl中的字符串數組的最佳方法

我想比較多個包含目錄文件列表的字符串數組。目標是確定每個目錄中存在哪些文件以及哪些文件不存在。試想一下：什麼是比較perl中的字符串數組的最佳方法

List1 List2 List3 List4 
a  a  e  f 
b  b  d  g 
c  f  a  h

結果應該是：

列表1：

 List1 List2 List3 List4 
a  yes  yes  yes  no 
b  yes  yes  no  no 
c  yes  no  no  no

列表2：

 List1 List2 List3 List4 
a  yes  yes  yes  no 
b  yes  yes  no  no 
f  no  yes  no  yes

...

我可以去通過所有的數組並瀏覽每個條目，經過所有其他陣列和做的grep：

for my $curfile (@currentdirfiles) { 
    if(grep(/$curfile/, @otherarrsfiles)) { 
     // Set 'yes' 
    } else { 
     // set 'no' 
    } 
}

我唯一擔心的是，我與幅度的0^2N爲了結束了。我可能無法做任何事情，因爲我最終會循環遍歷所有的數組。 grep函數中可能有一個改進，但我不確定。

有什麼想法？

來源

2011-04-27 EDJ

現在問題已經修改，這產生了你想要的答案。它在O（n ）時間內工作，這對於問題是最佳的（有輸出）。

#!/usr/bin/env perl 

use strict; 
use warnings; 

#List1 List2 List3 List4 
#a  a  e  f 
#b  b  d  g 
#c  f  a  h 

my(@lists) = ({ a => 1, b => 1, c => 1 }, 
       { a => 1, b => 1, f => 1 }, 
       { e => 1, d => 1, a => 1 }, 
       { f => 1, g => 1, h => 1 }, 
      ); 

my $i = 0; 
foreach my $list (@lists) 
{ 
    analyze(++$i, $list, @lists); 
} 

sub analyze 
{ 
    my($num, $ref, @lists) = @_; 
    printf "List %d\n", $num; 

    my $pad = "  "; 
    foreach my $i (1..4) 
    { 
     print "$pad List$i"; 
     $pad = ""; 
    } 
    print "\n"; 

    foreach my $file (sort keys %{$ref}) 
    { 
     printf "%-8s", $file; 
     foreach my $list (@lists) 
     { 
      my %dir = %{$list}; 
      printf "%-8s", (defined $dir{$file}) ? "yes" : "no"; 
     } 
     print "\n"; 
    } 
    print "\n"; 
}

我得到的輸出是：

List 1 
     List1 List2 List3 List4 
a  yes  yes  yes  no  
b  yes  yes  no  no  
c  yes  no  no  no  

List 2 
     List1 List2 List3 List4 
a  yes  yes  yes  no  
b  yes  yes  no  no  
f  no  yes  no  yes  

List 3 
     List1 List2 List3 List4 
a  yes  yes  yes  no  
d  no  no  yes  no  
e  no  no  yes  no  

List 4 
     List1 List2 List3 List4 
f  no  yes  no  yes  
g  no  no  no  yes  
h  no  no  no  yes

來源

2011-04-27 05:09:11

'定義'是什麼讓我更容易忍受，而在哈希工作將是最有效地搜索數百和數千行（文件）的事實。謝謝。 – EDJ 2011-04-27 18:27:00

對於大量的字符串查找，您通常希望使用散列。下面是做這件事的一種方法：

use strict; 
use warnings; 

# Define the lists: 
my @lists = (
    [qw(a b c)], # List 1 
    [qw(a b f)], # List 2 
    [qw(e d a)], # List 3 
    [qw(f g h)], # List 4 
); 

# For each file, determine which lists it is in: 
my %included; 

for my $n (0 .. $#lists) { 
    for my $file (@{ $lists[$n] }) { 
    $included{$file}[$n] = 1; 
    } # end for each $file in this list 
} # end for each list number $n 

# Print out the results: 
my $fileWidth = 8; 

for my $n (0 .. $#lists) { 

    # Print the header rows: 
    printf "\nList %d:\n", $n+1; 

    print ' ' x $fileWidth; 
    printf "%-8s", "List $_" for 1 .. @lists; 
    print "\n"; 

    # Print a line for each file: 
    for my $file (@{ $lists[$n] }) { 
    printf "%-${fileWidth}s", $file; 

    printf "%-8s", ($_ ? 'yes' : 'no') for @{ $included{$file} }[0 .. $#lists]; 
    print "\n"; 
    } # end for each $file in this list 
} # end for each list number $n

來源

2011-04-27 05:05:26 cjm

你對哈希的評論幫了我大忙。謝謝。 – EDJ 2011-04-27 18:27:54

最明顯的方法是使用perl5i和自動裝箱：

use perl5i; 
my @list1 = qw(one two three); 
my @list2 = qw(one two four);  

my $missing = @list1 -> diff(\@list2); 
my $both = @list1 -> intersect(\@list2);

在更嚴格的設置，使用哈希此作爲文件名，將是獨一無二的：

sub in_list { 
    my ($one, $two) = @_; 
    my (@in, @out); 
    my %a = map {$_ => 1} @$one; 

    foreach my $f (@$two) { 
     if ($a{$f}) { 
      push @in, $f; 
     } 
     else { 
      push @out, $f; 
     } 
    } 
    return (\@in, \@out); 
} 

my @list1 = qw(one two three); 
my @list2 = qw(one two four);  
my ($in, $out) = in_list(\@list1, \@list2); 

print "In list 1 and 2:\n"; 
print " $_\n" foreach @$in; 

print "In list 2 and not in list 1\n"; 
print " $_\n" foreach @$out;

來源

2011-04-27 05:06:18 Alex

爲什麼不記住每個文件是當你在閱讀它們。

比方說，喲u有一個目錄列表從@dirlist閱讀：

use File::Slurp qw(read_dir); 
my %in_dir; 
my %dir_files; 

foreach my $dir (@dirlist) { 
    die "No such directory $dir" unless -d $dir; 
    foreach my $file (read_dir($dir)) { 
     $in_dir{$file}{$dir} = 1; 
     push @{ $dir_files{$dir} }, $file; 
    } 
}

現在$in_dir{filename}將每個感興趣的目錄中定義條目， $dir_files{directory}將爲每個目錄中的文件列表...

foreach my $dir (@dirlist) { 
    print "$dir\n"; 
    print join("\t", "", @dirlist); 
    foreach my $file (@{ $dir_files{$dir} }) { 
     my @info = ($file); 
     foreach my $dir_for_file (@dirlist) { 
      if (defined $in_dir{$file}{$dir_for_file}) { 
       push @info, "Yes"; 
      } else { 
       push @info, "No"; 
      } 
     } 
     print join("\t", @info), "\n"; 
    } 
}

來源

2011-04-27 05:07:29 unpythonic

謝謝，但列表以文件形式發送給我。我不是從目錄中讀取的。但是，好點雖然:) – EDJ 2011-04-27 18:25:08

我的代碼更簡單，但輸出是不太你想要什麼：

@lst1=('a', 'b', 'c'); 
@lst2=('a', 'b', 'f'); 
@lst3=('e', 'd', 'a'); 
@lst4=('f', 'g', 'h'); 

%hsh=(); 

foreach $item (@lst1) { 
    $hsh{$item}="list1"; 
} 

foreach $item (@lst2) { 
    if (defined($hsh{$item})) { 
     $hsh{$item}=$hsh{$item}." list2"; 
    } 
    else { 
     $hsh{$item}="list2"; 
    } 
} 

foreach $item (@lst3) { 
    if (defined($hsh{$item})) { 
     $hsh{$item}=$hsh{$item}." list3"; 
    } 
    else { 
     $hsh{$item}="list3"; 
    } 
} 

foreach $item (@lst4) { 
    if (defined($hsh{$item})) { 
     $hsh{$item}=$hsh{$item}." list4"; 
    } 
    else { 
     $hsh{$item}="list4"; 
    } 
} 

foreach $key (sort keys %hsh) { 
    printf("%s %s\n", $key, $hsh{$key}); 
}

給出：

a list1 list2 list3 
b list1 list2 
c list1 
d list3 
e list3 
f list2 list4 
g list4 
h list4

來源

2011-04-27 05:12:05

-2

我會建立一個散列使用目錄條目作爲包含哈希鍵（實際盟友設置）的每個列表中找到。迭代每個列表，每個新條目將其添加到外部散列，並使用包含首次遇到列表標識符的單個集合（或散列）。對於在散列中找到的任何條目，只需將當前列表標識符添加到值的集合/散列。

從那裏你可以簡單地發佈處理散列的排序鍵，並創建結果表的行。

我個人認爲Perl是醜陋的，但這裏是Python中的示例：

#!/usr/bin/env python 
import sys 
if len(sys.argv) < 2: 
    print >> sys.stderr, "Must supply arguments" 
    sys.exit(1) 
args = sys.argv[1:] 

# build hash entries by iterating over each listing 
d = dict() 
for each_file in args: 
    name = each_file 
    f = open(each_file, 'r') 
    for line in f: 
     line = line.strip() 
     if line not in d: 
      d[line] = set() 
     d[line].add(name) 
    f.close() 

# post process the hash 
report_template = "%-20s" + (" %-10s" * len(args)) 
print report_template % (("Dir Entries",) + tuple(args)) 
for k in sorted(d.keys()): 
    row = list() 
    for col in args: 
     row.append("yes") if col in d[k] else row.append("no") 
    print report_template % ((k,)+tuple(row))

這應該主要是清晰可辨，就像它是僞代碼。 (k,)和("Dir Entries",)表達式可能看起來有點奇怪;但這是爲了強制它們是使用%運算符將字符串解壓縮爲格式字符串所必需的元組。例如，這些也可以寫爲tuple([k]+row)（包裝[]中的第一項使其成爲可以添加到另一個列表並且全部轉換爲元組的列表）。

除此之外，Perl的翻譯應該非常簡單，只需使用散列而不是字典和集合。（順便說一下，這個例子可以處理任意數量的列表，作爲參數提供，並以列的形式輸出。很明顯，在十幾列之後，輸出會變得很麻煩，不便於打印或顯示;但這是一個簡單的概括做）。

來源

2011-04-27 05:30:28

請不要在Python中回答Perl問題。大多數時候，人們沒有選擇使用哪種語言。他們的老闆已經爲他們做出了決定。 – shawnhcorey 2011-04-27 13:23:30

我表示，其目的是爲了讀取僞代碼。它恰好也是Python的事實與它基本上是正交的（除了它允許我測試它的小事實之外）。 – 2011-04-27 19:44:24

對不起，遲到的回覆，我一直拋光這一段時間，因爲我不想再有一個負面評分（打消我）。

這是一個有趣的效率問題。我不知道我的解決方案是否適合你，但我想我會分享它。只有當你的數組沒有太頻繁地改變時，以及你的數組是否包含許多重複值，這可能是有效的。我沒有對它進行任何效率檢查。

基本上，解決方案是通過將數組值轉換爲位來移除交叉檢查的一個維度，並且一次對整個數組進行按位比較。數組值被刪除，排序並給出一個序列號。陣列總序列號然後通過按位或以單個值存儲。單個陣列可以由此被檢查一個序列號只有一個的操作，例如：

if (array & serialno)

這將需要一個運行準備數據，然後可將其保存在高速緩存或相似。這些數據可以在數據更改之前使用（例如文件/文件夾被刪除或添加）。我在未定義的值上添加了一個致命的退出，這意味着數據在發生時必須刷新。

祝你好運！

use strict; 
use warnings; 

my @list1=('a', 'b', 'c'); 
my @list2=('a', 'b', 'f'); 
my @list3=('e', 'd', 'a'); 
my @list4=('f', 'g', 'h'); 

# combine arrays 
my @total = (@list1, @list2, @list3, @list4); 

# dedupe (Thanks Xetius for this code snippet) 
my %unique =(); 
foreach my $item (@total) 
{ 
    $unique{$item} ++; 
} 
# Default sort(), don't think it matters 
@total = sort keys %unique; 

# translate to serial numbers 
my %serials =(); 
for (my $num = 0; $num <= $#total; $num++) 
{ 
    $serials{$total[$num]} = $num; 
} 

# convert array values to serial numbers, and combine them 
my @tx =(); 
for my $entry (@list1) { $tx[0] |= 2**$serials{$entry}; } 
for my $entry (@list2) { $tx[1] |= 2**$serials{$entry}; } 
for my $entry (@list3) { $tx[2] |= 2**$serials{$entry}; } 
for my $entry (@list4) { $tx[3] |= 2**$serials{$entry}; } 

&print_all; 

sub inList 
{ 
    my ($value, $list) = @_; 
    # Undefined serial numbers are not accepted 
    if (! defined ($serials{$value})) { 
      print "$value is not in the predefined list.\n"; 
      exit; 
    } 
    return (2**$serials{$value} & $tx[$list]); 
} 

sub yesno 
{ 
    my ($value, $list) = @_; 
    return (&inList($value, $list) ? "yes":"no"); 
} 
# 
# The following code is for printing purposes only 
# 
sub print_all 
{ 
    printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4"; 
    print "-" x 33, "\n"; 
    &table_print(@list1); 
    &table_print(@list2); 
    &table_print(@list3); 
    &table_print(@list4); 
} 

sub table_print 
{ 
    my @list = @_; 
    for my $entry (@list) { 
     printf "%-6s %-6s %-6s %-6s %-6s\n", $entry, 
      &yesno($entry, 0), 
      &yesno($entry, 1), 
      &yesno($entry, 2), 
      &yesno($entry, 3); 
    } 
    print "-" x 33, "\n"; 
}

來源

2011-04-27 21:34:41 TLP

什麼是比較perl中的字符串數組的最佳方法

回答

相關問題