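Counting string matches and determining which sentences the matches can be found in

I'm doing this in Perl. I have a text file containing several paragraphs and 61 sentences. First, I need to match a series of words given on the command line, which I have no trouble doing: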

my $input = $ARGV[0]; 
$file =~ m/$input/gi;

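Unfortunately, there are some wrinkles: 1. the input can consist of multiple terms, and 2. the terms can appear on different lines.

Let me show you an example: 3 sentences match the pattern "fall|election|2009". The sentences are:

4: "We hate elections." 16: "The dog was injured in the fall from the balcony." 24: "There will be no election in the fall of 2009." where fall|election|2009 is the input.

In this case, the program finds every sentence that contains fall, election, or 2009, and counts three such sentences in the document.

My question is twofold. First, how do I count the number of sentences in which the input appears? I'm very inexperienced with regex, but I would think that a default match tries to match only the first occurrence of fall, election, or 2009 in the file, and does not count how many instances of each word there are and then add them up. I'm a bit worried because I don't understand regular expressions.

The second part of my question is how to find which sentence the input was found in (e.g. election appears in sentence 4) and how to extract the whole sentence that contains the input. I think this would be done with an if: if the string contains a match for the input, then a new scalar equals the text file =~ a substitution? of that sentence... I'm completely unsure.

Edit: I actually have a fully parsed HTML document that I'm performing this on. If printed, an example of the output is: "The Journal is now on Facebook! The view is this in progress, we're hungry for your feedback. So let us know what you think on our discussion board, in the comments below, or by emailing us. Get breaking news, insider information and curiosities by following the Journal on Twitter. Here are some feeds and writers you may want to follow:"

My command line looks like this: perl WebScan.pl information|writers WebPage000.htm

As mentioned, I have parsed the web page and stripped all the tags, leaving only the text. Now I have to find the input, in this case "information" or "writers". I have to find out how many times they occur in the file (so 2), and which sentences they appear in (5 and 6 respectively). Here is my code so far: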

use strict; 
use warnings; 
my $file; 
open (FILENAME, $ARGV[1]); 
$file = do { local $/; <FILENAME> }; 

$file =~ s{ 
    <    # open tag 
    (?:    # open group (A) 
    (!--) |  # comment (1) or 
    (\?) |  # another comment (2) or 
    (?i:   # open group (B) for /i 
     (   #  one of start tags 
     SCRIPT | #  for which 
     APPLET | #  must be skipped 
     OBJECT | #  all content 
     STYLE  #  to correspond 
    )   #  end tag (3) 
    ) |   # close group (B), or 
    ([!/A-Za-z]) # one of these chars, remember in (4) 
)    # close group (A) 
    (?(4)   # if previous case is (4) 
    (?:   # open group (C) 
     (?!   #  and next is not : (D) 
     [\s=]  #  \s or "=" 
     ["`']  #  with open quotes 
    )   #  close (D) 
     [^>] |  #  and not close tag or 
     [\s=]  #  \s or "=" with 
     `[^`]*` | #  something in quotes ` or 
     [\s=]  #  \s or "=" with 
     '[^']*' | #  something in quotes ' or 
     [\s=]  #  \s or "=" with 
     "[^"]*"  #  something in quotes " 
    )*   # repeat (C) 0 or more times 
    |    # else (if previous case is not (4)) 
    .*?   # minimum of any chars 
)    # end if previous char is (4) 
    (?(1)   # if comment (1) 
    (?<=--)  # wait for "--" 
)    # end if comment (1) 
    (?(2)   # if another comment (2) 
    (?<=\?)  # wait for "?" 
)    # end if another comment (2) 
    (?(3)   # if one of tags-containers (3) 
    </   # wait for end 
    (?i:\3)  # of this tag 
    (?:\s[^>]*)? # skip junk to ">" 
)    # end if (3) 
    >    # tag closed 
}{}gsx;   # STRIP THIS TAG 
$file =~ s/&nbsp//gi; 
$file =~ s/&#160//gi; 
$file =~ s/;//gi; 

$file =~ s/[\h\v]+/ /g; 

my $count = $file =~ s/((^|\s)\S)/$2/g; 
my $sentencecount = $file =~ s/((^|\s)\S).*?(\.|\?|\!)/$1/g; 

print "Input file $ARGV[1] contains $sentencecount sentences and $count words.";

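So I need Perl to search through the text file using $ARGV[0] as the keyword pattern and count how many times the keywords appear. Then I need to report which sentence each keyword appears in (i.e. print the whole sentence) together with that sentence's number.

Source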

2011-01-31 Sheldon

-1

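Edited to match the updated question

OK, let me start with a truism: don't try to parse HTML yourself. HTML::TreeBuilder is your friend.

For regular expressions, perlfaq6 is a good source of knowledge.

The following example is invoked like this: perl WebScan.pl --regex="information|writers" --filename=WebPage000.htm.

It prints a list of paragraphs and the matches found in each.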

#!/usr/bin/perl 
use warnings; 
use strict; 

use HTML::TreeBuilder; 
use Data::Dumper; 
use Getopt::Long; 

my @regexes; 
my $filename; 
GetOptions('regex=s' => \@regexes, 'filename=s' => \$filename); 

my $tb = HTML::TreeBuilder->new_from_file($filename); 
$tb->normalize_content; 

my @patterns = map { qr/$_/ } @regexes; 

my @all; 
foreach my $node ($tb->find_by_tag_name('p', 'pre', 'blockquote')) { 
    my $text = $node->as_text; 
    my @matches; 
    foreach my $r (@patterns) { 
     while ($text =~ /$r/gi) { 
      push @matches, $&; 
     } 
    } 
    push @all, { paragraph => $text, matches => \@matches } if @matches; 
} 

foreach (@all) { 
    print "Paragraph:\n\t$_->{paragraph}\nMatches:\n\t", join(', ', @{$_->{matches}}), "\n"; 
}

Hope this points you in the right direction.
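The code above reports matches per paragraph, while the question asks for sentence numbers and a total count. A minimal sketch of that step follows, assuming the text has already been extracted and that a sentence ends at ".", "?" or "!" followed by whitespace. The helpers split_sentences and find_in_sentences are hypothetical names used only for illustration, and the naive split will mis-handle abbreviations such as "e.g.".

```perl
use strict;
use warnings;

# Hypothetical helper: split text into sentences at ., ? or !
# followed by whitespace (a simplifying assumption).
sub split_sentences {
    my ($text) = @_;
    return grep { /\S/ } split /(?<=[.?!])\s+/, $text;
}

# Hypothetical helper: returns the total number of matches, followed by
# one [sentence number, sentence text, hits in that sentence] entry
# for every sentence that matched at least once.
sub find_in_sentences {
    my ($text, $pattern) = @_;
    my $re = qr/$pattern/i;
    my $total = 0;
    my @hits;
    my $n = 0;
    for my $sentence (split_sentences($text)) {
        $n++;
        my $count = 0;
        $count++ while $sentence =~ /$re/g;   # /g in a loop visits every match
        next unless $count;
        $total += $count;
        push @hits, [$n, $sentence, $count];
    }
    return ($total, @hits);
}

# Made-up sample text for the demonstration.
my $text = 'We hate elections. The dog was injured in the fall from the balcony. '
         . 'Nothing to see here! There will be no election in the fall of 2009.';
my ($total, @hits) = find_in_sentences($text, 'fall|election|2009');

print "$total matches in ", scalar(@hits), " sentences\n";
print "Sentence $_->[0] ($_->[2] hits): $_->[1]\n" for @hits;
```

With this sample text the sketch reports 5 matches in 3 sentences (sentences 1, 2 and 4). Counting inside a while loop rather than with a single match is what addresses the worry that only the first occurrence would be found.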

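Source

2011-01-31 04:03:54

I edited my question, which hopefully makes things clearer. Your code does something very similar to what I want; I just don't know how to implement it. – Sheldon 2011-01-31 04:39:24

It's not clear whether your sentences have delimiters (or whether you have some separation criterion). If they do, and if I understood your problem correctly, you can do something like this: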

use strict;
use warnings;

my @words = qw/hi bye 2009 a*d/;
my @lines = ('Lets see , hi ',
' hi hi hi ',
' asdadasdas ',
'a2009a',
'hi bye');

# Escape each word so regex metacharacters (such as the * in a*d)
# are matched literally, then join the alternatives with |.
my $pattern = join '|', map { quotemeta } @words;
print "pattern='$pattern'\n";

# Count the lines that contain at least one of the words.
my $cont = 0;
foreach my $line (@lines) {
    $cont++ if $line =~ /$pattern/;
}

printf "%d/%d lines matched\n", $cont, scalar(@lines);

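I build the pattern with quotemeta escaping in case the words contain special characters (as with a*d in my example, which we don't want to be treated as a regex and match).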
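To see what the escaping changes, here is a small sketch that tries the word a*d against one of the sample lines, once as a raw regex and once after quotemeta. The variable names $raw and $escaped are introduced here for illustration only.

```perl
use strict;
use warnings;

my $word = 'a*d';
my $line = ' asdadasdas ';

my $raw     = $word;             # as a regex: zero or more "a", then "d"
my $escaped = quotemeta($word);  # a\*d, matches only the literal text a*d

print "escaped pattern: $escaped\n";
print "raw matches:     ", ($line =~ /$raw/     ? "yes" : "no"), "\n";
print "escaped matches: ", ($line =~ /$escaped/ ? "yes" : "no"), "\n";
```

The raw pattern matches the line because any "d" satisfies it, while the escaped pattern does not, since the literal text a*d never occurs there.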

Source

2011-01-31 04:12:16 leonbloy
