Some help with Perl HTML parsing

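I am working on a small Perl program that opens a web page, searches for the words Hail Reports, and gives me back the information. I am very new to Perl, so some of this may be simple to fix. First, my code warns that I am using an uninitialized value. Here is what I have: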

#!/usr/bin/perl -w 
use LWP::Simple; 

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html") 
    or die "Could not fetch NWS page."; 
$html =~ m{Hail Reports} || die; 
my $hail = $1; 
print "$hail\n";

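Second, I thought a regular expression would be the easiest way to do what I want, but I am not sure whether it can be done with one. I want my program to search for Hail Reports and send me back the information between the words Hail Reports and Wind Reports. Is this possible with a regular expression, or should I use a different method? Here is a snippet of the web page source that I want it to send back in $1: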

 <tr><th colspan="8">Hail Reports (<a href="last3hours_hail.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_hail.csv">Raw Hail CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr> 

#The Data here will change throughout the day so normally there will be more info. 
     <tr><td colspan="8" class="highlight" align="center">No reports received</td></tr> 
     <tr><th colspan="8">Wind Reports (<a href="last3hours_wind.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_wind.csv">Raw Wind CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr>


2010-07-02 shinjuo

Could you try using XPath? – 2010-07-02 19:46:31

You are capturing nothing because no part of your regex is enclosed in parentheses. The following worked for me.

#!/usr/bin/perl 
use strict; 
use warnings; 

use LWP::Simple; 

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html") 
    or die "Could not fetch NWS page."; 

$html =~ m{Hail Reports(.*)Wind Reports}s || die; #Parentheses indicate capture group 
my $hail = $1; # $1 contains whatever matched in the (.*) part of above regex 
print "$hail\n";
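The same capture can be tried offline against a hard-coded string. The following is a minimal sketch, assuming a shortened copy of the snippet from the question stands in for the downloaded page; `$sample` and `extract_between` are hypothetical names used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: a shortened stand-in for the page source, not the live NWS page.
my $sample = <<'HTML';
<tr><th colspan="8">Hail Reports (CSV)</th></tr>
<tr><td colspan="8">No reports received</td></tr>
<tr><th colspan="8">Wind Reports (CSV)</th></tr>
HTML

# Hypothetical helper: returns the text between the two headings,
# or undef when the pattern does not match.
sub extract_between {
    my ($text) = @_;
    # In list context a successful match returns the captured groups,
    # so $1 is never read after a match that may have failed.
    my ($between) = $text =~ m{Hail Reports(.*)Wind Reports}s;
    return $between;
}

my $hail = extract_between($sample);
print defined $hail ? "$hail\n" : "No match\n";
```

Checking the return value before printing avoids the uninitialized-value warning when the headings are missing from the page.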


2010-07-02 19:56:44 d5e5

Thanks, that covered both questions nicely. – shinjuo 2010-07-02 20:02:07

Parentheses capture strings in a regex. Your regex has no parentheses, so $1 is never set to anything. If you had:

$html =~ m{(Hail Reports)} || die;

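then $1 would be set to "Hail Reports" if that text exists in the $html variable. Since you only want to know whether it matched, you don't really need to capture anything at this point, and you could write it like this: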

unless ($html =~ /Hail Reports/) { 
    die "No Hail Reports in HTML"; 
}

To capture whatever lies between the two strings, you could do something like:

if ($html =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s) { 
    print "Got $1\n"; 
}
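The effect of the /s modifier on that lookaround pattern can be checked on a small string. This is only a sketch, assuming a made-up three-line string in place of the real page; `$text`, `$without_s`, `$with_s` and `$captured` are illustrative names.

```perl
use strict;
use warnings;

# Assumption: a made-up three-line string standing in for the page source.
my $text = "Hail Reports\nNo reports received\nWind Reports\n";

# Without /s the dot stops at a newline, so the match cannot span lines.
my $without_s = $text =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/ ? 1 : 0;

# With /s the dot also matches newlines, so the capture can span lines.
my $with_s = $text =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s ? 1 : 0;

# $1 belongs to the last successful match, which is the /s one here.
my $captured = $with_s ? $1 : undef;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
print "captured:   [$captured]\n" if defined $captured;
```

Because the headings sit on different lines of the page, the pattern only finds anything once the dot is allowed to cross line breaks.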


2010-07-02 19:57:06 runrig

You need the regex 's' modifier so that the dot matches newlines, i.e. =~ /.../s – 2010-07-02 20:02:50

Thanks. Updated. – runrig 2010-07-02 20:05:25

The uninitialized value warning comes from $1: it is not defined or set anywhere.

For a line-level rather than a byte-level "between", you can use:

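for (split(/\n/, $html)) {
    # The flip-flop is true from the Hail line through the Wind line; the two heading lines are skipped.
    # split removed the newlines, so add one back when printing.
    print "$_\n" if (/Hail Reports/ .. /Wind Reports/ and !/(?:Hail|Wind) Reports/);
}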
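The line-level approach can be tried without fetching the page. The following is a rough sketch, assuming a made-up page with a single row between the two headings; `$page` and `@between` are hypothetical names, and the Tornado heading is invented only to show that lines outside the range are ignored.

```perl
use strict;
use warnings;

# Assumption: a made-up page with one data row between the two headings.
my $page = join "\n",
    '<tr><th>Tornado Reports</th></tr>',
    '<tr><td>ignored row</td></tr>',
    '<tr><th>Hail Reports</th></tr>',
    '<tr><td>No reports received</td></tr>',
    '<tr><th>Wind Reports</th></tr>',
    '<tr><td>also ignored</td></tr>';

my @between;
for my $line (split /\n/, $page) {
    # In scalar context the range operator acts as a flip-flop: it turns
    # true on the Hail heading and stays true through the Wind heading.
    next unless $line =~ /Hail Reports/ .. $line =~ /Wind Reports/;
    # Drop the two heading lines themselves.
    next if $line =~ /(?:Hail|Wind) Reports/;
    push @between, $line;
}

print "$_\n" for @between;
```

Collecting the lines into an array instead of printing them directly makes it easy to process each report row afterwards.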


2010-07-02 20:03:32

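Use the single-line (/s) and multi-line (/m) modifiers. Also, the non-greedy match stops at the first "Wind Reports" that follows "Hail Reports", which can be a bit faster than a greedy match.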

#!/usr/bin/perl -w 

use strict; 
use LWP::Simple; 

sub main {
    my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
        or die "Could not fetch NWS page.";

    # match single and multiple lines + not greedy
    my ($hail, $between, $wind) = $html =~ m/(Hail Reports)(.*?)(Wind Reports)/sm
        or die "No Hail/Wind Reports";

    print qq{
        Hail:   $hail
        Wind:   $wind
        Between Text: $between
    };
}

main();
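The difference between the greedy and the non-greedy form only shows up when the closing text occurs more than once. Here is a small sketch, assuming a made-up string in which "Wind Reports" appears twice; `$text`, `$greedy` and `$lazy` are illustrative names. The /m modifier is left out because it only changes how ^ and $ behave, and this pattern uses neither.

```perl
use strict;
use warnings;

# Assumption: a made-up string in which "Wind Reports" appears twice.
my $text = "Hail Reports [A] Wind Reports [B] Wind Reports";

# Greedy: (.*) runs up to the LAST "Wind Reports".
my ($greedy) = $text =~ /Hail Reports(.*)Wind Reports/s;

# Non-greedy: (.*?) stops at the FIRST "Wind Reports".
my ($lazy) = $text =~ /Hail Reports(.*?)Wind Reports/s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

On a page where each heading appears exactly once, both forms capture the same text, so the choice mainly matters if the page layout changes.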


2010-07-03 00:48:14
