How can I extract or change links in HTML with Perl?

I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603"> <tbody><tr>  <td><table cellspacing="0" cellpadding="0" border="0" width="603">  <tbody><tr>   <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>   <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>  </tr>  </tbody></table></td> </tr> <tr>  <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">  <tbody><tr>   <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>  </tr>  <tr>   <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>   <td width="580"><p>&nbsp;what y all heard?</p><p>i'm shark oysters.</p>    <p>&nbsp;</p>    <p>&nbsp;</p>    <p>&nbsp;</p>    <p>&nbsp;</p>    <p>&nbsp;</p>    <p>&nbsp;</p></td>   <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>  </tr>  <tr>   <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>  </tr>  </tbody></table></td> </tr> </tbody></table> <p>&nbsp;</p></body></html>

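As you can see, this chunk of HTML text contains no newlines. I need to find all the image links inside it, copy the files out to a directory, and change the links in the text to something like ./images/file_name.

Currently, I'm using Perl code that looks like this: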

my ($old_src,$new_src,$folder_name); 
    foreach my $record (@readfile) { 
     ## so the if else case for the url replacement block below will be correct 
     $old_src = ""; 
     $new_src = ""; 
     if ($record =~ /\<img(.+)/){ 
      if($1=~/src=\"((\w|_|\\|-|\/|\.|:)+)\"/){ 
       $old_src = $1; 
       my @tmp = split(/\/Elearning/,$old_src); 
       $new_src = "/media/www/vprimary/Elearning".$tmp[-1]; 
       push (@images, $new_src); 
       $folder_name = "images"; 
      }## end if 
     } 
     elsif($record =~ /background=\"(.+\.jpg)/){ 
      $old_src = $1; 
      my @tmp = split(/\/Elearning/,$old_src); 
      $new_src = "/media/www/vprimary/Elearning".$tmp[-1]; 
      push (@images, $new_src); 
      $folder_name = "images"; 
     } 
     elsif($record=~/\<iframe(.+)/){ 
      if($1=~/src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/){ 
       $old_src = $1; 
       my @tmp = split(/\/Elearning/,$old_src); 
       $new_src = "/media/www/vprimary/Elearning".$tmp[-1]; 
       ## remove the ?rand behind the html file name 
       if($new_src=~/\?rand/){ 
        my ($fname,$rand) = split(/\?/,$new_src); 
        $new_src = $fname; 
        my ($fname,$rand) = split(/\?/,$old_src); 
        $old_src = $fname."\\?".$rand; 
       } 
     print "old_src::$old_src\n"; ##s7test 
     print "new_src::$new_src\n\n"; ##s7test 
       push (@iframes, $new_src); 
       $folder_name = "iframes"; 
      }## end if 
     }## end if 

     my $new_record = $record; 
     if($old_src && $new_src){ 
      $new_record =~ s/$old_src/$new_src/ ; 
    print "new_record:$new_record\n"; ##s7test 
      my @tmp = split(/\//,$new_src); 
      $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/; 
## print "new_record2:$new_record\n\n"; ##s7test 
     }## end if 
     print WRITEFILE $new_record; 
    } # foreach

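This is only good enough for HTML text that has newlines in it. I thought of just looping the regex, but then I would have to change the matched line to some other text.

Do you have any idea whether there's an elegant Perl way to do this? Or maybe I'm just too dumb to see the obvious way of doing it. Also, I know that adding the global option doesn't work.

Thanks. ~steve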


2008-12-12 melaos

htmlRegexParserQuestions++ (apparently, there has to be one every day) – Tomalak 2008-12-12 07:09:08

If you must avoid any extra module, such as an HTML parser, you could try:

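my (@images, @iframes);
while ($string =~ m/<\s*(img|iframe)\b[^>]*?\bsrc\s*=\s*"((?:\w|\\|\?|=|-|\/|\.|:)+)"|\bbackground\s*=\s*"([^">]+\.jpg)"/gi) {
    my $tag     = defined $1 ? lc $1 : '';   # empty when the background branch matched
    my $old_src = defined $2 ? $2 : $3;      # src URL, otherwise the background URL
    my @tmp     = split(/\/Elearning/, $old_src);
    my $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
    if ($tag eq 'iframe') {
        $new_src =~ s/\?rand.*//;            # remove the ?rand suffix
        push @iframes, $new_src;
    }
    else {
        push @images, $new_src;
    }
}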

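That way, you apply the expression to the whole source (newlines included) and get more compact code. It also takes into account any extra whitespace between attributes and their values.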
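The same whole-string idea can be extended to the rewriting step the question asks for. The following is only a minimal sketch using core Perl (File::Basename ships with Perl). The helper name `rewrite_links`, the list of tags it looks at, and the `./images` and `./iframes` targets are assumptions made for illustration, and it only understands double-quoted attribute values.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: rewrite src/background URLs to ./<folder>/<file name>.
# Returns the new HTML and an array ref of [ folder, original path ] pairs.
sub rewrite_links {
    my ($html) = @_;
    my @found;

    # Outer pass: visit each relevant opening tag anywhere in the string.
    # [^>]* also matches newlines, so line breaks inside a tag do not matter.
    $html =~ s{<\s*(img|iframe|td|th|table|body)\b([^>]*)>}{
        my ($tag, $attrs) = ($1, $2);
        my $folder = lc($tag) eq 'iframe' ? 'iframes' : 'images';

        # Inner pass: rewrite every src="..." or background="..." in this tag.
        $attrs =~ s{\b(src|background)(\s*=\s*")([^"]+)"}{
            my ($name, $eq, $url) = ($1, $2, $3);
            (my $path = $url) =~ s/\?.*//s;    # drop a query string such as ?rand=...
            push @found, [ $folder, $path ];
            $name . $eq . "./$folder/" . basename($path) . '"';
        }gie;

        "<$tag$attrs>";
    }gie;

    return ($html, \@found);
}
```

Calling `my ($new_html, $found) = rewrite_links($string);` gives back the rewritten text, and each pair in `@$found` names the target folder and the original path, which could then be handed to `copy` from the core File::Copy module once the path has been mapped to its location on disk.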


2008-12-12 07:10:20 VonC

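There are great HTML parsers for Perl; learn to use them and stick with them. HTML is complex: it allows > inside attribute values, makes heavy use of nesting, and so on. Parsing it with regexes is error-prone for anything beyond very simple tasks (or machine-generated markup).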
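As a rough illustration of the parser route, here is a small sketch built on HTML::Parser. Note that HTML::Parser is a CPAN module rather than part of core Perl, so it may need to be installed first. The function name `extract_links` and the choice of which tags and attributes to collect are assumptions for this example, not something stated in the thread.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect image and iframe URLs from an HTML string.
sub extract_links {
    my ($html) = @_;
    my (@images, @iframes);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # The handler runs for every start tag and receives the lower-cased
        # tag name plus a hash reference of the tag's attributes.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'iframe') {
                    push @iframes, $attr->{src} if defined $attr->{src};
                }
                else {
                    push @images, $attr->{src}
                        if $tag eq 'img' && defined $attr->{src};
                    push @images, $attr->{background}
                        if defined $attr->{background};
                }
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # tell the parser there is no more input

    return (\@images, \@iframes);
}
```

Because the parser does the tokenizing, a > inside a quoted attribute value or a line break in the middle of a tag does not need any special handling in the calling code.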


2008-12-12 06:22:46 PhiLho

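Hi, I'm using mod_perl and we run on Unix. I need management approval to add a module, so I was hoping to find a plain Perl way to do it, or a module that comes with mod_perl by default. Thanks – melaos 2008-12-12 06:25:03

Well, you can always look at the source of the modules. As for management, you can tell them that someone has already done this correctly, and that by using an existing, correct solution they save time and money and can move on to the next problem. – 2008-12-12 07:49:12

Makes sense. I'd rather use a tested and proven method than another one of my horrible hacks... which I hope my pointy-haired boss gets the blame for. – melaos 2008-12-12 08:10:28

I think you want my HTML::SimpleLinkExtor module: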

 
use HTML::SimpleLinkExtor; 

my $extor = HTML::SimpleLinkExtor->new; 
$extor->parse_file($file); 

my @imgs = $extor->img;

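I'm not sure exactly what you're trying to do, but it certainly sounds like one of the HTML parsing modules should do the trick if mine doesn't.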


2008-12-12 07:43:09
