
For the past two days I've been hacking away at a small pet project, which involves writing a crawler in Perl. How can I make my Perl web crawler faster?

I have no real experience with Perl (only what I've learned over the last two days). My scripts are as follows:

ACTC.pm:

#!/usr/bin/perl 
use strict; 
use URI; 
use URI::http; 
use File::Basename; 
use DBI; 
use HTML::Parser; 
use LWP::UserAgent; 
my $ua = LWP::UserAgent->new; 
$ua->timeout(10); 
$ua->env_proxy; 
$ua->max_redirect(0); 


package Crawler; 
sub new { 
    my $class = shift; 
    my $self = { 
     _url => shift, 
     _max_link => 0, 
     _local => 1, 
    }; 
    bless $self, $class; 
    return $self; 

} 
sub trim{ 
    my($self, $string) = @_; 
    $string =~ s/^\s+//; 
    $string =~ s/\s+$//; 
    return $string; 
} 
sub process_image { 
    my ($self, $process_image) = @_; 
    $self->{_process_image} = $process_image; 
} 
sub local { 
    my ($self, $local) = @_; 
    $self->{_local} = $local; 
} 
sub max_link { 
    my ($self, $max_link) = @_; 
    $self->{_max_link} = $max_link; 
} 
sub x_more { 
    my ($self, $x_more) = @_; 
    $self->{_x_more} = $x_more; 
} 
sub resolve_href { 
    my ($self, $base, $href) = @_; 
    my $u = URI->new_abs($href, $base); 
    return $u->canonical; 
} 
sub write { 
    my ($self, $ref, @data) = @_; 
    # dump every entry of @data, one per line, to the state file for this run 
    open my $fh, '>', 'c:/perlscripts/' . $ref . '_' . $self->{_process_image} . '.txt' 
     or die "Cannot write $ref file: $!"; 
    foreach my $line (@data) { 
     print $fh $self->trim($line) . "\n"; 
    } 
    close($fh); 
} 
sub scrape { 
    my (@m_error_array, @m_href_array, @href_array, $dbh, $query, $result, $array); 
    my ($self, $DBhost, $DBuser, $DBpass, $DBname) = @_; 
    if(defined($self->{_process_image}) && (-e 'c:/perlscripts/href_w_' . $self->{_process_image} . ".txt")) { 
     # resume from the saved state files for this run 
     open my $error_w, '<', "c:/perlscripts/error_w_" . $self->{_process_image} . ".txt" or die "Cannot read error file: $!"; 
     open my $m_href_w, '<', "c:/perlscripts/m_href_w_" . $self->{_process_image} . ".txt" or die "Cannot read m_href file: $!"; 
     open my $href_w, '<', "c:/perlscripts/href_w_" . $self->{_process_image} . ".txt" or die "Cannot read href file: $!"; 
     chomp(@m_error_array = <$error_w>); 
     chomp(@m_href_array = <$m_href_w>); 
     chomp(@href_array = <$href_w>); 
     close($error_w); 
     close($m_href_w); 
     close($href_w); 
    }else{ 
     @href_array = ($self->{_url}); 
    } 
    my $z = 0; 
    while(@href_array){ 
     if(defined($self->{_x_more}) && $z == $self->{_x_more}) { 
      print "died"; 
      last; 
     } 
     my $href = shift(@href_array); 
     if(defined($self->{_process_image}) && scalar @href_array != 0) { 
      $self->write('m_href_w', @m_href_array); 
      $self->write('href_w', @href_array); 
      $self->write('error_w', @m_error_array); 
     } 
     $self->{_link_count} = scalar @m_href_array; 
     my $info = URI::http->new($href); 
     if(! defined($info->host)) { 
      push(@m_error_array, $href); 
     }else{ 
      my $host = $info->host; 
      $host =~ s/^www\.//; 
      $self->{_current_page} = $href; 
      my $redirect_limit = 10; 
      my $y = 0; 
      my($response, $responseCode); 
      # follow up to $redirect_limit redirects by hand (max_redirect is 0 on the UA) 
      while($y <= $redirect_limit) { 
       $response = $ua->get($href); 
       $responseCode = $response->code; 
       if($responseCode == 200 || $responseCode == 301 || $responseCode == 302) { 
        if($responseCode == 301 || $responseCode == 302) { 
         $href = $self->resolve_href($href, $response->header('Location')); 
        }else{ 
         last; 
        } 
       }else{ 
        last; 
       } 
       $y++; 
      } 
      if($y != $redirect_limit && $responseCode == 200) { 
       print $href . "\n"; 
       # record the fetched URL on the object as an array reference 
       push @{ $self->{_url_list} ||= [] }, $href; 

       # note: this opens a new MySQL connection for every URL that is processed 
       my $DNS = "dbi:mysql:$DBname:$DBhost:3306"; 
       $dbh = DBI->connect($DNS, $DBuser, $DBpass) or die $DBI::errstr; 

       # bind the URL as a placeholder so quotes in it cannot break the statement 
       $result = $dbh->prepare("INSERT INTO `" . $host . "` (URL) VALUES (?)"); 
       if(! $result->execute($href)){ 
        $result = $dbh->prepare("CREATE TABLE `" . $host . "` (`ID` INT(255) NOT NULL AUTO_INCREMENT , `URL` VARCHAR(255) NOT NULL , PRIMARY KEY (`ID`)) ENGINE = MYISAM ;"); 
        $result->execute(); 
        print "Host added: " . $host . "\n"; 
       } 


       my $content = $response->content; 
       die "get failed: " . $href if (!defined $content); 
       my @pageLinksArray = ($content =~ m/href=["']([^"']*)["']/g); 
       foreach(@pageLinksArray) { 
        my $link = $self->trim($_); 
        if($self->{_max_link} != 0 && scalar @m_href_array > $self->{_max_link}) { 
         last; 
        } 
        my $new_href = $self->resolve_href($href, $link); 
        if($new_href =~ m/^http:\/\//) { 
         if(substr($new_href, -1) ne "#") { 
          my $base = $self->{_url}; 
          my %values_index; 
          @values_index{@m_href_array} =(); 
          if($new_href !~ m/\Q$base\E/) { 
           if($self->{_local} eq "true" && ! exists $values_index{$new_href}) { 
            push(@m_href_array, $new_href); 
            push(@href_array, $new_href); 
           } 
          }elsif($self->{_local} eq "true" && ! exists $values_index{$new_href}) { 
           push(@m_href_array, $new_href); 
           push(@href_array, $new_href); 
          } 
         } 
        } 
       }    
      }else{ 
       push(@m_error_array, $href); 
      } 
     } 
    } 
} 
1; 

new_spider.pl:

#!/usr/bin/perl 
use strict; 
use warnings; 
use ACTC; 

my ($object, $url, $uri); 
print "Starting Poing: (url): "; 
chomp($url = <>); 

$object = Crawler->new($url); 
$object->process_image('process_image_name'); 
$object->local('true'); 
$object->max_link(0); 
$object->x_more(9999999); 
$object->scrape('localhost', 'root', '', 'crawl'); 

#print $object->{_url} . "\n"; 
#print $object->{_process_image}; 

Now, it isn't finished and some features don't work properly yet, but after running the script for about an hour I had indexed roughly 1,500 pages, which I think is slow.

The script starts off tearing through results, but for a while now it has been spitting out only about one URL per second.

Can anyone offer any tips on how to improve performance?


Looks like an open-ended "optimization" question, perhaps community wiki? – pascal 2010-09-22 08:10:39


Can you be more specific? Is your CPU at 100%? Or your network connection? – pascal 2010-09-22 08:11:03


Just checked, and yes, it's at 100%. – 2010-09-22 08:19:13

Answers


Most of the time your program is probably just waiting on network responses, and there isn't much you can do about that waiting (short of moving your computer next to the machines it is talking to). Fork off a process for each URL so you can download several of them at the same time. You might look at things such as Parallel::ForkManager, POE, or AnyEvent.
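To make that concrete, here is a minimal, hedged sketch of fetching a batch of URLs in parallel with Parallel::ForkManager. The seed list from @ARGV, the worker count of 10, and the print/warn handling in the child are placeholders for illustration, not part of the original script:

#!/usr/bin/perl 
use strict; 
use warnings; 
use LWP::UserAgent; 
use Parallel::ForkManager; 

my @urls = @ARGV;                              # hypothetical: seed URLs passed on the command line 
my $pm   = Parallel::ForkManager->new(10);     # allow up to 10 downloads in flight at once 

foreach my $url (@urls) { 
    $pm->start and next;                       # parent keeps looping; the child runs the body below 
    my $ua       = LWP::UserAgent->new(timeout => 10); 
    my $response = $ua->get($url); 
    if ($response->is_success) { 
        # placeholder: hand the content to whatever stores or parses it 
        print "$url: " . length($response->decoded_content) . " bytes\n"; 
    } else { 
        warn "$url failed: " . $response->status_line . "\n"; 
    } 
    $pm->finish;                               # child exits here 
} 
$pm->wait_all_children; 

Because each child is a separate process, any links it discovers have to get back to the parent somehow, for example through Parallel::ForkManager's run_on_finish callback or through shared storage as the answer below suggests.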


See Brian's answer.

Run lots of copies of it. Use a shared storage system to hold the intermediate and final data.

It may also help to put the more memory-intensive parts of the crawler (HTML parsing and so on) into a separate set of processes.

So you'd have a pool of fetcher processes that read URLs from a queue and drop the pages into shared storage, and a pool of parser processes that read those pages, write their results into the results database, and push newly discovered URLs back onto the fetch queue.
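As a rough single-machine illustration of that fetcher/parser split, the sketch below uses Perl ithreads and Thread::Queue rather than the separate processes described above; the worker counts, the example.com seed URL, and the naive href regex are made-up placeholders, and it assumes a reasonably recent Thread::Queue that can share the enqueued array references:

#!/usr/bin/perl 
use strict; 
use warnings; 
use threads; 
use Thread::Queue; 
use LWP::UserAgent; 

my $url_queue  = Thread::Queue->new('http://example.com/');   # URLs waiting to be fetched (placeholder seed) 
my $page_queue = Thread::Queue->new();                        # fetched pages waiting to be parsed 

# fetcher pool: pull URLs off the queue, download them, hand the pages to the parsers 
my @fetchers = map { 
    threads->create(sub { 
        my $ua = LWP::UserAgent->new(timeout => 10); 
        while (defined(my $url = $url_queue->dequeue())) { 
            my $res = $ua->get($url); 
            $page_queue->enqueue([$url, $res->decoded_content]) if $res->is_success; 
        } 
    }); 
} 1 .. 5; 

# parser pool: pull pages, extract links, push newly found URLs back onto the fetch queue 
my @parsers = map { 
    threads->create(sub { 
        while (defined(my $job = $page_queue->dequeue())) { 
            my ($url, $html) = @$job; 
            next unless defined $html; 
            # naive extraction with no de-duplication; a real crawler would store results here too 
            for my $link ($html =~ m/href=["'](http[^"']+)["']/g) { 
                $url_queue->enqueue($link); 
            } 
        } 
    }); 
} 1 .. 2; 

# runs until interrupted; a real crawler would track seen URLs and call end() on the queues to shut down 
$_->join for @fetchers, @parsers; 

The same shape scales out to multiple machines by swapping the in-process queues for a shared queue or database, which is where the shared storage mentioned above comes in.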

Or some variation on that. It really depends on what your crawler is for.

Ultimately, if you are trying to crawl a very large number of pages, you will need lots of hardware and a very fat pipe (to your data centre / colo). So you want an architecture that lets the various parts of the crawler be split across multiple machines so it can scale appropriately.