2016-09-18 76 views
2

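Handling GET errors in WWW::Mechanize

I am using a script that scrapes data from a website with WWW::Mechanize. The script works well; the problem is the website itself. Sometimes it simply stops responding for a moment, and for a given my $url = 'http://www.somesite.com/more/url/text' the call $mech->get($url) fails with this error: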

Error GETing http://www.somesite.com/more/url/text: Can't connect to www.somesite.com:443 at ./trackSomesite.pl line 34. 

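The error shows up from time to time with no identifiable pattern. From my experience with this website, the cause is an unstable server.

I want to be able to tell explicitly that this error occurred, as opposed to some other error such as Too many requests. My question is: how do I get my script to handle this error instead of dying?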

Answers

3

Wrap your $mech->get(...) request in an eval block, or use autocheck => 0, and then check the $mech->status code and/or $mech->status_line to decide what to do.

Here is an example:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;

use constant RETRY_MAX => 5;

my $url = 'http://www.xxsomesite.com/more/url/text'; # Cannot connect

# autocheck => 0 makes get() return the failed response instead of dying
my $mech = WWW::Mechanize->new(autocheck => 0);

my $content = fetch($url);

sub fetch {
    my ($url) = @_;

    for my $retry (0 .. RETRY_MAX - 1) {
        my $message = "Attempting to fetch [ $url ]";
        $message .= $retry ? " - retry $retry\n" : "\n";
        warn $message;

        my $response = $mech->get($url);
        return $response->content() if $response->is_success();

        # HTTP::Response exposes the numeric status as code()
        my $status = $response->code;
        warn "status = $status\n";

        if ($response->status_line =~ /Can['']t connect/) {
            # Linear backoff: wait 1, 2, 3, ... seconds between attempts
            my $delay = $retry + 1;
            warn "cannot connect...will retry after $delay seconds\n";
            sleep $delay;
        } elsif ($status == 429) {
            warn "too many requests...ignoring\n";
            return;
        } else {
            warn "something else...\n";
            return;
        }
    }

    warn "giving up...\n";
    return;
}

Output

Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] 
status = 500 
cannot connect...will retry after 1 seconds 
Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 1 
status = 500 
cannot connect...will retry after 2 seconds 
Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 2 
status = 500 
cannot connect...will retry after 3 seconds 
Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 3 
status = 500 
cannot connect...will retry after 4 seconds 
Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 4 
status = 500 
cannot connect...will retry after 5 seconds 
giving up... 
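
The example above takes the autocheck => 0 route. The eval-based alternative mentioned at the start of this answer might look roughly like the sketch below. The helper names classify_failure and try_fetch are hypothetical, not part of WWW::Mechanize, and the sketch assumes that with autocheck left on, get() dies with a message containing the server's status text and that $mech->status is already populated by the time it dies.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: turn a status code and an error message into a label.
sub classify_failure {
    my ($code, $message) = @_;
    return 'cannot_connect'    if defined $message && $message =~ /Can't connect/;
    return 'too_many_requests' if defined $code    && $code == 429;
    return 'other';
}

# Hypothetical helper: returns (content, undef) on success,
# or (undef, label) when get() died.
sub try_fetch {
    my ($mech, $url) = @_;

    # With autocheck on, get() dies on error; eval traps that.
    my $response = eval { $mech->get($url) };
    my $error    = $@;    # copy right away, before anything can reset $@

    return ($response->content(), undef) if $response;
    return (undef, classify_failure($mech->status, $error));
}
```

A caller could then write my ($content, $why) = try_fetch(WWW::Mechanize->new, $url); and retry only when $why is 'cannot_connect', which keeps the decision about what to do separate from the request itself.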
+0

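I don't know why you're using two apostrophes in the character class `['']`. The second one is redundant, and the apostrophe isn't a special character in a regex pattern. Plain `/Can't connect/` is fine. – Borodin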

+0

I'm sure @ikegami can answer that question, since it was his edit. – xxfelixxx

+0

It was just to fix the syntax highlighting. – ikegami