2009-06-13

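Confused by mail.google.com, cURL, and http://validator.w3.org/checklink

I'm building a basic link checker using cURL. My application has a function called getHeaders() that returns an array of HTTP header information: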

function getHeaders($url) { 

    if(function_exists('curl_init')) { 
     // create a new cURL resource 
     $ch = curl_init(); 
     // set URL and other appropriate options 
     $options = array(
      CURLOPT_URL => $url, 
      CURLOPT_HEADER => true, 
      CURLOPT_NOBODY => true, 
      CURLOPT_FOLLOWLOCATION => 1, 
      CURLOPT_RETURNTRANSFER => true); 
     curl_setopt_array($ch, $options); 
     // perform the request; with CURLOPT_NOBODY set this is a HEAD request
     curl_exec($ch);
     // collect transfer details (status code, redirect count, timings)
     $headers = curl_getinfo($ch);
     // close cURL resource, and free up system resources
     curl_close($ch);
    } else {
     echo "Error: cURL is not installed on the web server. Unable to continue.";
     return false;
    }
    return $headers;
}

print_r(getHeaders('mail.google.com'));

This produces the following result:

Array 
(
    [url] => http://mail.google.com 
    [content_type] => text/html; charset=UTF-8 
    [http_code] => 404 
    [header_size] => 338 
    [request_size] => 55 
    [filetime] => -1 
    [ssl_verify_result] => 0 
    [redirect_count] => 0 
    [total_time] => 0.128 
    [namelookup_time] => 0.042 
    [connect_time] => 0.095 
    [pretransfer_time] => 0.097 
    [size_upload] => 0 
    [size_download] => 0 
    [speed_download] => 0 
    [speed_upload] => 0 
    [download_content_length] => 0 
    [upload_content_length] => 0 
    [starttransfer_time] => 0.128 
    [redirect_time] => 0 
)

I have tested it with several long links, and the function recognizes redirects, except, it seems, for mail.google.com.

For fun, I passed the same URL (mail.google.com) to the W3C link checker, which produced:

Results 

Links 

Valid links! 

List of redirects 

The links below are not broken, but the document does not use the exact URL, and the links were redirected. It may be a good idea to link to the final location, for the sake of speed. 

warning Line: 1 http://mail.google.com/mail/ redirected to 

https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2 

Status: 302 -> 200 OK 

This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is. 

Anchors 

Found 0 anchors. 

Checked 1 document in 4.50 seconds.

This is correct, because the address above is where I am redirected when I enter mail.google.com in my browser.

Which cURL options do I need to use so that my function returns 200 for mail.google.com?

Why does the function above return a 404 status code rather than a 302 status code?

TIA

Answers

Could it be that:

mail.google.com -> mail.google.com/mail is a 404 and then a hard redirect 

mail.google.com/mail -> https://www.google.com/accounts... etc is a 302 redirect 

The problem is that the redirect is specified through mechanisms that cURL does not follow.

Here is the response from http://mail.google.com:

HTTP/1.1 200 OK 
Cache-Control: public, max-age=604800 
Expires: Mon, 22 Jun 2009 14:58:18 GMT 
Date: Mon, 15 Jun 2009 14:58:18 GMT 
Refresh: 0;URL=http://mail.google.com/mail/ 
Content-Type: text/html; charset=ISO-8859-1 
X-Content-Type-Options: nosniff 
Transfer-Encoding: chunked 
Server: GFE/1.3 

<html> 
<head> 
    <meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" /> 
</head> 
<body> 
    <script type="text/javascript" language="javascript"> 
    <!-- 
    location.replace("http://mail.google.com/mail/") 
    --> 
    </script> 
</body> 
</html> 

As you can see, the page uses both a Refresh header (along with its HTML meta equivalent) and JavaScript in the body to change the location to http://mail.google.com/mail/.
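To illustrate the meta-tag half of that, here is a minimal sketch that pulls the target URL out of a `<meta http-equiv="Refresh">` tag. The function name `findMetaRefreshUrl` is made up for this example, and the regular expressions only cover quoted `content` attributes like the one in the response above, so treat it as a starting point rather than a full HTML parser.

```php
<?php
// Hypothetical helper: return the URL named in a <meta http-equiv="Refresh">
// tag, or null when the document has none.
function findMetaRefreshUrl(string $html): ?string
{
    // Collect every <meta ...> tag; attribute order inside a tag may vary.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Only consider tags that declare http-equiv="refresh" (any case).
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh["\']?/i', $tag)) {
            continue;
        }
        // The content attribute looks like "0;URL=http://example.com/".
        if (preg_match('/content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\']+)["\']/i', $tag, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}

$html = '<html><head>'
      . '<meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" />'
      . '</head><body></body></html>';

var_dump(findMetaRefreshUrl($html));
```

Note that the body is only available if the request is made without CURLOPT_NOBODY, so a HEAD-style check can see the Refresh header but not the meta tag or the JavaScript.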

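If you then request http://mail.google.com/mail/, you are redirected (this time via a Location header, which cURL follows when CURLOPT_FOLLOWLOCATION is set) to the page that the W3C checker correctly identified.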

HTTP/1.1 302 Moved Temporarily 
Cache-Control: no-cache, no-store, max-age=0, must-revalidate 
Pragma: no-cache 
Expires: Fri, 01 Jan 1990 00:00:00 GMT 
Date: Mon, 15 Jun 2009 15:07:56 GMT 
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2 
Content-Type: text/html; charset=UTF-8 
X-Content-Type-Options: nosniff 
Transfer-Encoding: chunked 
Server: GFE/1.3 

HTTP/1.1 200 OK 
Content-Type: text/html; charset=UTF-8 
Cache-control: no-cache, no-store 
Pragma: no-cache 
Expires: Mon, 01-Jan-1990 00:00:00 GMT 
Set-Cookie: GALX=B8zH60M78Ys;Path=/accounts;Secure 
Date: Mon, 15 Jun 2009 15:07:56 GMT 
X-Content-Type-Options: nosniff 
Content-Length: 19939 
Server: GFE/2.0 

(HTML page content here, removed) 
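When CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION are both enabled, the returned string contains one header block per hop, as in the 302 and 200 blocks above. A link checker may want the status code of each hop rather than only the last one. The sketch below is one way to do that; `statusCodesPerHop` is a hypothetical name and the sample string is an abbreviated stand-in for the real response.

```php
<?php
// Hypothetical helper: list the status code of each hop found in the header
// portion of a cURL response (CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION on).
function statusCodesPerHop(string $rawHeaders): array
{
    $codes = [];
    // Every hop starts with a status line such as "HTTP/1.1 302 Moved Temporarily".
    if (preg_match_all('/^HTTP\/\d+(?:\.\d+)?\s+(\d{3})/m', $rawHeaders, $m)) {
        $codes = array_map('intval', $m[1]);
    }
    return $codes;
}

// Abbreviated sample: two header blocks, one per hop.
$sample = "HTTP/1.1 302 Moved Temporarily\r\n"
        . "Location: https://www.google.com/accounts/ServiceLogin\r\n"
        . "\r\n"
        . "HTTP/1.1 200 OK\r\n"
        . "Content-Type: text/html; charset=UTF-8\r\n"
        . "\r\n";

print_r(statusCodesPerHop($sample));
```

To avoid matching text inside the page body, pass only the header portion, for example `substr($res, 0, curl_getinfo($ch, CURLINFO_HEADER_SIZE))` taken before the handle is closed.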

Perhaps you should add an extra step to your script that checks for a Refresh header.
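A rough sketch of such a step follows. Both function names are invented for this example, and `$fetch` is an assumed callable that takes a URL and returns the raw response headers as a string (for instance a thin wrapper around `curl_exec` with CURLOPT_HEADER set). Passing the fetcher in keeps the loop easy to try out without touching the network.

```php
<?php
// Hypothetical helper: return the URL named in a "Refresh:" response header,
// or null when the header is absent.
function findRefreshHeaderUrl(string $rawHeaders): ?string
{
    // Matches e.g. "Refresh: 0;URL=http://mail.google.com/mail/"
    if (preg_match('/^Refresh:\s*\d+\s*;\s*url\s*=\s*(\S+)/mi', $rawHeaders, $m)) {
        return trim($m[1], "\"'");
    }
    return null;
}

// Follow Refresh headers by hand, up to $maxHops times, and return the last
// URL reached. $fetch is an assumed callable: string $url -> raw headers.
function resolveRefreshChain(string $url, callable $fetch, int $maxHops = 5): string
{
    for ($hop = 0; $hop < $maxHops; $hop++) {
        $next = findRefreshHeaderUrl($fetch($url));
        if ($next === null || $next === $url) {
            break; // no further Refresh header, or it points at itself
        }
        $url = $next;
    }
    return $url;
}

// Stand-in fetcher that imitates the first response shown above.
$fakeFetch = function (string $url): string {
    return $url === 'http://mail.google.com'
        ? "HTTP/1.1 200 OK\r\nRefresh: 0;URL=http://mail.google.com/mail/\r\n\r\n"
        : "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
};

echo resolveRefreshChain('http://mail.google.com', $fakeFetch), "\n";
```

The sketch assumes absolute URLs in the Refresh header; a relative target would need to be resolved against the current URL before the next request.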

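Another possible cause is that open_basedir is set in your PHP configuration, which disables CURLOPT_FOLLOWLOCATION. You can check this quickly by turning on error reporting, since a warning or notice is generated when the option is rejected.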
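One way to run that check is sketched below. Whether the restriction actually applies depends on the PHP build in use, so the code inspects the setting and the return value of `curl_setopt` instead of assuming either way. `openBasedirIsSet` is a hypothetical helper name.

```php
<?php
// Show warnings and notices while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: true when an open_basedir restriction is configured.
function openBasedirIsSet(): bool
{
    $value = ini_get('open_basedir');
    return is_string($value) && $value !== '';
}

if (openBasedirIsSet()) {
    echo "open_basedir is set to: " . ini_get('open_basedir') . "\n";
} else {
    echo "open_basedir is not set\n";
}

if (function_exists('curl_init')) {
    $ch = curl_init();
    // curl_setopt() returns false when an option cannot be applied.
    if (!curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true)) {
        echo "CURLOPT_FOLLOWLOCATION could not be enabled\n";
    }
}
```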

$useragent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"; 
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_AUTOREFERER, 1); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($ch, CURLOPT_HEADER, 1); 
curl_setopt($ch, CURLOPT_USERAGENT, $useragent); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

$res = curl_exec($ch); 

curl_close($ch); 

All of the responses shown earlier were obtained with the cURL setup above.