2012-11-02 105 views
0

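Grabbing HTML from a page that blocks cURL

I have been asked to grab a particular line from a page, but it looks like the site blocks cURL requests?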

The site in question is http://www.habbo.com/home/Intricat

I tried changing the user agent to see whether that was what they were blocking, but it didn't do the trick.

The code I am using is as follows:

<?php 

$curl_handle=curl_init(); 
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0"); 
//This is the URL you would like the content grabbed from 
curl_setopt($curl_handle,CURLOPT_URL,'http://www.habbo.com/home/Intricat'); 
//This is the amount of time in seconds until it times out, this is useful if the server you are requesting data from is down. This way you can offer a "sorry page" 
curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,2); 

curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1); 
$buffer = curl_exec($curl_handle); 
//This Keeps everything running smoothly 
curl_close($curl_handle); 

// Change the message below as you wish; keep the message inside the double quotes. 
if (empty($buffer)) 
{ 
    print "Sorry, It seems our weather resources are currently unavailable, please check back later."; 
} 
else 
{ 
    print $buffer; 
} 
?> 

Any ideas on another way I can grab the line of code from that page if they have blocked cURL requests?

Edit: when I run curl -i from my server, it shows that the site sets a cookie first?

+0

Try using a proxy and setting a referrer – Waygood

+0

*"our weather resources"*? I'm sure you mean habbo.com's weather resources, right? – hakre

+0

That's just code from a random site. Ignore that part :P – Tenatious

Answers

1

You are not very specific about the kind of block you are talking about.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> 
<html> 
<head> 
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
    <meta http-equiv="Content-Script-Type" content="text/javascript"> 
    <script type="text/javascript">function setCookie(c_name, value, expiredays) { 
     var exdate = new Date(); 
     exdate.setDate(exdate.getDate() + expiredays); 
     document.cookie = c_name + "=" + escape(value) + ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/"; 
    } 
    function getHostUri() { 
     var loc = document.location; 
     return loc.toString(); 
    } 
    setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10); 
    setCookie('DOAReferrer', document.referrer, 10); 
    location.href = getHostUri();</script> 
</head> 
<body> 
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your 
    browser. 
</noscript> 
</body> 
</html> 

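The site in question, http://www.habbo.com/home/Intricat, first of all checks whether the browser has JavaScript enabled (that is the page shown above). Since curl has no JavaScript support, you either need to use an HTTP client that does, or you need to mimic that script yourself: create the cookie and issue the new request to the URI on your own.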
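A minimal sketch of the "mimic the script" route, assuming the challenge page keeps the shape shown above: literal `setCookie('name', 'value', days)` calls followed by a reload of the same URL. The function names `extract_js_cookies`, `build_cookie_header` and `fetch_with_js_cookies` are hypothetical helpers, and `rawurlencode` only approximates JavaScript's `escape()`. Sending `DOAReferrer` as an empty value is also an assumption (it stands in for an empty `document.referrer`).

```php
<?php
// Hypothetical helper: collect the literal setCookie('name', 'value', n) calls
// from the challenge page. Calls whose value is an expression (for example
// document.referrer) are deliberately not matched.
function extract_js_cookies($html)
{
    $cookies = array();
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']*)'\s*,/";
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Turn name => value pairs into the value of a Cookie request header.
function build_cookie_header(array $cookies)
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return implode('; ', $pairs);
}

// Request the page once, replay the cookies the script would have set,
// then request the same URL again (the script reloads location.href).
function fetch_with_js_cookies($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');

    $first = curl_exec($ch);
    if ($first === false) {
        curl_close($ch);
        return false;
    }

    $cookies = extract_js_cookies($first);
    if (empty($cookies)) {
        curl_close($ch);
        return $first; // no challenge found, this is already the real page
    }

    $cookies['DOAReferrer'] = ''; // assumption: empty referrer on a direct visit
    curl_setopt($ch, CURLOPT_COOKIE, build_cookie_header($cookies));
    $second = curl_exec($ch);
    curl_close($ch);
    return $second;
}
```

Calling `fetch_with_js_cookies('http://www.habbo.com/home/Intricat')` would then return the second response, or `false` on a transport error. Whether the site accepts it depends on checks that are not visible in the page above.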

+0

How would I go about mimicking this? – Tenatious

+1

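You mimic it by reading the JavaScript code and understanding what it does. You then translate that knowledge into PHP code and into curl request configuration. In other words, you do the work the JavaScript would do in the browser, only in PHP instead of JavaScript and in a way that is compatible with curl. You might need to parse the HTML and the JavaScript. For HTML parsing I strongly suggest PHP's `DOMDocument`. The first lesson here is to extract the '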
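As a rough illustration of the `DOMDocument` suggestion, here is a sketch that pulls the inline script text out of a fetched page so it can be inspected. That inline scripts are the thing to extract is an assumption, since the comment above is cut off, and `extract_inline_scripts` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the text of every non-empty inline <script>
// element in an HTML string, using PHP's DOMDocument.
function extract_inline_scripts($html)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = array();
    foreach ($doc->getElementsByTagName('script') as $node) {
        $code = trim($node->textContent);
        if ($code !== '') {
            $scripts[] = $code;
        }
    }
    return $scripts;
}
```

The returned strings are plain JavaScript source. Working out which cookies and which URL the script produces is still up to you.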

1

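Use a browser and copy the exact headers it sends. Because the request then looks exactly the same, the site will not be able to tell that you are using curl. If cookies are used, attach them as headers.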
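A sketch of what that could look like in PHP, assuming you have copied the request headers from your browser's network inspector. The header values below are placeholders rather than values known to work for this site, and `build_header_lines` and `fetch_like_browser` are hypothetical helper names.

```php
<?php
// Hypothetical helper: turn name => value pairs into the "Name: value"
// lines that CURLOPT_HTTPHEADER expects.
function build_header_lines(array $headers)
{
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Send a GET request carrying the copied browser headers.
function fetch_like_browser($url, array $headers)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, build_header_lines($headers));
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Placeholder values: replace them with what your own browser sends.
$browser_headers = array(
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Cookie'          => 'name=value',
);
```

With these in place, `fetch_like_browser('http://www.habbo.com/home/Intricat', $browser_headers)` sends the request. If you also copy an `Accept-Encoding` header, the response may come back compressed, so it is simpler to leave that one out.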

+0

Could you explain this in a bit more detail for me? – Tenatious

1

This is a cut and paste from my Curl class, which I wrote a good few years ago. Hopefully you can pick some gems out of it for yourself.

function get_url($url) 
{ 
    curl_setopt ($this->ch, CURLOPT_URL, $url); 
    curl_setopt ($this->ch, CURLOPT_USERAGENT, $this->user_agent); 
    curl_setopt ($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name); 
    curl_setopt ($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name); 
    if(!is_null($this->referer)) 
    { 
     curl_setopt ($this->ch, CURLOPT_REFERER, $this->referer); 
    } 
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYHOST, 2); 
    curl_setopt ($this->ch, CURLOPT_HEADER, 0); 
    if($this->follow) 
    { 
     curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 1); 
    } 
    else 
    { 
     curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 0); 
    } 
    curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt ($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*.*")); 
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE); // skips certificate verification (insecure); this line makes it work under https 

    $try=0; 
    $result=""; 
    while(($try<=$this->retry_attempts) && (empty($result))) // retry while the result is empty, at most retry_attempts + 1 attempts 
    { 
     $try++; 
     $result = curl_exec($this->ch); 
     $this->response=curl_getinfo($this->ch); 
     // $this->response['http_code'] 4xx is an error 
    } 
    // set referring URL to current url for next page. 
    if($this->referer_to_last) $this->set_referer($url); 

    return $result; 
} 
+0

$cookie_name = "./cookie"; make sure your script has write permission for the directory you choose – Waygood

+0

Fatal error: Using $this when not in object context – Tenatious

+1

_cut and paste from my Curl class_ – Waygood
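The fatal error comes from calling `get_url()` as a plain function: `$this` only exists inside a method of an instantiated class. Below is a minimal sketch of a class the method could live in. The class name `SimpleCurl`, the constructor arguments and the default values are assumptions for illustration, not Waygood's actual class, and the method is trimmed to the essentials.

```php
<?php
// Hypothetical wrapper class; names and defaults are illustrative only.
class SimpleCurl
{
    public $ch = null;
    public $user_agent;
    public $cookie_name;
    public $referer = null;
    public $follow = true;
    public $retry_attempts = 5;
    public $response = array();

    public function __construct($cookie_name = './cookie', $user_agent = 'Mozilla/5.0')
    {
        // The cookie file must live in a directory the script can write to.
        $this->cookie_name = $cookie_name;
        $this->user_agent = $user_agent;
    }

    public function set_referer($url)
    {
        $this->referer = $url;
    }

    public function get_url($url)
    {
        if ($this->ch === null) {
            $this->ch = curl_init(); // create the handle on first use
        }
        curl_setopt_array($this->ch, array(
            CURLOPT_URL            => $url,
            CURLOPT_USERAGENT      => $this->user_agent,
            CURLOPT_COOKIEFILE     => $this->cookie_name, // read cookies from here
            CURLOPT_COOKIEJAR      => $this->cookie_name, // and save them back
            CURLOPT_FOLLOWLOCATION => $this->follow,
            CURLOPT_RETURNTRANSFER => true,
        ));
        if ($this->referer !== null) {
            curl_setopt($this->ch, CURLOPT_REFERER, $this->referer);
        }

        // Retry while the result is empty, as in the original method.
        $try = 0;
        $result = '';
        while ($try <= $this->retry_attempts && empty($result)) {
            $try++;
            $result = curl_exec($this->ch);
            $this->response = curl_getinfo($this->ch);
        }

        $this->set_referer($url); // the next request appears to come from this page
        return $result;
    }
}
```

Usage would then be along the lines of `$curl = new SimpleCurl('./cookie'); echo $curl->get_url('http://www.habbo.com/home/Intricat');`.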

0

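I know this is a very old post, but since I had to answer the same question for myself today, I am sharing it here for anyone who might find it useful. I am also fully aware that the OP asked specifically about curl, but, like me, some people may be interested in a solution whether or not it uses curl.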

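The page I wanted to fetch with curl was blocking it. If the block is not because of JavaScript but because of the user agent (which was my case, and setting the agent in curl did not help), then wget may be a solution: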

wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"