2014-12-27 66 views
0

cURL scraping: error setting cookie

I am trying to scrape some data from a website with cURL. Here is my code:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois'); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent()); 
$result7 = htmlspecialchars_decode(curl_exec ($ch)); 
curl_close($ch); 

$html7 = new simple_html_dom(); 
$html7->load($result7); 

But I get the following warning:

Warning: file_get_contents(<!DOCTYPE html> <html xmlns:mml=" http://www.w3.org/1998/Math/MathML&quot ; lang="en" > <head> <script type="text/javascript"> var JiffyParams = { jsStart: (new Date()).getTime()}; </script> <meta name="robots" content="noarchive,noindex,nofollow,NOODP" /> <meta name="MSSmartTagsPreventParsing" content="true"/> <title>JSTOR: An Error Occurred Setting Your User Cookie</title> <meta charset="UTF-8"/> <link rel="shortcut icon" href="/templates/jsp/favicon.ico" type="image/vnd.microsoft.icon" /> <link rel="stylesheet" type="text/css" media="screen" href="/jawrcss/N815843185/bundles/jstor.css" /> <link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Roboto:400,5 in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

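I don't understand what I am doing wrong; I am a beginner with cURL. Maybe I have to set some cookies from JSTOR, but I don't know how to do that. Thanks for your help.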

Edit:

I just added this, and the error changed:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
$result7 = htmlspecialchars_decode(curl_exec ($ch));
curl_close($ch);

Error:

Warning: file_get_contents(<!DOCTYPE html> <!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> <head> <script type="text/javascript">(window.NREUM||(NREUM={}))loader_config={xpid:"VwACUF9VGwsGXVRbAwA="};window.NREUM||(NREUM={}),function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o?o:e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({function(t,e){function n(t){function e(e,n,a){t& t(e,n,a),a||(a={});for(var c=s(e),f=c.length,u=i(a,o,r),d=0;f>d;d++)c[d].apply(u,n);return u}function a(t,e){f[t]=s(t).concat(e)}function s(t){return f[t]||[]}function c(){return n(e)}var f={};return{on:a,emit:e,create:c,listeners:s,_events: in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

I am adding the piece of code from simple_html_dom around line 76:

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 
{ 
    // We DO force the tags to be terminated. 
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done. 
    $contents = file_get_contents($url, $use_include_path, $context, $offset); 
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout. 
    //$contents = retrieve_url_contents($url); 
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) 
    { 
     return false; 
    } 
    // The second parameter can force the selectors to all be lowercase. 
    $dom->load($contents, $lowercase, $stripRN); 
    return $dom; 
} 

Answers

0

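Are you sure file_get_html() is the right way to do this? That function calls file_get_contents(), which expects a URI to open, but you are passing it a string (containing your HTML data).

I think str_get_html() from PHP Simple HTML DOM would be the right approach.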
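
The difference can be sketched roughly as follows. This is only a minimal illustration: `$markup` is a stand-in for the string returned by curl_exec(), and the location of simple_html_dom.php (next to the script) is an assumption.

```php
<?php
// Stand-in for the HTML string returned by curl_exec() in the question.
$markup = '<!DOCTYPE html><html><head><title>Example</title></head><body></body></html>';

// file_get_contents() treats its first argument as a path or URL.
// Markup is neither, so the call fails (the @ hides the resulting warning).
$asPath = @file_get_contents($markup);
var_dump($asPath); // bool(false)

// Assumption: the library file lives in the same directory as this script.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    require_once $library;

    // str_get_html() parses a string that is already in memory.
    $dom = str_get_html($markup);
    if ($dom !== false) {
        echo $dom->find('title', 0)->plaintext, PHP_EOL;
    }
}
```

In the question's code, that would mean passing `$result7` to str_get_html() rather than to file_get_html().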

+0

Thanks, it works! ;) – AlphaNico 2014-12-27 23:00:45

0

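Cookies are a browser thing.

curl is a system thing (bash, Linux, or whatever).

PHP wraps curl (sometimes the library is actually compiled in). It is more or less a system call (no browser involved).

Therefore, you need to set the cookies with curl: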

http://curl.haxx.se/docs/http-cookies.html

But you are right -
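
For illustration, cookie handling in PHP's cURL binding could look like the sketch below. The helper names `cookie_curl_options()` and `fetch_with_cookies()`, the user agent string, and the cookie file name are all assumptions made for this example; only the CURLOPT_* constants come from the cURL extension itself. Whether JSTOR then accepts the request is not something this sketch can guarantee.

```php
<?php
/**
 * Build cURL options that store and resend cookies through one file.
 * Hypothetical helper for illustration; not part of cURL or the question.
 */
function cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Placeholder user agent; the question uses its own random_user_agent().
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
        // Cookies are read from this file when the transfer starts ...
        CURLOPT_COOKIEFILE     => $cookieFile,
        // ... and written back to it when the handle is closed or freed.
        CURLOPT_COOKIEJAR      => $cookieFile,
    ];
}

/** Hypothetical helper: fetch a page with cookies enabled; returns the body or false. */
function fetch_with_cookies(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_curl_options($url, $cookieFile));
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $body;
}

// An absolute path avoids depending on the current working directory.
$cookieFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'jstor_cookies.txt';
$searchUrl  = 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois';
$options    = cookie_curl_options($searchUrl, $cookieFile);

// Usage (performs a real HTTP request, so it is left commented out here):
// $html = fetch_with_cookies($searchUrl, $cookieFile);
```

Pointing both CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR at the same file means cookies set during a redirect are sent on the next hop and are still available on later runs.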

+0

Thanks. How can I get the cookies from JSTOR to set them in cURL? Can I use CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE afterwards? – AlphaNico 2014-12-27 20:43:01

+0

Why do you need this: "CURLOPT_FOLLOWLOCATION"? Maybe it is a cookie thing - more to follow. Your code worked for me just fine. However, I did not use new simple_html_dom(). I set the user agent. – terary 2014-12-27 20:44:48

+0

I updated my question. PS: I thought that was the initial problem, but when I remove the follow-location option, nothing changes. – AlphaNico 2014-12-27 20:50:27