Grabbing resource contents in CasperJS or PhantomJS

2

I did not realize I could grab the source from the document object like this:

casper.start(url, function() { 
    var js = this.evaluate(function() { 
     return document; 
    }); 
    this.echo(js.all[0].outerHTML); 
});

More info.

2012-10-24 17:07:59 iwek

1

You can use Casper.debugHTML() to print out the contents of an HTML resource:

var casper = require('casper').create(); 

casper.start('http://google.com/', function() { 
    this.debugHTML(); 
}); 

casper.run();

You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in the latest master)

2012-07-18 05:05:58 NiKo

+1

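Thanks NiKo. I guess I was not clear, but I am looking for all the other resources, not the HTML page. I would like to store external CSS or JS files in a var; is it possible to get the contents of those resources? – iwek 2012-07-18 12:33:41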

+0

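Just make sure you set the right protocol (i.e. http vs https). It took me a while to figure out that the site I was trying to open over http was being redirected to https, and that choked CasperJS (a bug?) – abbood 2013-04-09 15:25:38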

+0

@iwek please see this link for more on how to save resources to disk: http://stackoverflow.com/questions/24582307/how-to-save-the-current-webpage-with-casperjs-phantomjs as answered by http://stackoverflow.com/users/1816580/artjom-b – iChux 2015-01-08 08:05:44

16

I've found that, until PhantomJS matures a bit, this is a bit of a headache for them, according to issue 158: http://code.google.com/p/phantomjs/issues/detail?id=158

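So you want to do it anyway? I opted to go a bit higher up to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded, installed and set it up, then took their example code and turned it into this proxy.py: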

# Python 2 script (mimetools and StringIO are Python 2 modules).
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO


class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

    def do_request(self, data):
        # Ask the server not to gzip the response so the body can be saved as-is.
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        # Split the status line from the remaining headers and body.
        status_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        # getheader() returns None instead of raising when the header is missing.
        content_type = headers.getheader('content-type')
        print "Content type: %s" % content_type
        if content_type == 'text/x-comma-separated-values':
            # Note: this saves the raw response, headers included.
            with open('data.csv', 'wb') as f:
                f.write(data)
        print ''
        return data


if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()

Then I fire that up:

python proxy.py

Next I run phantomjs with the proxy specified...

phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js

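You may want to turn your security on or the like; it is unnecessary for me at the moment because I am only scraping one source. You should now see a bunch of text flowing through the proxy console, and if it lands on something with the MIME type "text/x-comma-separated-values" it will save it as data.csv. This also saves all the headers along with the content, but if you have come this far I am sure you can figure out how to strip those off.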
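
Stripping the headers off the saved response could look roughly like the sketch below. It assumes the proxy hands over one complete raw HTTP response as a byte string, with the header block ending at the first blank line. `split_response` is a hypothetical helper name (not part of PyMiProxy), the sample response is made up for illustration, and chunked transfer encoding is not handled. It is written to run under both Python 2.7 and Python 3.

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body).

    Assumes `raw` is a complete response held in a byte string and that
    the header block ends at the first blank line (CRLF CRLF).
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        # Header names are case-insensitive, so store them lower-cased.
        # A repeated header simply overwrites the earlier value here.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up sample response, only to show the shape of the result.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/x-comma-separated-values\r\n"
       b"\r\n"
       b"date,kwh\r\n2012-08-24,31.5\r\n")
status_line, headers, body = split_response(raw)
```

With something like this, only `body` would be written to data.csv instead of the whole response.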

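One other detail: I found that I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own Apache web server, but if the data comes out of IIS or the like, decompression fails with errors, and I am not sure about that part of it.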

So my power company won't offer me an API? Fine! We'll do it the hard way!

2012-08-24 21:34:46 Xedecimal

+0

Brilliant idea! – NiKo 2012-10-04 06:27:50

+0

Thanks for this, Xedecimal. – iwek 2012-10-18 17:16:02

16

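I have been on this problem for the past few days. The proxy solution was not very clean in my environment, so I found out where phantomjs's QTNetworking core puts resources when it caches them.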

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://gist.github.com/bshamric/4717583

//for this to work, you have to call phantomjs with the cache enabled: 
//usage: phantomjs --disk-cache=true test.js 

var page = require('webpage').create(); 
var fs = require('fs'); 
var cache = require('./cache'); 
var mimetype = require('./mimetype'); 

//this is the path that the QTNetwork classes use for caching files for its http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F 
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/'; 

var url = 'http://google.com'; 
page.viewportSize = { width: 1300, height: 768 }; 

//when the resource is received, go ahead and include a reference to it in the cache object 
page.onResourceReceived = function(response) { 
    //I only cache images, but you can change this 
    if(response.contentType && response.contentType.indexOf('image') >= 0)
    { 
     cache.includeResource(response); 
    } 
}; 

//when the page is done loading, go through each cachedResource and do something with it, 
//I'm just saving them to a file 
page.onLoadFinished = function(status) { 
    for(var index in cache.cachedResources) {
     var file = cache.cachedResources[index].cacheFileNoPath; 
     var ext = mimetype.ext[cache.cachedResources[index].mimetype]; 
     var finalFile = file.replace("."+cache.cacheExtension,"."+ext); 
     fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b'); 
    } 
}; 

page.open(url, function() { 
    page.render('saved/google.pdf'); 
    phantom.exit(); 
});

Then when you call phantomjs, just make sure the cache is enabled:

phantomjs --disk-cache=true test.js

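Some notes: I wrote this to get the images on a page without using a proxy or taking a low-resolution snapshot. QT uses compression on certain text-file resources, so you will have to deal with decompression if you use this for text files. Also, I ran a quick test to pull in HTML resources, and it did not parse the http headers out of the result. Still, this is useful to me and hopefully others will find it useful too; modify it if you run into problems with a specific content type.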

2013-02-05 21:14:56 brandon

+1

How did you decompress? – KJW 2013-10-15 23:37:19

+0

I would really like to know how you decompressed it. Did you manage it? – 2015-02-28 14:43:39

+0

You, sir, are a trooper. Thank you. – 2015-05-14 03:12:18
