Scrapy Splash screenshots?

I am trying to scrape a website while also taking a screenshot of every page. So far, I have managed to piece together the following code:

import json
import base64
import scrapy
from scrapy_splash import SplashRequest


class ExtractSpider(scrapy.Spider):
    name = 'extract'

    def start_requests(self):
        url = 'https://stackoverflow.com/'
        splash_args = {
            'html': 1,
            'png': 1
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        png_bytes = base64.b64decode(response.data['png'])

        imgdata = base64.b64decode(png_bytes)
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)

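It reaches the site fine (for example, stackoverflow) and returns data in png_bytes, but when that data is written to a file the result is a broken image (it doesn't load).

Is there a way to fix this, or a more efficient solution? I have read that Splash Lua scripts can do this, but I have been unable to find a way to implement it. Thanks.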


2017-07-18 Exam Orph

You are decoding from base64 twice:

png_bytes = base64.b64decode(response.data['png'])
imgdata = base64.b64decode(png_bytes)

Simply do:

def parse_result(self, response):
    imgdata = base64.b64decode(response.data['png'])
    filename = 'some_image.png'
    with open(filename, 'wb') as f:
        f.write(imgdata)


2017-07-18 16:41:33

Thank you so much - very helpful!

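If you don't mind one more question: do you know how to screenshot the whole page? I tried setting 'render_all' to True as part of the args, but got the following error: 'WARNING: Bad request to Splash: {'info': {'argument': 'render_all', 'type': 'bad_argument', 'description': 'Pass non-zero 'wait' to render full webpage'}, 'type': 'BadOption', 'description': 'Incorrect HTTP API arguments', 'error': 400}'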

I found the solution - it was the wait delay, which lets the full render happen! All sorted now, thanks again for your help.
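The fix described in the comments can be sketched as a small helper that builds the Splash arguments for a full-page screenshot. This is only an illustration: the function name `full_page_splash_args` is hypothetical, and the default wait of 1.0 second is an assumed value that may need tuning per site. The only constraint taken from the error message above is that 'wait' must be non-zero when 'render_all' is set.

```python
def full_page_splash_args(wait=1.0):
    """Build render.json arguments for a full-page PNG screenshot.

    Hypothetical helper; the default wait of 1.0 s is an assumption.
    """
    if wait <= 0:
        # Splash rejects render_all unless a non-zero wait is supplied.
        raise ValueError("render_all needs a non-zero 'wait'")
    return {
        'html': 1,        # include the rendered HTML in the JSON response
        'png': 1,         # include a base64-encoded PNG screenshot
        'render_all': 1,  # capture the whole page, not only the viewport
        'wait': wait,     # seconds to wait so the page can finish rendering
    }
```

In the spider from the question, the result would be passed as `args=full_page_splash_args()` in the existing `SplashRequest(url, self.parse_result, endpoint='render.json', ...)` call.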
