XPath/Scrapy scraping DOCTYPE

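I am building a scraper using Scrapy and XPath. I am interested in grabbing the DOCTYPE from every site I crawl, but I am having a hard time finding documentation on this. It feels like it should be possible, since it is a relatively simple request. Any suggestions?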

Cheers,

Joey

Here is the code I have so far:

import scrapy 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import HtmlResponse 
from tutorial.items import DanishItem 
from scrapy.http import Request 
import csv 


class DanishSpider(scrapy.Spider): 
    name = "dmoz" 
    allowed_domains = [] 
    start_urls = [very long list of websites] 

    def parse(self, response):
        for sel in response.xpath(???):
            item = DanishItem()
            item['website'] = response
            item['DOCTYPE'] = sel.xpath('????').extract()
            yield item

New spider. It retrieves the DOCTYPE, but for some reason it writes the response to my specified .json output file 15 times instead of once:

class DanishSpider(scrapy.Spider): 
    name = "dmoz" 
    allowed_domains = [] 
    start_urls = ["http://wwww.example.com"] 

    def parse(self, response):
        for sel in response.selector._root.getroottree().docinfo.doctype:
            el = response.selector._root.getroottree().docinfo.doctype
            item = DanishItem()
            item['website'] = response
            item['doctype'] = el
            yield item

Source

2014-12-19 Joey Orlando

Since Scrapy uses lxml as its default selector backend, you can use the response.selector handle to get this information from lxml, like this:

response.selector._root.getroottree().docinfo.doctype
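To see what this lookup returns without running a spider, here is a minimal sketch that calls lxml directly. The helper name `doctype_of` and the sample markup are made up for illustration, and it assumes lxml is installed. Note that `_root` is a private attribute; as far as I know, newer Scrapy/parsel releases expose the same element as the public `response.selector.root`, so treat the attribute name as version-dependent.

```python
from lxml import html


def doctype_of(markup: str) -> str:
    """Return the DOCTYPE declaration that lxml recorded while parsing."""
    # document_fromstring parses a full HTML document and returns its root element
    root = html.document_fromstring(markup)
    # The declaration lives on the tree's docinfo, not on any element
    return root.getroottree().docinfo.doctype


# Hypothetical sample page, standing in for a real response body
sample = "<!DOCTYPE html><html><head><title>t</title></head><body><p>hi</p></body></html>"
print(doctype_of(sample))
```

The value comes back as a single string, which matters for the looping problem discussed further down.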

This should be enough, but if you want another approach, read on.

You should be able to get the same information using Scrapy's regular-expression extractor:

response.selector.re("<!\s*DOCTYPE\s*(.*?)>")

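Unfortunately, this does not work, because lxml has a rather questionable behavior (a bug?) of discarding the doctype information when serializing. That is why you cannot get it directly from selector.re.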
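The raw page text still carries the declaration, so the pattern itself can be tried on plain strings before wiring it into a spider. A minimal sketch, where the helper `extract_doctype` and the three sample strings are assumptions made up for illustration:

```python
import re

# Pattern from the answer, compiled once; IGNORECASE accepts "doctype" in any case
DOCTYPE_RE = re.compile(r"<!\s*doctype\s*(.*?)>", re.IGNORECASE)


def extract_doctype(page_text: str) -> str:
    """Return what follows the DOCTYPE keyword, or "" if no declaration is found."""
    match = DOCTYPE_RE.search(page_text)
    return match.group(1) if match else ""


# Hypothetical sample pages standing in for real response text
html5 = "<!DOCTYPE html><html><body></body></html>"
xhtml = ('<!doctype html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
         '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>')
no_doctype = "<html><body></body></html>"

print(extract_doctype(html5))             # html
print(extract_doctype(xhtml))
print(repr(extract_doctype(no_doctype)))  # ''
```

Note that the capture group holds only the part after the keyword (for example `html`), not the full `<!DOCTYPE ...>` declaration that docinfo.doctype returns.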

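You can get around this small obstacle easily by applying the re module directly to the text of response.body, which is serialized correctly:

import re
# response.text is response.body decoded to str (on Python 3, response.body is bytes)
s = re.search(r"<!\s*doctype\s*(.*?)>", response.text, re.IGNORECASE)
doctype = s.group(1) if s else ""

Update: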

As for your other question, the reason is the following. The line:

response.selector._root.getroottree().docinfo.doctype

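returns a string, not a list or a similar iterable of results. So when you iterate over it, you are iterating over the characters of that string. For example, if your DOCTYPE is <!DOCTYPE html>, the string has 15 characters, which is why your loop ran 15 times. You can verify this like so: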

for sel in response.selector._root.getroottree().docinfo.doctype:
    print(sel)

You should see your DOCTYPE string printed one character per line.
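The same effect can be reproduced without Scrapy. In this small sketch the doctype value is a hard-coded sample, and plain dicts stand in for DanishItem (both are assumptions for illustration):

```python
# Sample value; in the spider it would come from docinfo.doctype
doctype = "<!DOCTYPE html>"

# Iterating over a string yields one character per step
chars = [ch for ch in doctype]
print(len(doctype))  # 15
print(chars[:3])     # ['<', '!', 'D']

# Looping over the string builds one item per character...
items_from_loop = [{"doctype": doctype} for _ in doctype]
# ...while using the value directly builds exactly one item
items_without_loop = [{"doctype": doctype}]
print(len(items_from_loop), len(items_without_loop))  # 15 1
```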

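What you should do is remove the for loop entirely and read the value without looping. Also, if you intend to collect the website's URL, change that line to: item['website'] = response.url. So basically: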

def parse(self, response): 
    doctype = response.selector._root.getroottree().docinfo.doctype 
    item = DanishItem() 
    item['website'] = response.url 
    item['doctype'] = doctype 
    yield item

Source

2014-12-19 09:43:19 bosnjak

This works great! The only thing I can't figure out now is why the response gets written to my .json file 15 times instead of once; see my edit above. – 2014-12-19 11:40:27

Check my updated answer. – bosnjak 2014-12-19 12:07:02

Thank you for the detailed explanation! This works great :) – 2014-12-19 13:05:02
