2010-07-16 74 views
5

我正在尋找Scrapy中SQLite管道的一些示例代碼。我知道沒有內置的支持,但我相信它已經完成。只有實際的代碼可以幫助我,因爲我只知道足夠的Python和Scrapy來完成我非常有限的任務,並且需要代碼作爲起點。有沒有人有Scrapy中的sqlite管道的示例代碼?

+0

不是異步運行的Scrapy需要非阻塞數據存儲嗎?在這種情況下,SQLite將無法工作? – zelusp 2016-10-06 18:24:24

+0

似乎sqlite3足夠快速並且足夠聰明地處理併發(最多一點)。 [see here](http://stackoverflow.com/questions/4060772/sqlite3-concurrent-access) – zelusp 2016-10-08 00:19:25

回答

9

我做了這樣的事情:

# 
# Author: Jay Vaughan 
# 
# Pipelines for processing items returned from a scrape. 
# Dont forget to add pipeline to the ITEM_PIPELINES setting 
# See: http://doc.scrapy.org/topics/item-pipeline.html 
# 
from scrapy import log 
from pysqlite2 import dbapi2 as sqlite 

# This pipeline takes the Item and stuffs it into scrapedata.db 
class scrapeDatasqLitePipeline(object): 
    def __init__(self): 
     # Possible we should be doing this in spider_open instead, but okay 
     self.connection = sqlite.connect('./scrapedata.db') 
     self.cursor = self.connection.cursor() 
     self.cursor.execute('CREATE TABLE IF NOT EXISTS myscrapedata ' \ 
        '(id INTEGER PRIMARY KEY, url VARCHAR(80), desc VARCHAR(80))') 

    # Take the item and put it in database - do not allow duplicates 
    def process_item(self, item, spider): 
     self.cursor.execute("select * from myscrapedata where url=?", item['url']) 
     result = self.cursor.fetchone() 
     if result: 
      log.msg("Item already in database: %s" % item, level=log.DEBUG) 
     else: 
      self.cursor.execute(
       "insert into myscrapedata (url, desc) values (?, ?)", 
        (item['url'][0], item['desc'][0]) 

      self.connection.commit() 

      log.msg("Item stored : " % item, level=log.DEBUG) 
     return item 

    def handle_error(self, e): 
     log.err(e) 
1

對於任何試圖解決類似問題的人來說,我只是遇到了一個不錯的Sqlite Item Exproter for SQLite:https://github.com/RockyZ/Scrapy-sqlite-item-exporter

scrapy crawl <spider name> -o sqlite.db -t sqlite 

它也可以適於用作項目管道的替代產品出口商:

它包括到您的項目設置後,你可以使用它。

4

這是一個sqlalchemy的sqlite管道。 使用sqlalchemy,您可以根據需要輕鬆更改數據庫。

settings.py附加數據庫配置

# settings.py 
# ... 
DATABASE = { 
    'drivername': 'sqlite', 
    # 'host': 'localhost', 
    # 'port': '5432', 
    # 'username': 'YOUR_USERNAME', 
    # 'password': 'YOUR_PASSWORD', 
    'database': 'books.sqlite' 
} 

然後在pipelines.py添加以下

# pipelines.py 
import logging 

from scrapy import signals 
from sqlalchemy import Column, Integer, String, DateTime 
from sqlalchemy import create_engine 
from sqlalchemy.engine.url import URL 
from sqlalchemy.ext.declarative import declarative_base 
from sqlalchemy.orm import sessionmaker 
from sqlalchemy.pool import NullPool 

logger = logging.getLogger(__name__) 

DeclarativeBase = declarative_base() 

class Book(DeclarativeBase): 
    __tablename__ = "books" 

    id = Column(Integer, primary_key=True) 
    title = Column('title', String) 
    author = Column('author', String) 
    publisher = Column('publisher', String) 
    url = Column('url', String) 
    scrape_date = Column('scrape_date', DateTime) 

    def __repr__(self): 
     return "<Book({})>".format(self.url) 


class SqlitePipeline(object): 
    def __init__(self, settings): 
     self.database = settings.get('DATABASE') 
     self.sessions = {} 

    @classmethod 
    def from_crawler(cls, crawler): 
     pipeline = cls(crawler.settings) 
     crawler.signals.connect(pipeline.spider_opened, signals.spider_opened) 
     crawler.signals.connect(pipeline.spider_closed, signals.spider_closed) 
     return pipeline 

    def create_engine(self): 
     engine = create_engine(URL(**self.database), poolclass=NullPool, connect_args = {'charset':'utf8'}) 
     return engine 

    def create_tables(self, engine): 
     DeclarativeBase.metadata.create_all(engine, checkfirst=True) 

    def create_session(self, engine): 
     session = sessionmaker(bind=engine)() 
     return session 

    def spider_opened(self, spider): 
     engine = self.create_engine() 
     self.create_tables(engine) 
     session = self.create_session(engine) 
     self.sessions[spider] = session 

    def spider_closed(self, spider): 
     session = self.sessions.pop(spider) 
     session.close() 

    def process_item(self, item, spider): 
     session = self.sessions[spider] 
     book = Book(**item) 
     link_exists = session.query(Book).filter_by(url=item['url']).first() is not None 

     if link_exists: 
      logger.info('Item {} is in db'.format(book)) 
      return item 

     try: 
      session.add(book) 
      session.commit() 
      logger.info('Item {} stored in db'.format(book)) 
     except: 
      logger.info('Failed to add {} to db'.format(book)) 
      session.rollback() 
      raise 

     return item 

items.py應該是這樣的

#items.py 
import scrapy 

class BookItem(scrapy.Item): 
    title = scrapy.Field() 
    author = scrapy.Field() 
    publisher = scrapy.Field() 
    scrape_date = scrapy.Field() 

您也可以考慮移動class Book into items.py

相關問題