2017-06-05 224 views

Merging Scrapy output in a field

I have Scrapy output that looks like this:

[{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}, 
       {'name': 'Twiin End Game Varsity Denim Trucker Jacket', 
       'price': {'currency': 'GBP', 
          'outlet': '45.0', 
          'retail': '80.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}, 
       {'name': 'Ashiana Embroidered Large Toiletry Bag With Wateproof ' 
         'Lining', 
       'price': {'currency': 'GBP', 
          'outlet': '25.0', 
          'retail': '35.0'}}]}] 

This happens because I call Loader.load_item() while processing each product.

How can I build a pipeline or an output processor so that only the last processed item of each session is returned, like below?

[{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}, 
       {'name': 'Twiin End Game Varsity Denim Trucker Jacket', 
       'price': {'currency': 'GBP', 
          'outlet': '45.0', 
          'retail': '80.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}, 
       {'name': 'Ashiana Embroidered Large Toiletry Bag With Wateproof ' 
         'Lining', 
       'price': {'currency': 'GBP', 
          'outlet': '25.0', 
          'retail': '35.0'}}]}] 

The last item processed contains all the products of that session. I tried handling this when the spider closes, but without success.
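To make the goal concrete, here is a minimal post-processing sketch in plain Python, not a Scrapy pipeline. It assumes the scraped items have already been collected in a list and that the pair (gender, name) identifies a session; `keep_last_per_session` is a hypothetical helper name, and the sample data is shortened for illustration.

```python
def keep_last_per_session(items):
    """Return one item per (gender, name) pair, keeping the last one seen."""
    latest = {}
    for item in items:
        key = (item["gender"], item["name"])
        # A later item overwrites the earlier one for the same session;
        # the dict keeps the position of the key's first insertion.
        latest[key] = item
    return list(latest.values())


# Shortened sample in the same shape as the output above (assumed data).
scraped = [
    {"gender": "women", "name": "NEW IN: CLOTHING",
     "products": [{"name": "A"}]},
    {"gender": "women", "name": "NEW IN: CLOTHING",
     "products": [{"name": "A"}, {"name": "B"}]},
    {"gender": "women", "name": "NEW IN: SHOES & ACCESSORIES ",
     "products": [{"name": "C"}]},
]
merged = keep_last_per_session(scraped)
```

This only works because each later item is a superset of the earlier ones for the same session, as in the output shown above.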

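I am close to finishing this project. I have researched a lot, tried many things, and read many questions, but none of them dealt with items being stacked in a field.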

My items code:

from scrapy.item import Item, Field 
from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose 


class Session(Item): 
    name = Field() 
    gender = Field() 
    products = Field(
     # no idea what to put... tried Join, Compose and MapCompose 
    ) 


class Product(Item): 
    name = Field() 
    price = Field() 


class Price(Item): 
    outlet = Field() 
    retail = Field() 
    currency = Field() 

My spider code:

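import json
import re

from scrapy.loader import ItemLoader

# The functions below are methods of the spider class.
def parse(self, response): 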
    sessions = response.css("article.feature:nth-of-type(-n+2)") 
    for session in sessions: 
     sessionlink = session.css("a.feature__link::attr(href)").extract_first() 

     lsession = ItemLoader(item=Session(), response=response) 
     lsession.add_value("name", session.css("div.feature__title h3::text").extract_first()) 
     lsession.add_value("gender", re.split("[/]+", response.request.url)[2]) 

     requestsession = response.follow(sessionlink, callback=self.parse_session) 
     requestsession.meta["lsession"] = lsession 
     requestsession.meta["pages"] = 1 
     yield requestsession 

def parse_session(self, response): 
    lsession = response.meta["lsession"] 
    pages = response.meta["pages"] 

    products = response.css("li.product-container:nth-of-type(-n+2)") 

    for product in products: 
     productlink = product.css("a.product-link::attr(href)").extract_first() 
     requestproduct = response.follow(productlink, callback=self.parse_product) 
     requestproduct.meta["lsession"] = lsession 
     requestproduct.meta["productlink"] = productlink 
     yield requestproduct 

    nextpage = response.css("ul.pager li.next a::attr(href)").extract_first() 
    if pages < 2: 
     pages += 1 
     requestnewpage = response.follow(nextpage, callback=self.parse_session) 
     requestnewpage.meta["lsession"] = lsession 
     requestnewpage.meta["pages"] = pages 
     yield requestnewpage 

def parse_product(self, response): 
    lsession = response.meta["lsession"] 
    productlink = response.meta["productlink"] 

    lproduct = ItemLoader(item=Product(), response=response) 

    name = response.css("div.product-hero>h1::text").extract_first() 

    lproduct.replace_value("name", str(name)) 

    pricelink = "AN AJAX LINK TO GET THE PRICE" 

    requestprice = response.follow(pricelink, callback=self.parse_price) 
    requestprice.meta["lsession"] = lsession 
    requestprice.meta["lproduct"] = lproduct 

    yield requestprice 

def parse_price(self, response): 
    lsession = response.meta["lsession"] 
    lproduct = response.meta["lproduct"] 

    lprice = ItemLoader(item=Price(), response=response) 

    pricejson = json.loads(response.body) 
    outletprice = pricejson[0]["productPrice"]["current"]["value"] 
    retailprice = pricejson[0]["productPrice"]["rrp"]["value"] 
    currency = pricejson[0]["productPrice"]["currency"] 

    lprice.replace_value("outlet", str(outletprice)) 
    lprice.replace_value("retail", str(retailprice)) 
    lprice.replace_value("currency", str(currency)) 
    lproduct.replace_value("price", lprice.load_item()) 
    lsession.add_value("products", dict(lproduct.load_item())) 

    yield lsession.load_item() 

Answer


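Yatta! Thinking back to my school days, I remembered closures. I did not know Python had this kind of functional behavior; I am a beginner in this language.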

Since I got a lot of help here, I am posting my solution so that others can use it if they need it.

I made a closure-based counter like this (just a basic one):

def counter(): 
    value = 0 
    def count(op): 
     nonlocal value 
     if op == "add": 
      value += 1 
     elif op == "sub": 
      value -= 1 
     elif op == "get": 
      return value 

    return count 
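As a quick check of how the closure keeps its state, here is a small usage sketch. It repeats the counter so that it runs on its own; the variable names `clothing` and `shoes` are only illustrative.

```python
def counter():
    value = 0

    def count(op):
        nonlocal value  # rebind the variable of the enclosing counter() call
        if op == "add":
            value += 1
        elif op == "sub":
            value -= 1
        elif op == "get":
            return value

    return count


# Each call to counter() creates an independent closure with its own value.
clothing = counter()
shoes = counter()

clothing("add")
clothing("add")
shoes("add")
clothing("sub")

print(clothing("get"), shoes("get"))  # prints: 1 1
```

Because every session gets its own closure, counting products in one session does not affect the others.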

Then I start a counter for each session:

requestsession = response.follow(sessionlink, callback=self.parse_session) 
requestsession.meta["lsession"] = lsession 
requestsession.meta["pcounter"] = counter() 
requestsession.meta["pages"] = 1 

When processing each product, I count up, and I keep passing the counter along until the price has been processed:

pcounter = response.meta["pcounter"]  # the counter created for this session
for product in products: 
    pcounter("add") 
    productlink = product.css("a.product-link::attr(href)").extract_first() 
    requestproduct = response.follow(productlink, callback=self.parse_product) 
    requestproduct.meta["lsession"] = lsession 
    requestproduct.meta["pcounter"] = pcounter 
    requestproduct.meta["productlink"] = productlink 
    yield requestproduct 

After the price has been parsed, I count down, and before loading the "lsession" item loader I check whether all products have been processed:

pcounter = response.meta["pcounter"]  # passed along from parse_product
pcounter("sub") 

if pcounter("get") == 0: 
    yield lsession.load_item() 
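One caveat, which is my assumption and not something I have measured: Scrapy does not guarantee the order in which responses arrive, so a counter that only tracks products could reach zero before a later listing page has scheduled its products. Below is a plain-Python simulation (no Scrapy involved) of a variant that also counts the pending page requests. `Pending`, `simulate`, `on_page` and `on_price` are hypothetical names that stand in for the spider callbacks.

```python
class Pending:
    """Counts outstanding requests for one session (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def add(self, n=1):
        self.count += n

    def done(self):
        """Mark one request as finished; return True when none are left."""
        self.count -= 1
        return self.count == 0


def simulate():
    pending = Pending()
    products, emitted = [], []

    def finish():
        # Emit the session only when no request of any kind is outstanding.
        if pending.done():
            emitted.append(list(products))

    def on_page(names, has_next):
        pending.add(len(names))  # one price request per product
        if has_next:
            pending.add()        # the next listing page is pending too
        finish()                 # this page's own request is complete

    def on_price(name):
        products.append(name)
        finish()

    pending.add()                # the first listing page is scheduled
    on_page(["A", "B"], has_next=True)
    on_price("A")
    on_price("B")                # page 2 is still pending: nothing emitted
    on_page(["C", "D"], has_next=False)
    on_price("C")
    on_price("D")                # last outstanding request: emit once
    return emitted


result = simulate()
```

In this simulation the session is emitted exactly once, after the products of both pages have been handled, even though all products of the first page finish before the second page is parsed.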

I hope this will be useful to someone.