How can I crawl and scrape data at the same time?

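This is my first experience with web scraping, and I don't know whether I'm doing it well. The point is that I want to crawl and scrape data at the same time:

Get all the links that I will scrape
Store them in MongoDB

Visit them one by one to scrape their content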

# Crawling: collect all links to be scraped later on
from scrapy import Field, Item, Request, Selector, Spider


class Link(Item):
    # Assumed item definition; the question does not show it
    url = Field()


class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in range(0, 10000, 20)]

    def parse(self, response):
        # Follow the link to the next results page, if there is one
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()

        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)

        # Select the container of every offer link on the current page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')

        items = []  # items collected from this page
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)

        for item in items:
            yield item

# Scraping: get all the stored links from MongoDB and scrape them????


2017-07-13 geek-tech

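What exactly is your use case? Are you mainly interested in the links themselves or in the content of the pages they lead to? In other words, is there any reason to store the links in MongoDB first and scrape the pages only later? If you really do need to store the links in MongoDB, the best approach is to use an item pipeline to store the items. The Scrapy documentation on item pipelines even includes an example of storing items in MongoDB. If you need something more sophisticated, take a look at the scrapy-mongodb package.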
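As a rough illustration of the item pipeline approach, here is a minimal sketch that upserts each scraped link into MongoDB, keyed on its URL. The class name `MongoLinkPipeline`, the `links` collection, the default database name, and the `MONGO_URI` / `MONGO_DATABASE` setting names are assumptions made for this example; none of them come from the question.

```python
class MongoLinkPipeline:
    """Store each scraped item in MongoDB, one document per URL (sketch)."""

    def __init__(self, mongo_uri, mongo_db, collection_name="links"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # assumed collection name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scraping"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        # Upsert on the URL so that re-crawling a link does not duplicate it.
        self.collection.update_one({"url": data["url"]}, {"$set": data}, upsert=True)
        return item
```

To enable it, the pipeline would be listed in the project's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.MongoLinkPipeline": 300}`, where `myproject` is a placeholder for your own project module.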

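Apart from that, a few comments on the code you posted:

Instead of Selector(response).xpath(...), just use response.xpath(...).
If you only need the first element extracted by a selector, use extract_first() rather than extract() followed by indexing.
Don't write if not not next_page:, write if next_page:.
The second loop over items is not needed; yield each item inside the loop over links instead.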


2017-07-13 09:21:38

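Hey, thanks a lot. The website I'm scraping is an e-commerce site where people sell items, and once an item is sold they remove the listing. So, to find out which products sell quickly, I think I have to save the links so that I can check later whether they have been removed. Also, if it is possible to scrape the content of each link before storing that link in MongoDB, please tell me how to do it.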

If the links to individual products follow some common pattern, your best bet is to use CrawlSpider (https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider) with appropriate rules.

Yes, they are individual products, but is there a tutorial for that? I want to visit every link and extract the data shown there...
