
Scrapy basic crawler not working?

So I just recently started a project with Scrapy, and I was quite confused by all the older syntax (SgmlLinkExtractor, etc.), but I somehow managed to put together code that I thought was readable and made sense to me. However, it does not crawl every page of the site; it only visits the start_urls page, and it produces no output file. Can someone explain what I'm missing?

import scrapy
import csv
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    allowed_domains = 'ralphlauren.com/product/'
    start_urls = [
        'http://www.ralphlauren.com/'
    ]
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        name = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//span[@class="reg-price"]/span/text()').extract_first()
        image = response.xpath('//input[@name="enh_0"]/@value').extract_first()
        print("Rules=", rules)
        tup = (name, price, image)
        csvF = open('data.csv', 'w')
        csvWrite = csv.writer(csvF)
        csvWrite.writerow(tup)
        return []

    def parse(self, response):
        pass

I am trying to extract data from the site and write it to a CSV file for every page under /product/.

Here is the log:

2016-12-07 19:46:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: P35Crawler) 
2016-12-07 19:46:49 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'P35Crawler', 'NEWSPIDER_MODULE': 'P35Crawler.spiders', 'SPIDER_MODULES': ['P35Crawler.spiders']} 
2016-12-07 19:46:49 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2016-12-07 19:46:50 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-12-07 19:46:50 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-12-07 19:46:50 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-12-07 19:46:50 [scrapy] INFO: Spider opened 
2016-12-07 19:46:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-12-07 19:46:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-12-07 19:46:51 [scrapy] DEBUG: Redirecting (302) to <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> from <GET http://www.ralphlauren.com/> 
2016-12-07 19:46:51 [scrapy] DEBUG: Crawled (200) <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> (referer: None) 
2016-12-07 19:46:51 [scrapy] INFO: Closing spider (finished) 
2016-12-07 19:46:51 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 497, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 20766, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/302': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 12, 7, 14, 16, 51, 973406), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2016, 12, 7, 14, 16, 50, 287464)} 
2016-12-07 19:46:51 [scrapy] INFO: Spider closed (finished) 

Check your logs; I assume the URLs are being filtered because of 'allowed_domains'. Remove it. – eLRuLL


@eLRuLL Hi, thanks for the reply. I have posted the log. I tried commenting out allowed_domains, but it still didn't work. –

Answer


You should not override the parse() method with an empty one. CrawlSpider uses its own parse() internally to apply the rules, so overriding it with a pass disables the crawl entirely. Just delete that method declaration. Please let me know if this helps.
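For illustration, a minimal corrected sketch of the spider could look like the following. The allow pattern on the LinkExtractor is my guess at the /product/ restriction you mentioned, and opening the CSV in append mode is an assumption so that rows from earlier pages are not overwritten:

import csv
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    # allowed_domains takes bare domain names, not URL paths
    allowed_domains = ['ralphlauren.com']
    start_urls = ['http://www.ralphlauren.com/']

    # CrawlSpider's built-in parse() applies these rules, so do not override it
    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        name = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//span[@class="reg-price"]/span/text()').extract_first()
        image = response.xpath('//input[@name="enh_0"]/@value').extract_first()
        # append mode so each crawled page adds a row instead of replacing the file
        with open('data.csv', 'a', newline='') as csv_file:
            csv.writer(csv_file).writerow((name, price, image))

With the parse() override gone, CrawlSpider's default parse() handles the start_urls response and the rule-based link extraction actually runs.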

UPDATE

Regarding your comment about parsing JS with Scrapy, there are different ways to do it. You need a browser to render the JS. Let's say you want to try Firefox and control it with Selenium.

IMO the best way is to implement a download handler, as I explained in this answer. Otherwise, you can implement a downloader middleware, as described here. The middleware has some disadvantages compared to the handler; for example, the download handler lets you keep the default cache and retry behaviour.

Once you have a basic script working with Firefox, you can switch to PhantomJS by changing just a couple of lines. PhantomJS is a headless browser, which means it does not need to load the whole browser interface, so it is much faster.

Other solutions involve using Docker and Splash, but in the end I consider that overkill, since you would need to run a separate service just to control a browser.

To sum up, the best solution is to implement a download handler that uses Selenium with PhantomJS.
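For reference, here is a rough sketch of the simpler downloader-middleware variant (the class name and module path are my own, and it assumes Selenium and PhantomJS are installed under Scrapy 1.x):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    """Fetch pages with PhantomJS so links built by JavaScript are present in the HTML."""

    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # returning a Response here short-circuits Scrapy's normal download
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

You would enable it from settings.py with something like DOWNLOADER_MIDDLEWARES = {'P35Crawler.middlewares.SeleniumMiddleware': 543} (the module path here is hypothetical). A download handler is wired in through DOWNLOAD_HANDLERS instead, which is the approach recommended above since it keeps the default cache and retry behaviour.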


Hi, thanks, that worked, but now with allowed_domains limited to ralphlauren.com everything is being filtered out, and it still only crawls the first page. –


OK, I noticed that most of the links I need are in the JavaScript part of the site. How do I crawl those? –