
I'm writing spiders with Scrapy to pull some data from a couple of applications that use ASP. The two webpages are almost identical and both require logging in before the scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever for something and never gets past the login done with the FormRequest method: it is stuck on the IIS 5.1 page.

The code for both spiders (they are almost identical, differing only in the IP address) is as follows:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                    formdata={'user': 'the_username',
                              'password': 'my_nice_password'},
                    callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one site
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
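A quick way to sanity-check what FormRequest.from_response builds (which form it picked up, and which fields and headers it will POST) is to craft the same request in a scrapy shell session; the field names below are just the placeholders from the question:

$ scrapy shell http://xxx.xxx.xxx.xxx/reporting/
>>> from scrapy.http import FormRequest
>>> req = FormRequest.from_response(response,
...           formdata={'user': 'the_username', 'password': 'my_nice_password'})
>>> req.url      # where the form actually posts to
>>> req.headers  # compare these against the browser capture
>>> req.body     # urlencoded payload, including any hidden form fields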

Wondering what could be different between them, I inspected the headers with Firefox's Live HTTP Headers add-on and found only one difference: the working webpage runs IIS 6.0, the other IIS 5.1.

Since that alone doesn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:

Interaction with the working webpage (IIS 6.0) using Scrapy:

scrapy --> webpage GET /reporting/ HTTP/1.1 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded) 
scrapy <-- webpage HTTP/1.1 302 Object moved 
scrapy --> webpage GET /reporting/htm/webpage.asp 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/asp/report1.asp 
...Scraping begins
Interaction with the non-working webpage (IIS 5.1) using Scrapy:

scrapy --> webpage GET /reporting/ HTTP/1.1 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded) 
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...? 
scrapy <-- webpage HTTP/1.1 302 Object moved 
...Scrapy waits forever... 

I googled a little and found that, sure enough, IIS 5.1 has a nice kind of "feature" that makes it return HTTP 100 whenever somebody sends it a POST, as shown here.
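To reproduce this outside Scrapy, here is a minimal sketch that sends the same kind of raw HTTP/1.1 POST over a plain socket and dumps whatever status lines come back; the host, path, and form fields are placeholders taken from the question:

import socket

HOST = "xxx.xxx.xxx.xxx"       # placeholder IP, as in the question
BODY = "user=foo&password=bar" # placeholder credentials
REQUEST = ("POST /reporting/ HTTP/1.1\r\n"
           "Host: " + HOST + "\r\n"
           "Content-Type: application/x-www-form-urlencoded\r\n"
           "Content-Length: " + str(len(BODY)) + "\r\n"
           "Connection: close\r\n"
           "\r\n" + BODY)

sock = socket.create_connection((HOST, 80), timeout=10)
sock.sendall(REQUEST.encode("ascii"))
response = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk.decode("ascii", "replace")
sock.close()

# Against the IIS 5.1 server the dump should start with two status
# lines, even though no Expect: 100-continue header was ever sent:
#   HTTP/1.1 100 Continue
#   HTTP/1.1 302 Object moved
print(response[:500])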

So now I know where the root of all evil lies, but I still have to scrape that site... How can I make Scrapy work in this situation? Or am I doing something wrong?

Thanks!

EDIT - console log with the non-working site:

2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot) 
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11 
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'} 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines: 
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened 
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None) 
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds.. 
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds.. 
... 

Can you share your console log? –


Just added the console log... do you need anything else? – llekn


So now you're getting timeouts? Is the POST URL the same as in the working case? The log shows 'POST http://xxx.xxx.xxx.xxx/reporting/' versus 'POST /reporting/index.asp' (possibly because you rewrote the logs before posting). Since you're using Wireshark, can you see anything different in the headers? Have you compared the requests with what the browser sends? –

Answer


Try downloading with HTTP 1.0:

# settings.py 
DOWNLOAD_HANDLERS = { 
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler', 
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler', 
} 
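That works because interim 1xx responses are not part of HTTP/1.0 (RFC 2616 forbids sending them to HTTP/1.0 clients), so the server never emits the 100 Continue that stalls the HTTP/1.1 handler. Note the setting is project-wide; if you want to keep the default HTTP/1.1 handler for the working IIS 6.0 site, one possible arrangement (untested, and assuming your settings module is bot.settings as the log suggests) is a second settings module selected through the SCRAPY_SETTINGS_MODULE environment variable:

# settings_http10.py -- hypothetical extra settings module that reuses
# the normal project settings and only swaps the download handlers
from bot.settings import *  # assumed module name, taken from the log

DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}

Then run the problematic spider with:

SCRAPY_SETTINGS_MODULE=bot.settings_http10 scrapy crawl my_very_nice_spider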

Thanks! Works flawlessly! I'll keep it in mind when scraping IIS 5.1 servers – llekn


@kmixflow我建議將此問題添加到https://github.com/scrapy/scrapy/issues,因爲默認處理程序它應該工作得很好。 – Rolando


Thanks a lot @Rho, I'll file this issue there... I suspect there's no reason it should only be able to scrape one of the applications. – llekn