
I'm writing spiders with Scrapy to pull some data from a couple of applications that use ASP. The two webpages are almost identical and both require logging in before the scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever for something and never gets past the login done with the FormRequest method: it is stuck on the IIS 5.1 page.

The code for both spiders (they are almost identical, differing only in the IP address) is as follows:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                    formdata={'user': 'the_username',
                              'password': 'my_nice_password'},
                    callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one site
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
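A quick way to sanity-check what FormRequest.from_response builds (which form it picked up, and which fields and headers it will POST) is to craft the same request in a scrapy shell session; the field names below are just the placeholders from the question:

$ scrapy shell http://xxx.xxx.xxx.xxx/reporting/
>>> from scrapy.http import FormRequest
>>> req = FormRequest.from_response(response,
...           formdata={'user': 'the_username', 'password': 'my_nice_password'})
>>> req.url      # where the form actually posts to
>>> req.headers  # compare these against the browser capture
>>> req.body     # urlencoded payload, including any hidden form fields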

Wondering what could be different between them, I inspected the headers with Firefox's Live HTTP Headers add-on and found only one difference: the working webpage runs IIS 6.0, the other IIS 5.1.

Since that alone doesn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:

Interaction with the working webpage (IIS 6.0) using Scrapy:

scrapy --> webpage GET /reporting/ HTTP/1.1 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded) 
scrapy <-- webpage HTTP/1.1 302 Object moved 
scrapy --> webpage GET /reporting/htm/webpage.asp 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/asp/report1.asp 
...Scraping begins
Interaction with the non-working webpage (IIS 5.1) using Scrapy:

scrapy --> webpage GET /reporting/ HTTP/1.1 
scrapy <-- webpage HTTP/1.1 200 OK 
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded) 
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...? 
scrapy <-- webpage HTTP/1.1 302 Object moved 
...Scrapy waits forever... 

I googled a little and found that, sure enough, IIS 5.1 has a nice kind of "feature" that makes it return HTTP 100 whenever somebody sends it a POST, as shown here.
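To reproduce this outside Scrapy, here is a minimal sketch that sends the same kind of raw HTTP/1.1 POST over a plain socket and dumps whatever status lines come back; the host, path, and form fields are placeholders taken from the question:

import socket

HOST = "xxx.xxx.xxx.xxx"       # placeholder IP, as in the question
BODY = "user=foo&password=bar" # placeholder credentials
REQUEST = ("POST /reporting/ HTTP/1.1\r\n"
           "Host: " + HOST + "\r\n"
           "Content-Type: application/x-www-form-urlencoded\r\n"
           "Content-Length: " + str(len(BODY)) + "\r\n"
           "Connection: close\r\n"
           "\r\n" + BODY)

sock = socket.create_connection((HOST, 80), timeout=10)
sock.sendall(REQUEST.encode("ascii"))
response = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk.decode("ascii", "replace")
sock.close()

# Against the IIS 5.1 server the dump should start with two status
# lines, even though no Expect: 100-continue header was ever sent:
#   HTTP/1.1 100 Continue
#   HTTP/1.1 302 Object moved
print(response[:500])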

So now I know where the root of all evil lies, but I still have to scrape that site... How can I make Scrapy work in this situation? Or am I doing something wrong?

Thanks!

EDIT - console log with the non-working site:

2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot) 
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11 
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'} 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines: 
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened 
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None) 
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds.. 
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds.. 
... 

Can you share your console log? –


Just added the console log... do you need anything else? – llekn


So now you're getting timeouts? Is the POST URL the same as in the working case? The log shows 'POST http://xxx.xxx.xxx.xxx/reporting/' versus 'POST /reporting/index.asp' (possibly because you rewrote the logs before posting). Since you're using Wireshark, can you see anything different in the headers? Have you compared the requests with what the browser sends? –

Answer


Try downloading with HTTP 1.0:

# settings.py 
DOWNLOAD_HANDLERS = { 
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler', 
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler', 
} 
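That works because interim 1xx responses are not part of HTTP/1.0 (RFC 2616 forbids sending them to HTTP/1.0 clients), so the server never emits the 100 Continue that stalls the HTTP/1.1 handler. Note the setting is project-wide; if you want to keep the default HTTP/1.1 handler for the working IIS 6.0 site, one possible arrangement (untested, and assuming your settings module is bot.settings as the log suggests) is a second settings module selected through the SCRAPY_SETTINGS_MODULE environment variable:

# settings_http10.py -- hypothetical extra settings module that reuses
# the normal project settings and only swaps the download handlers
from bot.settings import *  # assumed module name, taken from the log

DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}

Then run the problematic spider with:

SCRAPY_SETTINGS_MODULE=bot.settings_http10 scrapy crawl my_very_nice_spider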

Thanks! Works flawlessly! I'll keep it in mind when scraping IIS 5.1 servers – llekn


@kmixflow我建議將此問題添加到https://github.com/scrapy/scrapy/issues,因爲默認處理程序它應該工作得很好。 – Rolando


Thanks a lot @Rho, I'll file this issue there... I suspect there's no reason it should only be able to scrape one of the applications. – llekn