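Scrapy Yahoo Groups spider

I'm trying to scrape Y! Groups. I can get the data from one page, but that's it. I have some basic rules, but clearly they aren't right. Has anyone already solved this?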

# Import paths below are those of the Scrapy releases current in 2011.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
# YgroupItem is the asker's own item class; its module is not shown in the post.

class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('message','messages'), deny=('mygroups',))),
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        item = Item()
        for site in sites:
            item = YgroupItem()
            item['title'] = site.select('//title').extract()
            item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
            item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item

2011-03-23 linkingarts

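It looks like you have almost no idea what you are doing. I am fairly new to Scrapy myself, but I think you will want something like Rule(SgmlLinkExtractor(allow=('http\://example\.com/message/.*\.aspx',)), callback='parse_item'), Try writing a regular expression that matches the complete URL of the links you want. Also, it looks like you only need one rule: add the callback to the first one. The link extractor picks up every link that matches a regex in allow, excludes those that match a regex in deny, and each remaining page is then loaded and passed to parse_item.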

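I am saying all of this without knowing anything about the page you are mining or the nature of the data you want. You want to point this kind of spider at a page that links to the pages holding the data you are after.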
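The allow/deny filtering described above can be sketched in plain Python, without Scrapy, to check a pattern before putting it in a rule. This is only an illustration of the idea, not Scrapy's implementation: the helper should_follow and the constants ALLOW and DENY are hypothetical names, the sample links are made up, and the pattern assumes the wanted pages look like /group/<name>/message/<number>. It uses re.search, on the assumption that the link extractor matches its regexes anywhere in the URL.

```python
import re

# Hypothetical patterns: assume message pages look like
# /group/<name>/message/<number>, and that "mygroups" links are unwanted.
ALLOW = [re.compile(r"/group/[^/]+/message/\d+$")]
DENY = [re.compile(r"mygroups")]


def should_follow(url, allow=ALLOW, deny=DENY):
    """Return True if url matches any allow pattern and no deny pattern."""
    allowed = any(p.search(url) for p in allow)
    denied = any(p.search(url) for p in deny)
    return allowed and not denied


# Made-up sample links standing in for what a group page might contain.
links = [
    "http://launch.groups.yahoo.com/group/random_public_ygroup/message/1",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/message/2",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/messages",
    "http://launch.groups.yahoo.com/mygroups",
]

# Keep only the links that pass the allow/deny filter.
followed = [u for u in links if should_follow(u)]
print(followed)
```

If the pattern selects the right links here, the same regex could be tried as the single rule's allow argument, for example allow=(r'/group/[^/]+/message/\d+$',), with callback='parse_item' on that rule.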

2011-03-27 01:57:34 Muhd

Fair enough, thanks. I probably should have expanded to say that I want groupname/message/1, groupname/message/2, etc. (they are aliases of the /post?id=averylongidstring URLs, which can't be used for message 1 or 2) – linkingarts 2011-03-27 04:07:17
