2013-02-27 211 views
1

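The URL pattern is
http://www.khmer24.com/ad/change-petrol-to-gas-use-injector-special-price/67-204320.html
I want to keep the domain, the "ad" segment, and the number 67 fixed. Here is a sample URL:
http://www.khmer24.com/ad/ANY-STRING/67-123456789.html
How do I write a recursive Scrapy rule for this with a regular expression?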

Here is my spider code:

from scrapy.item import Item, Field 

class Khmer24(Item): 
    title = Field() 
    price = Field() 

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 


class MySpider(CrawlSpider): 
    name = "khmer24" 
    allowed_domains = ["www.khmer24.com"] 
    start_urls = ["http://www.khmer24.com/"] 
    #HERE IS WHERE I GET STUCK 
    rules = (Rule (SgmlLinkExtractor(allow=("index/ad\d\s\67-\d00.html",),restrict_xpaths=('//p[@class="nextpage"]',)) 
    , callback="parse_items", follow= True), 
    ) 

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='innerbox']")
        items = []
        for title in titles:
            item = Khmer24()
            item["title"] = title.select("h1/text()").extract()
            item["price"] = title.select("table/tr/td/p[@class='description']/span[@class='price']/strong/text()").extract()
            items.append(item)
        return items
+0

Give it a try! ;-) – 2013-02-27 22:38:15

Answers

2

It sounds like you are just looking for the `allow` pattern for the link extractor. Try this:

/ad/[^/]+/67-\d+\.html 
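Before putting the pattern into a rule, it can be checked against a few URLs with plain `re`. This is only a sketch: the helper name `is_ad_link` and the two non-matching ad URLs are made up for illustration, and it assumes the link extractor applies `allow` patterns as a search over the absolute URL, which is how it behaves as far as I know.

```python
import re

# Pattern from the answer: "/ad/", one path segment of any text,
# then "/67-<digits>.html".
AD_URL = re.compile(r"/ad/[^/]+/67-\d+\.html")


def is_ad_link(url):
    """Hypothetical helper: True if the pattern occurs anywhere in the URL."""
    return AD_URL.search(url) is not None


samples = [
    # The two URLs from the question.
    "http://www.khmer24.com/ad/change-petrol-to-gas-use-injector-special-price/67-204320.html",
    "http://www.khmer24.com/ad/ANY-STRING/67-123456789.html",
    # Made-up URLs that should be rejected.
    "http://www.khmer24.com/ad/ANY-STRING/68-123456789.html",  # different number
    "http://www.khmer24.com/ad/a/b/67-1.html",  # extra path segment
    "http://www.khmer24.com/",
]

for url in samples:
    print(is_ad_link(url), url)
```

The `[^/]+` part is what stands in for ANY-STRING: it accepts any text up to the next slash, so the domain, `/ad/` and `67-` stay fixed while the title and the ad id can vary.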

On the home page, it extracts links like these:

>>> le = SgmlLinkExtractor(allow=r'/ad/[^/]+/67-\d+\.html') 
>>> le.extract_links(response) 
[Link(url='http://www.khmer24.com/ad/change-petrol-to-gas-use-injector-special-price/67-204320.html', text=u'', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/i-want-to-sell-my-car-toyota-corolla-s-2003/67-253891.html', text=u'', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/corolla-altis-2002/67-242425.html', text=u'Corolla Altis 2002', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/nissan-crew-1997/67-256846.html', text=u'Nissan crew 1997', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/white-nissan-march-2002/67-198118.html', text=u'White Nissan March 2002', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/mercedes-s430-black-year-2000-phnom-penh/67-257711.html', text=u'Mercedes S430, Black Year 2000, Phnom Penh', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/car-for-sale-or-exchangeprado-2007/67-233230.html', text=u'Car for sale or exchange(PRADO 2007)', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/urgent-toyota-hybrid-pruis-2001-abs-brake/67-164632.html', text=u'URGENT Toyota Hybrid PRUIS 2001 . ABS brake', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/camry-97-xle-for-sale/67-254704.html', text=u'Camry 97-XLE For Sale', fragment='', nofollow=False), 
Link(url='http://www.khmer24.com/ad/honda-civic-98-silver-for-sale/67-193666.html', text=u'Honda Civic 98 Silver For Sale', fragment='', nofollow=False)] 
+0

I used your regular expression, but it only shows a few records when there are actually about a thousand. – Vicheanak 2013-02-28 21:52:02

+0

What is the URL? – 2013-03-01 14:28:18