I'm trying to scrape several pages. First, in the parse function I extract the article URLs (xpath on @href) from the first page. Then I try to scrape the articles in the request callback parse2, but it doesn't work. How can I fix this? (Scraping multiple pages with Scrapy.)
import scrapy
from string import join
from article.items import ArticleItem

class ArticleSpider(scrapy.Spider):
    name = "article"
    allowed_domains = ["http://joongang.joins.com"]
    j_classifications = ['politics','money','society','culture']
    start_urls = ["http://news.joins.com/politics",
                  "http://news.joins.com/society",
                  "http://news.joins.com/money",]

    def parse(self, response):
        sel = scrapy.Selector(response)
        urls = sel.xpath('//div[@class="bd"]/ul/li/strong[@class="headline mg"]')
        items = []
        for url in urls:
            item = ArticleItem()
            item['url'] = url.xpath('a/@href').extract()
            item['url'] = "http://news.joins.com" + join(item['url'])
            items.append(item['url'])
        for itm in items:
            yield scrapy.Request(itm, callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        sel = scrapy.Selector(response)
        articles = sel.xpath('//div[@id="article_body"]')
        for article in articles:
            item['article'] = article.xpath('text()').extract()
            items.append(item['article'])
        return items
Have you tried adding a `print itm` after building the list and checking whether it returns a valid URL? – Vaulstein
The result of `print itm` is the URL (a unicode string): http://news.joins.com/article/18860833 – LeeJinHee
What error are you getting? Or is nothing being yielded at all? – Vaulstein
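Two things in the spider look suspect. `allowed_domains` expects bare domain names (e.g. `"joins.com"`), not URLs with a scheme, and the extracted links point at `news.joins.com` rather than `joongang.joins.com`, so Scrapy's offsite filter may be silently dropping every request to `parse2`. Also, `parse2` appends to an `items` list that is never defined there (it should yield the item instead), and building URLs by string concatenation is fragile. A minimal sketch of the URL-joining step using only the standard library (the sample href is the one from the comment above; the base URL is from the question):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in urlparse

# Hypothetical hrefs as .extract() would return them: a list of strings.
hrefs = ["/article/18860833"]

base = "http://news.joins.com"
# urljoin handles both relative paths ("/article/...") and already-absolute
# URLs correctly, unlike plain string concatenation with join().
urls = [urljoin(base, href) for href in hrefs]
print(urls)  # → ['http://news.joins.com/article/18860833']
```

Inside the spider the same idea is usually written as `response.urljoin(href)`, which resolves the href against the page that was just fetched.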