Hello, I want to scrape data from http://economictimes.indiatimes.com/archive.cms. All the URLs are archived by date, month, and year. To build the list of URLs first, I adapted the code from https://github.com/FraPochetti/StocksProject/blob/master/financeCrawler/financeCrawler/spiders/urlGenerator.py to my site, so that it recursively extracts the archive URLs with scrapy:
import scrapy
import urllib

def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\.cms')
        totalWeeks += weeks
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts
    with open("eturls.txt", "a") as myfile:
        for post in totalPosts:
            post = post + '\n'
            myfile.write(post)

etUrl()
I saved the file as urlGenerator.py and ran it with the command $ python urlGenerator.py, but I get no output. Could someone help me adapt this code for my site's use case, or suggest any other solution?
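One thing worth checking, assuming you might be running Python 3: `urllib.urlopen` only exists in Python 2 (Python 3 moved it to `urllib.request.urlopen`), so under Python 3 the script would fail before writing anything. Independently of that, you can verify the URL-extraction pattern itself without any network access, using the stdlib `re` module on a sample snippet (the HTML below is invented for illustration, mimicking the archive's month list):

```python
import re

# Invented sample of the archive page's month list, for illustration only
sample_html = """
<ul>
  <li><a href="http://economictimes.indiatimes.com/archive.cms/2013-8/news.cms">Aug 2013</a></li>
  <li><a href="http://economictimes.indiatimes.com/archive.cms/2013-7/news.cms">Jul 2013</a></li>
</ul>
"""

# The same pattern the script passes to .re(), here with dots escaped explicitly
month_pattern = r'http://economictimes\.indiatimes\.com/archive\.cms/\d+-\d+/news\.cms'
months = re.findall(month_pattern, sample_html)
print(months)
```

If a check like this matches but the real run produces nothing, the problem is more likely the fetch step (or the live page's markup differing from the assumed `//ul/li/a` structure) than the regex.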
Is there a call to `etUrl()`, conventionally guarded by an `if __name__ == "__main__": etUrl()` type construct? –
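The guard the commenter is referring to looks like this (a generic sketch with a placeholder body, not tied to this script):

```python
def etUrl():
    # Placeholder body for illustration; the real function does the crawling
    print("collecting URLs...")

# Runs only when the file is executed as a script,
# not when it is imported as a module
if __name__ == "__main__":
    etUrl()
```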
It is also **very WEIRD** to install Scrapy but then use `urllib`-based request/response; arguably 50% of Scrapy's power is in how it handles the whole process, including having explicit callbacks to avoid the 4-deep indentation you have there –
I took the liberty of tidying up your post, since I assume you didn't intend to recursively call etUrl() at the bottom... – Iguananaut