New to scrapy, and I definitely need pointers. I've run through some examples, but I'm not getting some of the basics. I'm running scrapy 1.0.3. Back to basics, my Scrapy spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem

class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged
Items:
import scrapy

class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    backers = scrapy.Field()
    totalPledged = scrapy.Field()
    pass
And I'm getting the error:
File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
    item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
My questions are: why isn't the select and extract working properly? I do see people using Selector instead of HtmlXPathSelector.
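For what it's worth, the traceback points at quoting rather than at select/extract themselves: the XPath string is delimited with double quotes and also contains double quotes around backers_count, so Python's parser stops at the second ". A minimal sketch of the fix, showing just the corrected string literals (the surrounding spider code is assumed from the question):

```python
# Original (SyntaxError): hxs.select("//*[@id="backers_count"]/data")
# Using single quotes inside the double-quoted string keeps Python happy:
backers_xpath = "//*[@id='backers_count']/data"
pledged_xpath = "//*[@id='pledged']/data"

print(backers_xpath)
print(pledged_xpath)
```

On Scrapy 1.0.x the same queries can also be written against the response directly, e.g. `response.xpath(backers_xpath).extract()` inside `parse()`, which is why most current examples use Selector rather than the deprecated HtmlXPathSelector.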
Also, I'm trying to save this to a CSV file and automate it on a timer (extract these data points every 30 minutes). If anyone has any pointers to examples of that, they'd get super brownie points :)
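Scrapy can already write CSV for you via its feed exports (`scrapy crawl matrix -o items.csv`). If you'd rather have your own script append one timestamped row per run, a minimal stdlib sketch (`append_row` is a hypothetical helper, not part of Scrapy):

```python
import csv
import io
from datetime import datetime

def append_row(fh, backers, total_pledged):
    """Append one timestamped data point as a CSV row to a file-like object."""
    writer = csv.writer(fh)
    writer.writerow([datetime.utcnow().isoformat(), backers, total_pledged])

# Demo against an in-memory buffer; with a real file you would use
# open("out.csv", "a", newline="") so each run appends a new row.
buf = io.StringIO()
append_row(buf, "1024", "$50,000")
print(buf.getvalue().strip())
```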
If you're on Linux, you can set up a cron script to run the line "0,30 * * * * python your_script.py" every 30 minutes – Back2Basics
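Expanding on Back2Basics' comment, a crontab entry that runs the crawl at minute 0 and 30 of every hour might look like this (the project path and output filename are placeholders, not from the question):

```
# m   h  dom mon dow  command
0,30  *  *   *   *    cd /path/to/matrix_scrape && scrapy crawl matrix -o items.csv
```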