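I'm using Scrapy to parse a web page written in Spanish. The problem is that I can't save the text because of encoding errors. The goal is to parse the Spanish text and save it in a database.
This is the parse function: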
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    text = hxs.select('//text()').extract()  # Ex: [u' Sustancia mineral, m\xe1s o menos dura y compacta, que no es terrosa ni de aspecto met\xe1lico.']
    s = "".join(text)
    db = dbf.Dbf("test.dbf", new=True)
    db.addField(
        ("WORD", "C", 25),
        ("DATA", "M", 15000),  # Memo field
    )
    rec = db.newRecord()
    rec["WORD"] = "Stone"
    rec["DATA"] = s
    rec.store()
    db.close()
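When I try to save it to a db (a DBF database) I get an ASCII (128) error. I tried decoding/encoding with 'utf-8' and 'latin1', but without success.
Edit:
To save the db I'm using dbfpy. I added the dbf saving code to the parse function above.
This is the error message: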
Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 280, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 354, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/rae_spider.py", line 54, in parse
    rec.store()
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 211, in store
    self.dbf.append(self)
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/dbf.py", line 214, in append
    record._write()
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 173, in _write
    self.dbf.stream.write(self.toString())
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 223, in toString
    for (_def, _dat) in izip(self.dbf.header.fields, self.fieldData)
  File "/home/katy/Dropbox/proyectos/rae/rae/spiders/fields.py", line 215, in encodeValue
    return str(value)[:self.length].ljust(self.length)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 18: ordinal not in range(128)
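The last frame calls str(value) on a unicode object, which on Python 2 implies an encode with the ASCII codec. Below is a minimal sketch that reproduces the same failure outside dbfpy, using part of the sample string from the parse function. The helper name to_ascii is hypothetical and not part of dbfpy, and whether Latin-1 is the code page the DBF reader expects is only an assumption:

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the implicit ASCII encode outside of dbfpy.
text = u"Sustancia mineral, m\xe1s o menos dura y compacta"


def to_ascii(value):
    # Hypothetical helper: mirrors what str(value) does to a unicode
    # object on Python 2, i.e. an encode with the ASCII codec.
    return value.encode("ascii")


try:
    to_ascii(text)
    failed = False
except UnicodeEncodeError as exc:
    # Same exception type as in the traceback above.
    failed = True
    error_message = str(exc)

# Encoding explicitly with a single-byte code page (an assumption, not
# something dbfpy is confirmed to require) yields a plain byte string
# with one byte per character.
encoded = text.encode("latin-1")
```

If this is what happens inside encodeValue, the value presumably has to be a byte string before it reaches str(), but I have not confirmed how dbfpy treats already-encoded memo data.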
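Can you post the actual error you're seeing, along with the code that causes it? – SimonJ 2010-11-13 23:34:45
So you're getting unicode from your web page. That's fine, just as it should be. That's not your problem. Your problem is "saving it to a dbf database" - you need to show the code that tries to do that; we don't have a crystal ball. You also need to give us a link to the dbf-handling module you're using. – 2010-11-13 23:36:13
Also, please confirm that you mean the DBF files used by dBase III, Visual FoxPro, etc. - if not, what is it? – 2010-11-13 23:43:09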