Same Python source code on two different machines produces different behavior

Two machines running Ubuntu 14.04.1. The same source code is run on the same data. One works fine; the other throws a codec decode 0xe2 error. Why is that? (And more importantly, how do I fix it?)

The code in question seems to be:

def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'

    return Text(tokenized)

OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The machine that works is happy with the following:

>>> fh = open('in/train/legal/legal1a_lm_7.txt') 
>>> foo = fh.read() 
>>> fh.close() 
>>> sent_tokenize(foo) 

The problem machine with the UnicodeDecodeError gives the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

Breaking the input file down line by line (via split('\n')) and running each line through sent_tokenize gets us to the offending line:

If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.

Which is actually:

>>> bar[5] 
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.' 
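
For what it's worth, those \xe2\x80\x9c and \xe2\x80\x9d sequences are just the UTF-8 encodings of the curly quotation marks U+201C and U+201D, so the 0xe2 byte from the error is the first byte of a curly quote. A quick check in the Python 2 interpreter confirms it:

>>> '\xe2\x80\x9c'.decode('utf-8')
u'\u201c'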

Update: both machines show a UnicodeDecodeError for:

unicode(bar[5]) 

But only one machine shows an error for:

sent_tokenize(bar[5]) 
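
That first result is consistent with Python 2's defaults rather than anything machine-specific: unicode() with no encoding argument decodes byte strings with the 'ascii' codec, so it fails wherever the bytes aren't plain ASCII. An explicit decode (a sketch, assuming bar[5] holds the UTF-8 bytes shown above) should work on both machines:

>>> unicode(bar[5])            # implicit 'ascii' codec: UnicodeDecodeError everywhere
>>> bar[5].decode('utf-8')     # explicit UTF-8 decode: succeeds everywhere
>>> unicode(bar[5], 'utf-8')   # equivalent spelling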

Please show us the code that raises the exception, along with the input data that triggers it and the full traceback. – 2014-12-03 17:22:30


You still need to include the traceback and sample data for the edited code snippet. – 2014-12-03 17:31:57


The whole project is in Tk, so I'll try to track down the traceback, but it may take some time. Having looked at this code, I'm wondering whether changing the strings to unicode (u'' and u'\n') might not be a bad idea... – dbl 2014-12-03 17:32:08

Answer


Different NLTK versions!

The version that doesn't barf is running NLTK 2.0.4; the version that throws the exception is on 3.0.0.

NLTK 2.0.4 is perfectly happy with:

sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)') 

NLTK 3.0.0 wants unicode (as @tdelaney pointed out in the comments above). So to get results you need:

sent_tokenize(u'(\u201cCisco\u201d)') 
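
More generally, the cleanest fix is to decode the input to unicode as soon as it is read, so NLTK 3.x never sees raw bytes. A minimal sketch on Python 2.7, assuming the file from the question is UTF-8 encoded:

import io
from nltk.tokenize import sent_tokenize

# io.open (unlike the builtin open on Python 2) decodes for you,
# returning unicode instead of a byte string
with io.open('in/train/legal/legal1a_lm_7.txt', encoding='utf-8') as fh:
    foo = fh.read()

sentences = sent_tokenize(foo)   # no UnicodeDecodeError under NLTK 3.x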