
My code comes from the Kaggle word2vec competition, Part 2: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data in successfully. Here BeautifulSoup and NLTK are used to clean the text and remove everything that is not a letter:

import re
from bs4 import BeautifulSoup

def review_to_wordlist(review, remove_stopwords=False): 
    # Function to convert a document to a sequence of words, 
    # optionally removing stop words. Returns a list of words. 
    # 
    # 1. Remove HTML 
    review_text = BeautifulSoup(review).get_text() 
    # 
    # 2. Remove non-letters 
    review_text = re.sub("[^a-zA-Z]", " ", review_text) 
    # 
    # 3. Convert words to lower case and split them 
    words = review_text.lower().split() 
    # 
    # 4. Return a list of words 
    return(words) 
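For example, calling it on a small HTML fragment strips the markup and punctuation and returns lowercase tokens (a hypothetical snippet, not part of the original post):

print review_to_wordlist("<p>This movie was GREAT, 10/10!</p>")
# ['this', 'movie', 'was', 'great']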

But when I continue on to this point, I cannot get any further:

sentences = [] # Initialize an empty list of sentences 

print "Parsing sentences from training set" 
for review in train["review"]: 
    sentences += review_to_sentences(review, tokenizer) 
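For reference, review_to_sentences is the other helper from the same tutorial: it splits a review into sentences with NLTK's punkt tokenizer and runs each sentence through review_to_wordlist. Reproduced roughly from the tutorial (lightly reformatted), it looks like this:

import nltk.data

# Load the punkt sentence tokenizer
# (run nltk.download('punkt') first if it is missing)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Split a review into sentences, then each sentence into words.
    # Returns a list of sentences, where each sentence is a list of words.
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # Skip empty sentences
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences

The traceback below shows the failure happening inside tokenizer.tokenize, on the second line of this helper.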

**Error: what does this mean? The code before this runs fine; I have tried it three times. When execution reaches this point, the following error appears:**
Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "<stdin>", line 6, in review_to_sentences 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize 
    return list(self.sentences_from_text(text, realign_boundaries)) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text 
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)] 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize 
    return [(sl.start, sl.stop) for sl in slices] 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries 
    for sl1, sl2 in _pair_iter(slices): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter 
    for el in it: 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text 
    if self.text_contains_sentbreak(context): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak 
    for t in self._annotate_tokens(self._tokenize_words(text)): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass 
    for t1, t2 in _pair_iter(tokens): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter 
    prev = next(it) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass 
    for aug_tok in tokens: 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words 
    for line in plaintext.split('\n'): 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128) 

When I print len(sentences) I only get 4460; it should normally be 857234.

Answer


This is a UnicodeDecodeError; it happens when your data is not the right type for the encoding step (it should be unicode instead of str). Changing the call to this may help:

`sentences += review_to_sentences(review.decode("utf8"), tokenizer)` 
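In Python 2, .decode("utf8") converts a byte string (str) into a unicode object, so the punkt tokenizer no longer falls back to the default ASCII codec when it hits a non-ASCII byte. A minimal illustration of the failure and the fix (the sample bytes are made up; 0xc2 happens to be the byte named in the traceback, since it starts the UTF-8 encoding of characters such as the non-breaking space):

# -*- coding: utf-8 -*-
raw = "caf\xc3\xa9\xc2\xa0review"      # a Python 2 byte string holding UTF-8 bytes

# Implicit ASCII decoding fails, just like inside punkt's _tokenize_words:
try:
    unicode(raw)                        # uses the default 'ascii' codec
except UnicodeDecodeError as e:
    print e                             # 'ascii' codec can't decode byte 0xc3 ...

# Explicit decoding with the right codec succeeds:
text = raw.decode("utf8")               # now a unicode object
print type(text), text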

But this can take time. The alternative is to specify the "utf-8" encoding up front, when you read the input data:

`pd.read_csv("input_file", encoding="utf-8")` 
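Applied to this tutorial's data, that would look roughly as follows; the file name and the header/delimiter/quoting arguments are taken from the Kaggle tutorial's loading code, so adjust them to match your own:

import pandas as pd

# Read the labeled training data as UTF-8, so that each review
# arrives as a unicode object rather than a raw byte string.
train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3, encoding="utf-8")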

Thank you for your answer; it was indeed a Unicode problem. :)