word_tokenize in NLTK doesn't take a list of strings as an argument
from nltk.tokenize import word_tokenize 

music_comments = [['So cant you just run the bot outside of the US? ', ''], ["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", ''], ['Can they do something about all the fucking bots on Tinder next? \n\nEdit: Holy crap my inbox just blew up ', '']] 

print(word_tokenize(music_comments[1])) 

I found this other question, which talks about passing a list of strings into word_tokenize, but in my case, running the code above produces the following traceback:

Traceback (most recent call last): 
    File "testing.py", line 5, in <module> 
    print(word_tokenize(music_comments[1])) 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize 
    return [token for sent in sent_tokenize(text, language) 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize 
    return tokenizer.tokenize(text) 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize 
    return list(self.sentences_from_text(text, realign_boundaries)) 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text 
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)] 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize 
    return [(sl.start, sl.stop) for sl in slices] 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp> 
    return [(sl.start, sl.stop) for sl in slices] 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries 
    for sl1, sl2 in _pair_iter(slices): 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter 
    prev = next(it) 
    File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text 
    for match in self._lang_vars.period_context_re().finditer(text): 
TypeError: expected string or bytes-like object 

What is the problem? What am I missing?


You are passing ONE string to 'word_tokenize()', not a list. That is what the code in the linked question does. (Which, of course, is the answer to your question.) – alexis

Answer


You are feeding a two-item list into word_tokenize():

["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", ''] 

that is, a sentence plus an empty string.

Change your code to the following, and that should do the trick:

print(word_tokenize(music_comments[1][0])) 
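
If you want to tokenize all of the comments rather than just the second one, a minimal sketch along the same lines (reusing the music_comments list from the question, and assuming each inner list stores the comment text at index 0):

from nltk.tokenize import word_tokenize 

# Tokenize the text portion (index 0) of every [comment, ''] pair. 
# The Punkt sentence model must be available, e.g. via nltk.download('punkt'). 
tokenized = [word_tokenize(comment[0]) for comment in music_comments] 
print(tokenized[1]) 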
For reference, here is the relevant method from the NLTK source:
def word_tokenize(self, s): 
    """Tokenize a string to split off punctuation other than periods""" 
    return self._word_tokenizer_re().findall(s) 

This snippet is part of the source code for nltk.tokenize.punkt.

The input to word_tokenize() should be a single string, not a list.
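
To make that contract concrete, here is a small sketch; the tokenized output shown in the comment is what NLTK's default Treebank-style tokenizer typically produces:

from nltk.tokenize import word_tokenize 

comment = ["Just because it's illegal doesn't mean it will stop. ", ''] 

# word_tokenize(comment) would raise TypeError: expected string or bytes-like object, 
# because the tokenizer's regex is applied to the argument directly. 
# Passing the string at index 0 works: 
print(word_tokenize(comment[0])) 
# e.g. ['Just', 'because', 'it', "'s", 'illegal', 'does', "n't", 'mean', 'it', 'will', 'stop', '.'] 

Note how the tokenizer also splits contractions ("it's" becomes "it" and "'s"), which is expected Treebank behaviour.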