First, if the file is in UTF-8 and you are on Python 2, it is better to pass the encoding='utf8' parameter to io.open():
import io
from nltk import word_tokenize, sent_tokenize

with io.open('file.txt', 'r', encoding='utf8') as fin:
    document = []
    for line in fin:
        # one token list per sentence
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]
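After the loop, document holds one token list per sentence; for the file.txt shown at the bottom of this answer it would look like:

>>> document
[[u'this', u'is', u'a', u'paragph', u'.'], [u'with', u'many', u'sentences', u'.'], [u'yes', u',', u'hahaah..', u'wahahha', u'...']]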
If it's Python 3, just do:
from nltk import word_tokenize, sent_tokenize

with open('file.txt', 'r') as fin:
    document = []
    for line in fin:
        # same loop as above; Python 3 strings are already unicode
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]
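As an aside, Python 3's built-in open() also takes an encoding argument, and passing it explicitly is safer than relying on the platform's locale default:

from nltk import word_tokenize, sent_tokenize

# explicit encoding, rather than the locale default
with open('file.txt', 'r', encoding='utf8') as fin:
    document = []
    for line in fin:
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]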
Do take a look at http://nedbatchelder.com/text/unipain.html
As for tokenization, if we assume that each line contains a paragraph that may consist of one or more sentences, we first want to initialize a list to store the whole document:
document = []
Then we iterate through the lines and split each line up into sentences:
for line in fin:
    sentences = sent_tokenize(line)
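For example, on the first line of the file.txt used further down, sent_tokenize returns one string per sentence:

>>> from nltk import sent_tokenize
>>> sent_tokenize('this is a paragph. with many sentences.')
['this is a paragph.', 'with many sentences.']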
Then we split the sentences up into tokens:
tokens = [word_tokenize(sent) for sent in sent_tokenize(line)]
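And word_tokenize splits a single sentence into word and punctuation tokens:

>>> from nltk import word_tokenize
>>> word_tokenize('this is a paragph.')
['this', 'is', 'a', 'paragph', '.']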
Since we want to update our document list to store the tokenized sentences, we use:
document = []
for line in fin:
    document += [word_tokenize(sent) for sent in sent_tokenize(line)]
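Note that += on a list is the same as list.extend(), so an equivalent (and arguably more explicit) spelling of the loop is:

document = []
for line in fin:
    for sent in sent_tokenize(line):
        document.append(word_tokenize(sent))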
Not recommended! (But it can still be done in one line):
$ cat file.txt
this is a paragph. with many sentences.
yes, hahaah.. wahahha...
$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> list(chain(*[[word_tokenize(sent) for sent in sent_tokenize(line)] for line in io.open('file.txt', 'r', encoding='utf8')]))
[[u'this', u'is', u'a', u'paragph', u'.'], [u'with', u'many', u'sentences', u'.'], [u'yes', u',', u'hahaah..', u'wahahha', u'...']]
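If a one-liner is really wanted, itertools.chain.from_iterable avoids the * unpacking and reads slightly better:

>>> list(chain.from_iterable([word_tokenize(sent) for sent in sent_tokenize(line)] for line in io.open('file.txt', 'r', encoding='utf8')))
[[u'this', u'is', u'a', u'paragph', u'.'], [u'with', u'many', u'sentences', u'.'], [u'yes', u',', u'hahaah..', u'wahahha', u'...']]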
Decode first, then lowercase. Otherwise you'll get incorrect behavior with non-ASCII characters. – alexis
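One way to apply that advice to the loops above is to lowercase each (already decoded) line before tokenizing:

for line in fin:
    document += [word_tokenize(sent) for sent in sent_tokenize(line.lower())]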