2017-08-13 202 views

I recently tried to use the Stanford Segmenter to process Chinese data in Python, but I ran into a problem when running it. Here is the code I entered in Python:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

segmenter = StanfordSegmenter(path_to_jar='/Applications/Python 3.6/stanford-segmenter/stanford-segmenter.jar',
           path_to_slf4j='/Applications/Python 3.6/stanford-segmenter/slf4j-api-1.7.25.jar',
           path_to_sihan_corpora_dict='/Applications/Python 3.6/stanford-segmenter/data',
           path_to_model='/Applications/Python 3.6/stanford-segmenter/data/pku.gz',
           path_to_dict='/Applications/Python 3.6/stanford-segmenter/data/dict-chris6.ser.gz'
          )

The setup seemed fine, since I did not get any warnings. However, when I tried to segment the Chinese words in a sentence, the segmenter did not work:

sentence = u'這是斯坦福中文分詞器測試' 
segmenter.segment(sentence) 

Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/ie/crf/CRFClassifier : Unsupported major.minor version 52.0 
at java.lang.ClassLoader.defineClass1(Native Method) 
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) 
at java.lang.ClassLoader.defineClass(ClassLoader.java:621) 
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) 
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) 
at java.net.URLClassLoader.access$000(URLClassLoader.java:58) 
at java.net.URLClassLoader$1.run(URLClassLoader.java:197) 
at java.security.AccessController.doPrivileged(Native Method) 
at java.net.URLClassLoader.findClass(URLClassLoader.java:190) 
at java.lang.ClassLoader.loadClass(ClassLoader.java:306) 
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) 
at java.lang.ClassLoader.loadClass(ClassLoader.java:247) 

Traceback (most recent call last): 
File "<pyshell#21>", line 1, in <module> 
segmenter.segment(sentence) 
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 96, in segment 
return self.segment_sents([tokens]) 
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 123, in segment_sents 
stdout = self._execute(cmd) 
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 143, in _execute 
cmd,classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE) 
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 134, in java 
raise OSError('Java command failed : ' + str(cmd)) 
OSError: Java command failed : ['/usr/bin/java', '-mx2g', '-cp', '/Applications/Python 3.6/stanford-segmenter/stanford-segmenter.jar:/Applications/Python 3.6/stanford-segmenter/slf4j-api-1.7.25.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-sighanCorporaDict', '/Applications/Python 3.6/stanford-segmenter/data', '-textFile', '/var/folders/j3/52_wq50j75jfk5ybg6krlw_w0000gn/T/tmpz6dqv1yf', '-sighanPostProcessing', 'true', '-keepAllWhitespaces', 'false', '-loadClassifier', '/Applications/Python 3.6/stanford-segmenter/data/pku.gz', '-serDictionary', '/Applications/Python 3.6/stanford-segmenter/data/dict-chris6.ser.gz', '-inputEncoding', 'UTF-8'] 

I am using Python 3.6.2 on macOS. I wonder whether I missed any important step. Could anyone share their experience in solving this problem? Thank you very much.

Answer


TL;DR

Hold off for a while and wait for NLTK v3.2.5, which will provide a much simpler interface to the Stanford tokenizers, standardized across the different languages.

The StanfordSegmenter and StanfordTokenizer classes will be deprecated in v3.2.5.

First, upgrade your nltk version (e.g. with pip install -U nltk).
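To decide which of the snippets further down applies, compare your installed version (nltk.__version__) against 3.2.5. Here is a minimal sketch; the at_least helper is my own illustration, not part of NLTK:

```python
# Minimal sketch: decide which interface applies, given a version string
# like the one reported by nltk.__version__ (at_least is a made-up helper).
def at_least(version, minimum='3.2.5'):
    as_tuple = lambda v: tuple(int(part) for part in v.split('.'))
    return as_tuple(version) >= as_tuple(minimum)

print(at_least('3.2.4'))  # False: fall back to the CoreNLPParser workaround
print(at_least('3.2.5'))  # True: use the new CoreNLPTokenizer interface
```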

Then, in NLTK v3.2.5, download and start the Stanford CoreNLP server:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31 
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \ 
-serverProperties StanfordCoreNLP-chinese.properties \ 
-preload tokenize,ssplit,pos,lemma,ner,parse \ 
-status_port 9001 -port 9001 -timeout 15000 

Then, in Python:

>>> from nltk.tokenize.stanford import CoreNLPTokenizer 
>>> sttok = CoreNLPTokenizer('http://localhost:9001') 
>>> sttok.tokenize(u'我家沒有電腦。') 
['我家', '沒有', '電腦', '。'] 

Meanwhile, if your NLTK version is v3.2.4, you can try this:

from nltk.parse.corenlp import CoreNLPParser
corenlp_parser = CoreNLPParser('http://localhost:9001', encoding='utf8')
text = u'我家沒有電腦。'
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
tokens = [token['originalText'] or token['word'] for sentence in result['sentences'] for token in sentence['tokens']]
tokens

[out]:

['我家', '沒有', '電腦', '。'] 
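To make the flattening step above concrete without a running server, here is the same comprehension applied to a hand-written mock of the server's JSON response (the field names follow CoreNLP's JSON output format; the data itself is made up):

```python
# Mock of CoreNLP's JSON response (made-up data, real field names).
result = {
    'sentences': [{
        'tokens': [
            {'originalText': '我家', 'word': '我家'},
            {'originalText': '沒有', 'word': '沒有'},
            {'originalText': '電腦', 'word': '電腦'},
            {'originalText': '。', 'word': '。'},
        ],
    }],
}

# Flatten every sentence's tokens into one list, preferring 'originalText'.
tokens = [token['originalText'] or token['word']
          for sentence in result['sentences']
          for token in sentence['tokens']]
print(tokens)  # ['我家', '沒有', '電腦', '。']
```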

Thank you so much. However, I ran into some trouble on the Java side: Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/pipeline/StanfordCoreNLPServer : Unsupported major.minor version 52.0. I have already updated Java to the latest version, but it did not work. Do you have any idea about this problem? Thanks for your help. – MingChe


You need Java 8 to use the Stanford tools. Please reinstall with Java 8; Java 7 will not work. – alvas
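For reference, the "Unsupported major.minor version 52.0" message names the class-file format version the jar was compiled for. A small lookup table (my own sketch, not part of any library; the mapping itself is a standard JVM fact) shows why Java 8 is required:

```python
# Class-file major versions vs. the Java release that can load them
# (illustrative sketch; required_java is a made-up helper).
CLASS_FILE_MAJOR = {50: 'Java 6', 51: 'Java 7', 52: 'Java 8', 53: 'Java 9'}

def required_java(major_version):
    """Return the minimum Java release able to load a class of this version."""
    return CLASS_FILE_MAJOR.get(major_version, 'unknown')

print(required_java(52))  # 'Java 8': what the Stanford jars were compiled for
```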


http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html – alvas