2014-10-20 49 views
4

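Stanford POSTagger: Java heap space out-of-memory error

I am using Python 3 on Ubuntu 14.04 and running the Stanford POSTagger over an article corpus of 67 pages of raw text. An excerpt of the Python script follows: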

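from nltk.tag.stanford import POSTagger

# Read the corpus, one line of raw text per list entry.
with open('the_file.txt', 'r') as file:
    g = file.readlines()

stan = []

english_postagger = POSTagger('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')

# tokenize_fast is my own tokenizer, defined elsewhere in the script (not shown in this excerpt).
for line in g:
    stan.append(english_postagger.tag(tokenize_fast(line)))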

After a number of iterations, I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:31) 
at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:322) 
at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:312) 
at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:135) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:998) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagCoreLabelsOrHasWords(MaxentTagger.java:1788) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagAndOutputSentence(MaxentTagger.java:1798) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTagger(MaxentTagger.java:1709) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTagger(MaxentTagger.java:1770) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTagger(MaxentTagger.java:1543) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTagger(MaxentTagger.java:1499) 
at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1842) 

I also ran the Stanford POS tagger from the command line as:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile sample-input.txt > sample-tagged.txt 

and got a similar error. I even gave Java 2 GB of memory, but still no luck.

Any thoughts, ideas, or hacky workarounds are very welcome!
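One possible workaround, offered as a sketch rather than a confirmed fix: the stack trace ends in `ExactBestSequenceFinder.bestSequence`, which runs over a single sentence, so an unusually long line passed to `tag()` as one sentence may be what exhausts the heap. That cause is an assumption, not something established in this thread. Under that assumption, long token lists can be cut into bounded chunks before tagging. The helper names `split_long_sentence` and `tag_in_chunks`, and the default limit of 100 tokens, are hypothetical choices for illustration.

```python
def split_long_sentence(tokens, max_tokens=100):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def tag_in_chunks(tag, tokens, max_tokens=100):
    """Tag each chunk separately with the callable `tag` and join the results.

    `tag` is any function that maps a list of tokens to a list of
    (token, tag) pairs, for example the tagger's own tag method.
    """
    tagged = []
    for chunk in split_long_sentence(tokens, max_tokens):
        tagged.extend(tag(chunk))
    return tagged
```

In the script above this would be called as `tag_in_chunks(english_postagger.tag, tokenize_fast(line))`. Cutting at a fixed token count ignores sentence boundaries, so tags next to a cut may be less accurate than with proper sentence splitting.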

Good catch @nsanglar, so I tried:

java -Xmx2g -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile raw_text.txt > sample-tagged.txt 

I get an error log message with the following header:

# There is insufficient memory for the Java Runtime Environment to continue. 
# Native memory allocation (malloc) failed to allocate 283639808 bytes for committing reserved memory. 
# Possible reasons: 
# The system is out of physical RAM or swap space 
# In 32 bit mode, the process size limit was hit 
# Possible solutions: 
# Reduce memory load on the system 
# Increase physical memory or swap space 
# Check if swap backing store is full 
# Use 64 bit Java on a 64 bit OS 
#  Decrease Java heap size (-Xmx/-Xms) 
# Decrease number of Java threads 
# Decrease Java thread stack sizes (-Xss) 
# Set larger code cache with -XX:ReservedCodeCacheSize= 
# This output file may be truncated or incomplete. 

# Out of Memory Error (os_linux.cpp:2798), pid=25677, tid=140571167794944 

# JRE version: OpenJDK Runtime Environment (7.0_65-b32) (build 1.7.0_65-b32) 
# Java VM: OpenJDK 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops) 
# Derivative: IcedTea 2.5.2 
# Distribution: Ubuntu 14.04 LTS, package 7u65-2.5.2-3~14.04 
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again 
+0

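I have never seen an error log like this, but it seems you are trying to allocate too much memory to the application (e.g. 2 GB). Can you try less? Try -Xmx512m or -Xmx1024m, it might work better. – nsanglar 2014-10-20 14:13:37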

+0

Thanks for your help. I did as you suggested and again got: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – laila 2014-10-20 15:34:26

Answers

2

Well, it turns out this was a RAM problem: I simply did not have enough memory to run the command. Running it on a server instead worked.
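Since the heap requested with `-Xmx` has to fit in the memory the machine can actually provide (the log above shows a failed native allocation of 283639808 bytes), it can help to check free memory before picking a value. The following is a minimal sketch for Linux that parses text in the format of `/proc/meminfo`. The function names `parse_meminfo` and `reclaimable_mb` are hypothetical, and the fallback of `MemFree + Cached` is only a rough estimate for kernels that do not report `MemAvailable`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in KiB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def reclaimable_mb(info):
    """Rough estimate of usable memory in MiB.

    Uses MemAvailable when the kernel reports it, otherwise falls back
    to MemFree + Cached as an approximation.
    """
    if "MemAvailable" in info:
        kib = info["MemAvailable"]
    else:
        kib = info.get("MemFree", 0) + info.get("Cached", 0)
    return kib // 1024
```

On the machine in question this could be used as `reclaimable_mb(parse_meminfo(open('/proc/meminfo').read()))`. If the result is well below the heap size being requested, a smaller `-Xmx` or a machine with more RAM is probably needed.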

1

You should use -Xmx1024m. I think you made a typo, because at the moment you are using -mx :)

+0

Thanks for pointing that out, although unfortunately it did not solve the problem. – laila 2014-10-20 13:44:30

0

Set the Java heap size from Python, before the tagger is run:

import nltk
nltk.internals.config_java(options='-Xmx3024m')
+1

Consider editing your answer so that it clearly describes the intent – lfender6445 2015-02-03 07:05:19

+0

Thanks, this line worked for me (with lexParser rather than the tagger) – Igor 2018-01-10 08:33:56
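To make the intent of the `config_java` answer explicit, here is a small sketch that builds the heap flag from a number of megabytes and hands it to NLTK. `nltk.internals.config_java` is the call used in the answer; the wrapper names `heap_option` and `configure_nltk_java` are hypothetical. Depending on the installed NLTK version, the Stanford tagger classes may apply their own Java options when they run, so whether this setting reaches the tagger should be checked against that version.

```python
def heap_option(megabytes):
    """Build a JVM maximum-heap flag such as '-Xmx1024m'."""
    megabytes = int(megabytes)
    if megabytes <= 0:
        raise ValueError("megabytes must be a positive integer")
    return "-Xmx{}m".format(megabytes)


def configure_nltk_java(megabytes):
    """Pass the heap flag to NLTK; call this before running the tagger."""
    # Imported here so heap_option() stays usable without NLTK installed.
    import nltk.internals
    nltk.internals.config_java(options=heap_option(megabytes))
```

For example, `configure_nltk_java(1024)` requests a 1 GB maximum heap. The value still has to fit in the machine's available RAM, which is what went wrong with the 2 GB attempt above.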