2013-04-24

java.lang.OutOfMemoryError: Java heap space when running seq2sparse in Mahout

I want to cluster some hand-crafted data with k-means in Mahout. I created 6 files, each containing barely 1 or 2 words of text, and built a sequence file with ./mahout seqdirectory. When I try to convert the sequence file to vectors with the ./mahout seq2sparse command, I get java.lang.OutOfMemoryError: Java heap space. The sequence file is only about 0.215 KB.

Command: ./mahout seq2sparse -i mokha/output -o mokha/vector -ow

Error log:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Apr 24, 2013 2:25:11 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No seq2sparse.props found on classpath, will use command-line arguments only
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 1
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting mokha/vector
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of reduce tasks: 1
Apr 24, 2013 2:25:12 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to mokha/vector/tokenized-documents
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 5
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO:  FILE_BYTES_READ=1471400
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO:  FILE_BYTES_WRITTEN=1496783
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO:  Map input records=6
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO:  Spilled Records=0
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
INFO:  Map output records=6
Apr 24, 2013 2:25:13 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0002
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
java.lang.OutOfMemoryError: Java heap space
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0002
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Apr 24, 2013 2:25:14 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0003
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0003
java.lang.OutOfMemoryError: Java heap space
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0003
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Apr 24, 2013 2:25:16 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0004
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0004
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
     at java.util.ArrayList.get(ArrayList.java:322)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0004
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Apr 24, 2013 2:25:17 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting mokha/vector/partial-vectors-0
Apr 24, 2013 2:25:17 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/bitnami/mahout/mahout-distribution-0.5/bin/mokha/vector/tf-vectors
     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:350)
     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.processTfIdf(TFIDFConverter.java:151)
     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:262)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answers

I don't know whether you have already tried this, but posting it in case you missed it.

Set the environment variable 'MAVEN_OPTS' to allow for more memory via 'export MAVEN_OPTS=-Xmx1024m'

Reference (in the common problems section): here


We don't use Maven. – 2013-04-24 09:47:20


The bin/mahout script reads the environment variable MAHOUT_HEAPSIZE (in megabytes) and sets the JAVA_HEAP_MAX variable from it if it is present. The Mahout version I use (0.8) sets JAVA_HEAP_MAX to 3g by default. Executing

export MAHOUT_HEAPSIZE=10000m 

before a canopy clustering run seems to help my runs survive longer on a single machine. However, I suspect the best solution is to switch to running on a cluster.
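How the wrapper script consumes that variable can be sketched roughly as follows. The 3g default and the appended "m" suffix mirror the 0.8-era bin/mahout script, but the exact logic varies across versions, so treat both as assumptions and check your own copy of the script:

```shell
# Simplified sketch of how bin/mahout derives the JVM heap flag (assumed
# behavior modeled on the Mahout 0.8 script, not the verbatim code).
JAVA_HEAP_MAX=-Xmx3g                 # default used when MAHOUT_HEAPSIZE is unset
MAHOUT_HEAPSIZE=10000                # megabytes, as exported before the run
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  # The script appends the "m" (megabytes) suffix itself.
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
fi
echo "$JAVA_HEAP_MAX"                # prints -Xmx10000m
```

Exporting MAHOUT_HEAPSIZE before invoking ./mahout seq2sparse then gives the local-mode mapper's sort buffer (io.sort.mb = 100 in the log above) room to allocate inside the heap.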

For reference, there is another related post: Mahout runs out of heap space


I don't think units are allowed in the variable; it should be 'export MAHOUT_HEAPSIZE=10000' – tokland 2016-10-03 09:27:08
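tokland's caution is easy to sanity-check: if the wrapper script appends an "m" suffix itself (assumed here, modeled on the 0.8-era bin/mahout), then putting a unit in the variable doubles the suffix and produces a flag the JVM rejects:

```shell
# With a unit in the value, the script's own "m" suffix is duplicated.
MAHOUT_HEAPSIZE=10000m
JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
echo "$JAVA_HEAP_MAX"    # prints -Xmx10000mm, an invalid heap-size flag
```

So the value should be a bare number of megabytes, as the comment says.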
