2012-07-09 56 views

I'm trying to run ssvd on some TFIDF vectors in Mahout. When I run it in Java code as follows (with the mahout-0.6 jar), it works fine:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

public static void main(String[] args) throws IOException {
    // vectorOutputPath and ssvdOutputPath are defined elsewhere
    runSSVDOnSparseVectors(vectorOutputPath
        + "/tfidf-vectors/part-r-00000", ssvdOutputPath, 1, 0, 30000, 1);
}

private static void runSSVDOnSparseVectors(String inputPath, String outputPath,
        int rank, int oversampling, int blocks,
        int reduceTasks) throws IOException {
    Configuration conf = new Configuration();
    // get number of reduce tasks from config?
    SSVDSolver solver = new SSVDSolver(conf,
        new Path[] { new Path(inputPath) }, new Path(outputPath),
        blocks, rank, oversampling, reduceTasks);
    solver.setcUHalfSigma(true);
    solver.setcVHalfSigma(true);
    solver.run();
}

I decided I wanted to convert this to a bash script and just use the CLI command instead, but when I do, I get the following error (I tried this on versions 0.5 and 0.7, and neither works; I could try 0.6, but I don't think it's a version thing):

[[email protected] lsa]$ $MAHOUT/mahout ssvd -i $H/test_lsa/v_out/tfidf-vectors -o $H/test_lsa/svd_out -k 1 -p 0 -r 30000 -t 1 
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR= 
MAHOUT-JOB: /usr/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar 
12/07/23 15:00:47 INFO common.AbstractJob: Command line arguments: {--abtBlockHeight=[200000], --blockHeight=[30000], --broadcast=[true], --computeU=[true], --computeV=[true], --endPhase=[2147483647], --input=[/path/to/folder/test_lsa/v_out/tfidf-vectors], --minSplitSize=[-1], --outerProdBlockHeight=[30000], --output=[/path/to/folder/test_lsa/svd_out], --oversampling=[0], --pca=[false], --powerIter=[0], --rank=[1], --reduceTasks=[100], --startPhase=[0], --tempDir=[temp], --uHalfSigma=[false], --vHalfSigma=[false]} 
12/07/23 15:00:49 INFO input.FileInputFormat: Total input paths to process : 100 
Exception in thread "main" java.io.IOException: Q job unsuccessful. 
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:230) 
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:377) 
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:141) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) 
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:171) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) 
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) 
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197) 

I'm running this in distributed mode on a cluster. I've read that Q job failures can be related to the block size, but mine is larger than p + k. I also realize I'm using a ridiculously small input (4 vectors), but like I said, it works in the Java code. What confuses me most is why it works in Java but not from the CLI. I'm fairly sure I've accounted for all of the function's parameters. I could always package the Java code into a jar and call that from the bash script, but that would be pretty hacky...

The job logs say:

2012-07-23 15:00:55,413 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 
2012-07-23 15:00:55,417 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : [email protected] 
2012-07-23 15:00:55,638 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 
2012-07-23 15:00:55,697 ERROR org.apache.mahout.common.IOUtils: new m can't be less than n 
java.lang.IllegalArgumentException: new m can't be less than n 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109) 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233) 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89) 
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128) 
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) 
    at org.apache.hadoop.mapred.Child.main(Child.java:264) 
2012-07-23 15:00:55,731 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 
2012-07-23 15:00:55,733 WARN org.apache.hadoop.mapred.Child: Error running child 
java.lang.IllegalArgumentException: new m can't be less than n 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109) 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233) 
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89) 
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128) 
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) 
    at org.apache.hadoop.mapred.Child.main(Child.java:264) 
2012-07-23 15:00:55,736 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task 

Thanks in advance for the help.


That's not enough information. This trace is just the client saying the job failed. You need to post the errors from the workers. – 2012-07-09 17:32:45


There wasn't any output in the task log files. Is that what you mean? – 2012-07-09 19:38:40


There will definitely be logs from any Hadoop job, at the very least its own output. – 2012-07-09 19:51:06

Answer


Actually, I think it was because some of the sequence files among the tfidf vectors were empty, since I had used too many reducers when generating them (note the "Total input paths to process : 100" line above, for an input of only 4 vectors). That seems like a bug to me.
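If someone else hits the same failure, one quick sanity check is to look for zero-length part files in the tfidf-vectors directory; on HDFS that would be a `hadoop fs -ls` over the vector directory. The sketch below shows the idea on the local filesystem with a fabricated sample directory — the file names are placeholders, so point `dir` at your own data instead:

```shell
# Count empty part-r-* files in a directory of sequence files.
# An empty part file here is the suspected cause of the Q-job failure.
dir=$(mktemp -d)
touch "$dir/part-r-00000"               # empty part file
printf 'data' > "$dir/part-r-00001"     # non-empty part file
empty=$(find "$dir" -name 'part-r-*' -size 0 | wc -l | tr -d ' ')
echo "empty part files: $empty"
```

If the count is nonzero, either regenerate the vectors with fewer reducers or filter out the empty part files before running ssvd.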