
I cannot get a Python-based Hadoop streaming job to run. I can execute the following streaming job successfully:

sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo1 -mapper 'wc -l' -numReduceTasks 0 

But when I try to execute a streaming job using Python on the same 5-node Hadoop cluster:

sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo5 -mapper 'AttributeMax.py 8' -file '/tmp/AttributeMax.py' -numReduceTasks 1 

I get an error:

packageJobJar: [/tmp/AttributeMax.py, /tmp/hadoop-hdfs/hadoop-unjar206224/] [] /tmp/streamjob4074525553604040275.jar tmpDir=null 
14/08/29 11:22:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
14/08/29 11:22:58 INFO mapred.FileInputFormat: Total input paths to process : 1 
14/08/29 11:22:59 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local] 
14/08/29 11:22:59 INFO streaming.StreamJob: Running job: job_201408272304_0030 
14/08/29 11:22:59 INFO streaming.StreamJob: To kill this job, run: 
14/08/29 11:22:59 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030 
14/08/29 11:22:59 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030 
14/08/29 11:23:00 INFO streaming.StreamJob: map 0% reduce 0% 
14/08/29 11:23:46 INFO streaming.StreamJob: map 100% reduce 100% 
14/08/29 11:23:46 INFO streaming.StreamJob: To kill this job, run: 
14/08/29 11:23:46 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030 
14/08/29 11:23:46 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030 
14/08/29 11:23:46 ERROR streaming.StreamJob: Job not successful. Error: NA 
14/08/29 11:23:46 INFO streaming.StreamJob: killJob... 

In the jobtracker console I see the error:

java.io.IOException: log:null 
R/W/S=2359/0/0 in:NA [rec/s] out:NA [rec/s] 
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null 
HOST=null 
USER=mapred 
HADOOP_USER=null 
last Hadoop input: |null| 
last tool output: |null| 
Date: Fri Aug 29 11:22:43 CDT 2014 
java.io.IOException: Broken pipe 
    at java.io.FileOutputStream.writeBytes(Native Method) 
    at java.io.FileOutputStream.write(FileOutputStream.java:282) 
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) 
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) 
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) 
    at java.io.DataOutputStream.write(DataOutputStream.java:90) 
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72) 
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51) 
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110) 
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) 
    at org.apache.hadoop.streaming.Pipe 

The Python code itself is very simple:

#!/usr/bin/env python
# Mapper: print the maximum value seen in a given comma-separated column.
import sys

index = int(sys.argv[1])  # column index, passed as the first argument
max_val = 0
for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        val = int(fields[index])
        if val > max_val:
            max_val = val
# Emit the maximum once after all input has been consumed
print max_val
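
Before resubmitting, the mapper can be sanity-checked outside Hadoop by piping a few input lines through it; a minimal sketch, assuming a local copy of the input file (the path here is illustrative):

head -100 /tmp/apat63_99.txt | python AttributeMax.py 8

If the script dies with an exception at this stage, the streaming job fails with the same broken-pipe symptom, because PipeMapper cannot write records to a mapper process that has already exited.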

Answer


I solved my own problem. I had to specify "python" in the mapper:

sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
  -input /sample/cite75_99.txt \
  -output /foo \
  -mapper 'python RandomSample.py 10' \
  -file RandomSample.py \
  -numReduceTasks 1
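
An alternative, assuming the script is shipped with -file and begins with a #!/usr/bin/env python shebang, is to mark it executable so the interpreter does not have to be named in the mapper string:

chmod +x RandomSample.py

Either way, the broken pipe in the original run was a symptom rather than the cause: the mapper process exited before consuming its input, so the framework's writes to the process's stdin failed.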