2016-04-27 131 views

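Hadoop returns fewer results than expected

I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints its input and does nothing else). Locally I get 4 results (strings), but on Hadoop I get 3. How does this work?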

I am running Hadoop on Amazon Elastic MapReduce.

mapper.py

#!/usr/bin/env python 

import sys 
import re 
import os 
# Constants declaration 

WINDOW = 10 
OVERLAP = 4 
START_POSITION = 0 
END_POSITION = 0 

# regular expressions 

pattern = re.compile("[a-z]*", re.IGNORECASE) 

a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE) 
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE) 
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE) 
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE) 

# variables initialization 

converted_word = "" 
next_word = "" 
new_character = "" 
filename = "" 
prev_filename = "" 
i = 0 



# Read pairs as lines of input from STDIN
for line in sys.stdin:

    # strip() returns a new string, so the result has to be assigned
    line = line.strip()

    # Hadoop Streaming passes job configuration to the script as environment variables
    filename = os.environ['mapreduce_map_input_file']
    filename = filename.replace("s3://source123/input/", "")

    # check if it's a new file, and reset start position
    if filename != prev_filename:
        START_POSITION = 0
        next_word = ""
        converted_word = ""
        prev_filename = filename

    # loop through every word that matches the pattern
    for word in pattern.findall(line):

        # NOTE: the `if` that pairs with the `else` below is missing from the post
        # as extracted; `len(converted_word) < WINDOW` is an ASSUMED reconstruction.
        if len(converted_word) < WINDOW:

            # NOTE: convert() is not defined anywhere in the post.
            new_character = convert(word)
            converted_word = converted_word + new_character

            if len(converted_word) > (WINDOW - OVERLAP):
                next_word = next_word + new_character

            # print "word= ", word
            # print "converted_word= ", converted_word
        else:
            END_POSITION = START_POSITION + (len(converted_word) - 1)

            print(converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION))

            START_POSITION = START_POSITION + (WINDOW - OVERLAP)
            new_character = convert(word)
            converted_word = next_word + new_character

Log

2016-04-27 19:58:41,293 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation 
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461784308237 
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-KCDMFZJGYO89:i-995f5a41:RunJar:16480 period:60 /mnt/var/em/raw/i-995f5a41_20160427_RunJar_16480_raw.bin 
2016-04-27 19:58:43,477 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032 
2016-04-27 19:58:43,673 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032 
2016-04-27 19:58:44,156 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/mapper.py' for reading 
2016-04-27 19:58:44,267 INFO amazon.emr.metrics.MetricsSaver (main): Thread 1 created MetricsLockFreeSaver 1 
2016-04-27 19:58:44,439 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/source_reducer.py' for reading 
2016-04-27 19:58:44,628 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library 
2016-04-27 19:58:44,630 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 426d94a07125cf9447bb0c2b336cf10b4c254375] 
2016-04-27 19:58:45,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): listStatus s3://source123/input with recursive false 
2016-04-27 19:58:45,265 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1 
2016-04-27 19:58:45,336 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9 
2016-04-27 19:58:45,565 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1461784297295_0004 
2016-04-27 19:58:45,710 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1461784297295_0004 
2016-04-27 19:58:45,743 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-38-52.us-west-2.compute.internal:20888/proxy/application_1461784297295_0004/ 
2016-04-27 19:58:45,744 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1461784297295_0004 
2016-04-27 19:58:53,876 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 running in uber mode : false 
2016-04-27 19:58:53,877 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0% 
2016-04-27 19:59:11,063 INFO org.apache.hadoop.mapreduce.Job (main): map 11% reduce 0% 
2016-04-27 19:59:14,081 INFO org.apache.hadoop.mapreduce.Job (main): map 22% reduce 0% 
2016-04-27 19:59:16,094 INFO org.apache.hadoop.mapreduce.Job (main): map 33% reduce 0% 
2016-04-27 19:59:18,106 INFO org.apache.hadoop.mapreduce.Job (main): map 56% reduce 0% 
2016-04-27 19:59:19,114 INFO org.apache.hadoop.mapreduce.Job (main): map 67% reduce 0% 
2016-04-27 19:59:26,159 INFO org.apache.hadoop.mapreduce.Job (main): map 78% reduce 0% 
2016-04-27 19:59:29,178 INFO org.apache.hadoop.mapreduce.Job (main): map 89% reduce 0% 
2016-04-27 19:59:30,184 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0% 
2016-04-27 19:59:32,196 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 33% 
2016-04-27 19:59:34,207 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 67% 
2016-04-27 19:59:38,228 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100% 
2016-04-27 19:59:40,246 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 completed successfully 
2016-04-27 19:59:40,409 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 55 
    File System Counters 
     FILE: Number of bytes read=190 
     FILE: Number of bytes written=1541379 
     FILE: Number of read operations=0 
     FILE: Number of large read operations=0 
     FILE: Number of write operations=0 
     HDFS: Number of bytes read=873 
     HDFS: Number of bytes written=0 
     HDFS: Number of read operations=9 
     HDFS: Number of large read operations=0 
     HDFS: Number of write operations=0 
     S3: Number of bytes read=864 
     S3: Number of bytes written=130 
     S3: Number of read operations=0 
     S3: Number of large read operations=0 
     S3: Number of write operations=0 
    Job Counters 
     Killed map tasks=1 
     Launched map tasks=9 
     Launched reduce tasks=3 
     Data-local map tasks=9 
     Total time spent by all maps in occupied slots (ms)=6351210 
     Total time spent by all reduces in occupied slots (ms)=2449170 
     Total time spent by all map tasks (ms)=141138 
     Total time spent by all reduce tasks (ms)=27213 
     Total vcore-milliseconds taken by all map tasks=141138 
     Total vcore-milliseconds taken by all reduce tasks=27213 
     Total megabyte-milliseconds taken by all map tasks=203238720 
     Total megabyte-milliseconds taken by all reduce tasks=78373440 
    Map-Reduce Framework 
     Map input records=5 
     Map output records=3 
     Map output bytes=124 
     Map output materialized bytes=562 
     Input split bytes=873 
     Combine input records=0 
     Combine output records=0 
     Reduce input groups=3 
     Reduce shuffle bytes=562 
     Reduce input records=3 
     Reduce output records=6 
     Spilled Records=6 
     Shuffled Maps =27 
     Failed Shuffles=0 
     Merged Map outputs=27 
     GC time elapsed (ms)=2785 
     CPU time spent (ms)=11670 
     Physical memory (bytes) snapshot=5282500608 
     Virtual memory (bytes) snapshot=28472725504 
     Total committed heap usage (bytes)=5977407488 
    Shuffle Errors 
     BAD_ID=0 
     CONNECTION=0 
     IO_ERROR=0 
     WRONG_LENGTH=0 
     WRONG_MAP=0 
     WRONG_REDUCE=0 
    File Input Format Counters 
     Bytes Read=864 
    File Output Format Counters 
     Bytes Written=130 
2016-04-27 19:59:40,409 INFO org.apache.hadoop.streaming.StreamJob (main): Output directory: s3://source123/output/ 

Answer


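A mapper task converts its input into lines and feeds those lines to the standard input of the process.

In this case you must have multiple input files, and you are assuming that all the lines from the different files are fed sequentially (that is, file by file). They are, however, apt to be processed in parallel, so one mapper (which gets several input files) may reset the counter depending on the order in which the lines are distributed.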
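
To illustrate the effect described above, here is a minimal sketch in Python. It is a simplified model, not the asker's mapper: it slides a fixed window over a plain string instead of converting words, and the helper names `windows` and `windows_per_split`, the sample string, and the split size of 14 are all made up for illustration. It compares processing the text in one pass with processing each split independently, the way separate mapper processes would, each starting from position 0.

```python
WINDOW = 10
OVERLAP = 4


def windows(chars, window=WINDOW, overlap=OVERLAP):
    """Return (text, start, end) for every complete window in chars."""
    step = window - overlap
    result = []
    for start in range(0, len(chars) - window + 1, step):
        result.append((chars[start:start + window], start, start + window - 1))
    return result


def windows_per_split(chars, split_size):
    """Run windows() on each split on its own, with fresh state per split.

    This mimics separate mapper processes: positions restart at 0 and
    nothing is carried over from the previous split.
    """
    result = []
    for offset in range(0, len(chars), split_size):
        result.extend(windows(chars[offset:offset + split_size]))
    return result


# Toy data (hypothetical), 28 characters long.
data = "abcdefghijklmnopqrstuvwxyzab"

whole = windows(data)                # one process sees everything
split = windows_per_split(data, 14)  # two processes see 14 characters each

print(len(whole))
print(len(split))
```

In this toy run the single pass yields four windows while the two independent splits yield only two, both reported at start position 0, because the windows that straddle the split boundary are never assembled. The job log in the question reports `Total input paths to process : 1` and `number of splits:9`, so something similar may be happening there even with a single input file.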


So how should I adjust my script?


A first idea might be to turn 'prev_filename' into a dictionary keyed by filename and test whether the dictionary already has the key... – 2016-05-09 10:00:39
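
One way to read the dictionary suggestion is sketched below. This is an assumption about what the commenter meant, not code from the thread: the names `new_state` and `feed`, the `emit` callback, and the sample filenames are hypothetical, and raw characters stand in for the output of the undefined `convert()`. Instead of a single `prev_filename` plus global counters, each filename gets its own start position and buffer, so lines from different files can interleave without resetting each other.

```python
from collections import defaultdict

WINDOW = 10
OVERLAP = 4


def new_state():
    """Fresh per-file state: next start position and pending characters."""
    return {"start": 0, "buffer": ""}


def feed(states, filename, chars, emit):
    """Append chars to the buffer kept for filename and emit complete windows."""
    state = states[filename]  # created on first use by the defaultdict
    state["buffer"] += chars
    step = WINDOW - OVERLAP
    while len(state["buffer"]) >= WINDOW:
        start = state["start"]
        emit((state["buffer"][:WINDOW], filename, start, start + WINDOW - 1))
        # Keep the last OVERLAP characters of the window for the next one.
        state["buffer"] = state["buffer"][step:]
        state["start"] = start + step


states = defaultdict(new_state)
out = []

# Lines from two hypothetical files arriving interleaved.
feed(states, "a.txt", "abcdefg", out.append)
feed(states, "b.txt", "0123456789", out.append)
feed(states, "a.txt", "hijklmnop", out.append)

for record in out:
    print(",".join(str(field) for field in record))
```

Note that this state lives inside one mapper process. It would keep files apart when their lines interleave within that process, but it cannot join windows when a single file is divided among several mapper processes.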