Amazon MapReduce streaming with my own reducer

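I wrote a simple map and reduce program in Python that counts the number of words in each sentence and then groups identical counts together. For example, suppose sentence 1 has 10 words, sentence 2 has 17 words, and sentence 3 has 10 words. The final result would be: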

10 \t 2 
17 \t 1
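
For reference, the same computation can be sketched without MapReduce at all. This is a minimal in-memory version, assuming one sentence per line and whitespace-separated words; the function name `sentence_length_histogram` and the sample sentences are made up for illustration.

```python
from collections import Counter


def sentence_length_histogram(lines):
    """Map each word count to the number of lines that have that many words."""
    return Counter(len(line.split()) for line in lines)


if __name__ == "__main__":
    # Hypothetical sample: two lines of 10 words and one line of 17 words
    sample = [
        "one two three four five six seven eight nine ten",
        "a b c d e f g h i j k l m n o p q",
        "ten words are in this line as well you see",
    ]
    for length, n in sorted(sentence_length_histogram(sample).items()):
        print("%d\t%d" % (length, n))
```

On this sample the loop prints the two rows shown above (10 with a count of 2, 17 with a count of 1). The mapper and reducer split the same work into a per-line step and a per-key step.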

The mapper is:

    import sys
    import re

    # Compiled but never used below; kept from the original script.
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")

    for line in sys.stdin:
        # Key: number of whitespace-separated words on this line
        word = str(len(line.split()))
        count = str(1)
        print("%s\t%s" % (word, count))

The reducer is:

    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        try:
            # Each input line is "<word count>\t<occurrences>"
            word, count = line.strip().split('\t')
            word = int(word)
            count = int(count)
        except ValueError:
            continue  # skip malformed lines
        if current_word == word:
            current_count += count
        else:
            # Key changed: emit the finished group (0 is a valid key)
            if current_word is not None:
                print("%s\t%s" % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the last group
    if current_word is not None:
        print("%s\t%s" % (current_word, current_count))
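
The reducer depends on its input being sorted, so that equal keys arrive on adjacent lines. The same idea can be written more compactly with `itertools.groupby`. The following is only a sketch of that alternative, not the code from the question; the helper names `parse` and `reduce_sorted` are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (key, count) pairs from tab-separated lines, skipping malformed ones."""
    for line in lines:
        try:
            key, count = line.strip().split("\t")
            yield int(key), int(count)
        except ValueError:
            continue


def reduce_sorted(lines):
    """Sum counts per key; equal keys must be adjacent, as they are after sort."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)
```

To use it as a streaming reducer, one would loop over `reduce_sorted(sys.stdin)` and print each pair separated by a tab.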

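I tested it on my local machine with the first 200 lines of the file: head -n 200 sentence.txt | python mapper.py | sort | python reducer.py and the result was correct. Then I ran it on the Amazon MapReduce streaming service, and it failed at the reduce step. So I changed the print statement in the mapper to: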

print("LongValueSum:" + word + "\t" + "1")

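This matches the input expected by the default aggregate reducer of the MapReduce streaming service, so in that case I don't need reducer.py at all. I got the final result for the big file sentence.txt this way. But I still don't know why my own reducer.py failed. Thanks!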
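
To illustrate what the built-in aggregate reducer does with such records, here is a simplified model in Python. It is a sketch only, assuming the `LongValueSum:<key>\t<value>` record form used by Hadoop's aggregate package; it is not Hadoop's actual implementation, and the function name `aggregate_long_value_sum` is hypothetical.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def aggregate_long_value_sum(lines):
    """Sum the integer values of records shaped like 'LongValueSum:<key><TAB><value>'."""
    totals = defaultdict(int)
    for line in lines:
        record = line.rstrip("\n")
        if not record.startswith(PREFIX):
            continue  # this sketch ignores every other aggregator type
        key, value = record[len(PREFIX):].split("\t", 1)
        totals[key] += int(value)
    return dict(totals)
```

With this model, three mapper records for the keys 10, 17 and 10 (each with value 1) add up to 2 for key 10 and 1 for key 17, which is the same result reducer.py is meant to produce.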


2014-10-26 ohmygoddess

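You might want to take a look at mrjob: https://pythonhosted.org/mrjob/. It is a very convenient way to write MapReduce jobs in Python. You can develop locally against a small sample dataset, then scale up to larger datasets on Amazon's Elastic MapReduce with only minor changes to the command line. – 2014-10-27 02:18:34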

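Got it! A "silly" mistake. When I tested locally, I ran the scripts with something like python mapper.py. But for MapReduce, the scripts need to be executable on their own. So just add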

#!/usr/bin/env python

as the first line of each script.
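
A quick local check can catch this before a job is submitted. The sketch below assumes the scripts are ordinary files on local disk; the helper names `has_shebang` and `make_executable` are made up for illustration. Whether the execute bit survives upload depends on how the files are shipped to the cluster, so `make_executable` mainly matters for local runs such as ./mapper.py.

```python
import os
import stat


def has_shebang(path):
    """Return True if the file's first two bytes are '#!'."""
    with open(path, "rb") as f:
        return f.read(2) == b"#!"


def make_executable(path):
    """Add execute permission for user, group and others (like chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling `has_shebang("reducer.py")` on the original reducer would return False, since that file starts with `import sys`.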


2014-10-27 05:12:28 ohmygoddess
