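Two-stage sorting of map/reduce counts

This Python 3 program attempts to use map/reduce to generate a word-frequency list from a text file. I would like to know how to sort the word counts, which appear as "count" in the yield statement of the second reducer, so that the largest counts come last. At the moment, the tail of the output looks like this: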
"0002" "wouldn"
"0002" "wrap"
"0002" "x"
"0002" "xxx"
"0002" "young"
"0002" "zone"
For context, I pass an arbitrary text file of words to the Python 3 program like this:
python MapReduceWordFreqCounter.py book.txt
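For reference, the ordering asked for above can be sketched in plain Python, outside of mrjob. The helper names `pad_count` and `sort_by_count` and the sample counts are hypothetical and used only for illustration. The point is that zero-padded count strings of equal width sort lexicographically in the same order as the integers they encode, so an ascending sort leaves the largest counts at the end.

```python
def pad_count(count, width=4):
    """Zero-pad a count so that string order matches numeric order."""
    return str(count).rjust(width, "0")


def sort_by_count(word_counts):
    """Return (padded_count, word) pairs sorted ascending by count, then word."""
    pairs = [(pad_count(count), word) for word, count in word_counts.items()]
    return sorted(pairs)


if __name__ == "__main__":
    sample = {"the": 120, "zone": 2, "a": 45, "wrap": 2}  # made-up counts
    for count, word in sort_by_count(sample):
        print(count, word)
```

Note that a fixed width of 4 only preserves numeric order for counts up to 9999; a count of 10000 would produce a five-character string that sorts before "9999".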
Here is the code for MapReduceWordFreqCounter.py:
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# match runs of word characters and apostrophes (whitespace and punctuation are skipped)
WORD_REGEXP = re.compile(r"[\w']+")


class MapReduceWordFreqCounter(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    def mapper_make_counts_key(self, word, count):
        yield str(count).rjust(4, '0'), word

    def reducer_output_words(self, count, words):
        for word in words:
            yield count, word


if __name__ == '__main__':
    MapReduceWordFreqCounter.run()
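
To check what order the two steps are meant to produce, independently of any runner, the same mapper and reducer logic can be simulated in plain Python. This is only a sketch and not mrjob's actual execution model: `run_step` and `word_frequencies` are hypothetical helpers, and `run_step` assumes that all mapper output is sorted globally by key before grouping, which a real runner that splits work across several reducer tasks may not guarantee for the combined output.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_REGEXP = re.compile(r"[\w']+")


def mapper_get_words(_, line):
    for word in WORD_REGEXP.findall(line):
        yield word.lower(), 1


def reducer_count_words(word, values):
    yield word, sum(values)


def mapper_make_counts_key(word, count):
    yield str(count).rjust(4, "0"), word


def reducer_output_words(count, words):
    for word in words:
        yield count, word


def run_step(records, mapper, reducer):
    """Apply mapper, sort everything by key, then reduce each key group."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))  # stable, global sort by key (assumption)
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def word_frequencies(lines):
    """Run both steps in sequence over an iterable of text lines."""
    records = [(None, line) for line in lines]
    counted = run_step(records, mapper_get_words, reducer_count_words)
    return run_step(counted, mapper_make_counts_key, reducer_output_words)


if __name__ == "__main__":
    for count, word in word_frequencies(["the cat and the hat", "The cat"]):
        print(count, word)
```

With a single global sort like this, the padded keys come out in ascending order, so any deviation seen in the real output would point at how the runner partitions or merges reducer output rather than at the key format itself.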