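WordCount example with per-file counts

I am having trouble getting a per-file breakdown of the total number of occurrences of each word. For example, I have four text files (t1, t2, t3, t4). The word w1 appears twice in file t2 and once in t4, for a total of three occurrences. I want to write exactly that information to the output file. I get the total number of words in each file, but I cannot get the breakdown described above.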

Here is my Map class.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
//line added
import org.apache.hadoop.mapreduce.lib.input.*;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        //line added
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if ((stringWord).matches(pattern)) {
                //context.write(new Text(stringWord), one);
                context.write(new Text(stringWord), one);
                context.write(new Text(fileName), one);
                //System.out.println(fileName);
            }
        }
    }
}

Source

2015-10-06 VD007

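If you want a separate result for each file, run the job four times. If you want a combined result, supply all the files as input; you would need to use MultipleInputs. – YoungHobbit

The first part of the result is fine (that is, the total number of occurrences of each word across all files). But I want to break it down by file name, like w1: 3 occurrences (t2 x 2, t1 x 1). – VD007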

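This can be achieved by writing the word as the key and the filename as the value. Then, in the reducer, initialize a separate counter for each file and update them. Once all the values for a particular key have been iterated, write each file's counter to the context.

Here you know you have only four files, so you can hard-code four variables. Remember that you need to reset those variables for every new key you process in the reducer.

If there are more files than that, you can use a map. In the map, the filename is the key, and you keep updating its value.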
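
A minimal sketch of the counting described above, written in plain Java without the Hadoop classes so it can be run on its own. It assumes the mapper has emitted the word as the key and one filename per occurrence as the value; `PerFileCounter`, `countPerFile` and `formatLine` are hypothetical names used only for illustration, and the output layout is just one possible choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: mirrors what reduce() would do for one word.
class PerFileCounter {

    // fileNames holds one entry per occurrence of the word, as the mapper would emit it.
    static Map<String, Integer> countPerFile(Iterable<String> fileNames) {
        // A new map on every call, i.e. the counters are reset for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (String fileName : fileNames) {
            counts.merge(fileName, 1, Integer::sum);
        }
        return counts;
    }

    // Builds one output line with the total and the per-file breakdown.
    static String formatLine(String word, Iterable<String> fileNames) {
        Map<String, Integer> counts = countPerFile(fileNames);
        int total = 0;
        StringBuilder breakdown = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            total += e.getValue();
            if (breakdown.length() > 0) {
                breakdown.append(", ");
            }
            breakdown.append(e.getKey()).append(" x ").append(e.getValue());
        }
        return word + ": " + total + " (" + breakdown + ")";
    }

    public static void main(String[] args) {
        // Values the reducer would see for key "w1" in the question's example.
        List<String> valuesForW1 = Arrays.asList("t2", "t4", "t2");
        System.out.println(formatLine("w1", valuesForW1)); // prints: w1: 3 (t2 x 2, t4 x 1)
    }
}
```

In a real job the same loop would sit inside `reduce(Text key, Iterable<Text> values, Context context)`, with the map created inside the method so that it starts empty for every key.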

Source

2015-10-06 12:39:48 YoungHobbit

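In the mapper's output we can set the text file name as the key and each line of the file as the value. This reducer then gives you, for each file name, the words and their corresponding counts.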

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Fresh map for every key (file name) so counts do not leak between files.
        HashMap<String, Integer> input = new HashMap<String, Integer>();
        for (Text val : values) {
            String word = val.toString(); // processing each row
            String[] wordarray = word.split(" "); // assuming the delimiter is a space
            for (int i = 0; i < wordarray.length; i++) {
                if (input.get(wordarray[i]) == null) {
                    input.put(wordarray[i], 1);
                } else {
                    int value = input.get(wordarray[i]) + 1;
                    input.put(wordarray[i], value);
                }
            }
        }
        // Write once per file, after all of its lines have been counted.
        context.write(new Text(key), new Text(input.toString()));
    }
}
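
To see what this layout produces, here is a small Hadoop-free sketch of the same per-file counting. It assumes the lines of a single file have already been grouped under that file's name; `FileWordCounter` and `countWords` are hypothetical names, and splitting on any whitespace run (rather than a single space) is my own choice here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: counts the words in the lines of one file.
class FileWordCounter {

    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // \\s+ splits on any run of whitespace; trim avoids a leading empty token.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Lines the reducer would receive for key "t2" (made-up sample data).
        List<String> linesOfT2 = Arrays.asList("w1 w2", "w1 w3");
        System.out.println("t2\t" + countWords(linesOfT2)); // prints: t2	{w1=2, w2=1, w3=1}
    }
}
```

Note that this groups the output by file (file name, then its words), which is the inverse of the per-word breakdown asked for in the question; the counts are the same, only the grouping differs.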

Source

2015-10-07 08:24:19 madhu

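Hi, thanks. In the Map class, I am unable to pass the file name as a variable. – VD007

I am editing my question above to include the Map class. – VD007

I don't see why you can't pass the file name as a variable... – madhu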
