
Running map-reduce on Cassandra data using Hector

I have been trying to run a simple map-reduce job on data stored in Cassandra using the Java client 'HECTOR'.

I have already successfully run the hadoop-wordcount example explained in that excellent blog post. I have also read the Hadoop Support article.

But what I want to do is a bit different in terms of implementation (the wordcount example uses a script in which mapreduce-site.xml is mentioned). I would like someone to help me understand how to run a map-reduce job on Cassandra data from 'HECTOR' in distributed mode rather than locally.

My code runs the map-reduce job SUCCESSFULLY in local mode. What I want is to run it in distributed mode and write the result as a new ColumnFamily in the Cassandra keyspace.

I guess I have to set something in $PATH_TO_HADOOP/conf/mapred-site.xml to run it in distributed mode (as mentioned in the blog post above), but I don't know what.
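For what it's worth, here is a minimal sketch of what such a mapred-site.xml could look like for a (pseudo-)distributed Hadoop 1.x setup; the JobTracker host and port are placeholders that must match your own cluster:

<?xml version="1.0"?>
<configuration>
  <!-- Point submitted jobs at the JobTracker instead of the default "local" runner -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</configuration>

When mapred.job.tracker is left at its default value of local, Hadoop runs the whole job in-process, which is exactly the local mode described above.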

Here is my code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.StringTokenizer;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class test_forum extends Configured implements Tool { 

private String KEYSPACE = "test_forum"; 
private String COLUMN_FAMILY ="posts"; 
private String OUTPUT_COLUMN_FAMILY = "output_post_count"; 
private static String CONF_COLUMN_NAME = "text"; 


public int run(String[] strings) throws Exception { 

    Configuration conf = new Configuration(); 

    conf.set(CONF_COLUMN_NAME, "text"); 
    Job job = new Job(conf,"test_forum"); 

    job.setJarByClass(test_forum.class); 
    job.setMapperClass(TokenizerMapper.class); 
    job.setReducerClass(ReducerToCassandra.class); 

    job.setMapOutputKeyClass(Text.class); 
    job.setMapOutputValueClass(IntWritable.class); 

    job.setOutputKeyClass(ByteBuffer.class); 
    job.setOutputValueClass(List.class); 

    job.setOutputFormatClass(ColumnFamilyOutputFormat.class); 
    job.setInputFormatClass(ColumnFamilyInputFormat.class); 


    System.out.println("Job Set"); 


    // Tell the Cassandra input/output formats how to reach the cluster (Thrift port, seed node, partitioner).
    ConfigHelper.setRpcPort(job.getConfiguration(), "9160"); 
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost"); 
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner"); 

    ConfigHelper.setInputColumnFamily(job.getConfiguration(),KEYSPACE,COLUMN_FAMILY); 
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY); 

    SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text"))); 

    ConfigHelper.setInputSlicePredicate(job.getConfiguration(),predicate); 

    System.out.println("running job now.."); 

    boolean success = job.waitForCompletion(true); 

    return success ? 0 : 1; 

} 



// Mapper: reads the configured source column of each row and emits (word, 1) for every token in it.
public static class TokenizerMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> 
{ 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
    private ByteBuffer sourceColumn; 
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) 
    throws IOException, InterruptedException 
    { 
     sourceColumn = ByteBufferUtil.bytes(context.getConfiguration().get(CONF_COLUMN_NAME)); 
    } 

    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) throws IOException, InterruptedException 
    { 



     IColumn column = columns.get(sourceColumn); 

     if (column == null) { 
      return; 
     } 

     String value = ByteBufferUtil.string(column.value()); 
     System.out.println("read " + key + ":" + value + " from " + context.getInputSplit()); 

     StringTokenizer itr = new StringTokenizer(value); 

     while (itr.hasMoreTokens()) 
     { 
      word.set(itr.nextToken()); 
      context.write(word, one); 
     } 
    } 


} 

    // Reducer: sums the per-word counts and writes each total back to Cassandra as a single-column Mutation.
    public static class ReducerToCassandra extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> 
{ 
    private ByteBuffer outputKey; 

    public void reduce(Text word, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException 
    { 
     int sum = 0; 

     byte[] keyBytes = word.getBytes(); 
     outputKey = ByteBuffer.wrap(Arrays.copyOf(keyBytes, keyBytes.length)); 


     for (IntWritable val : values) 
      sum += val.get(); 

     System.out.println(word.toString()+" -> "+sum); 
     context.write(outputKey, Collections.singletonList(getMutation(word, sum))); 

    } 

    private static Mutation getMutation(Text word, int sum) 
    { 
     Column c = new Column(); 
     c.setName(Arrays.copyOf(word.getBytes(), word.getLength())); 
     c.setValue(ByteBufferUtil.bytes(String.valueOf(sum))); 
     c.setTimestamp(System.currentTimeMillis()); 

     Mutation m = new Mutation(); 
     m.setColumn_or_supercolumn(new ColumnOrSuperColumn()); 
     m.column_or_supercolumn.setColumn(c); 
     System.out.println("Mutating"); 
     return m; 

    } 

} 




public static void main(String[] args) throws Exception, ClassNotFoundException, InterruptedException { 

    System.out.println("Working..!"); 

    int ret=ToolRunner.run(new Configuration(), new test_forum(), args); 

    System.out.println("Done..!"); 

    System.exit(ret); 

} 

}

Here are the warnings I get:

WARN - JobClient     - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
WARN - JobClient     - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 

The code runs and the map-reduce tasks complete successfully, but I don't know where it writes the data.

EDIT: I had not created the output ColumnFamily in Cassandra, so nothing was being written. That part is solved; the only remaining question is how to run the job in distributed mode.

Thanks.

Answer


Did you create a jar containing your job classes?

Hadoop needs a jar in order to distribute your job classes across the cluster. If you did not build one, that explains the "No job jar file set" warning and why you cannot run the job in distributed mode. Take care to launch your job with the "hadoop jar ..." command and to add the jar dependencies (at least apache-cassandra!). Your Cassandra server must be up and listening on the Thrift port when you submit the job.
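Purely as an illustration of the launch command (the jar names, versions and paths below are placeholders that depend on your build and your Cassandra distribution):

hadoop jar test_forum.jar test_forum \
    -libjars apache-cassandra-<version>.jar,libthrift-<version>.jar

Because the job goes through ToolRunner, generic options such as -libjars are parsed before your own arguments; alternatively, you can bundle the dependency jars in a lib/ directory inside the job jar.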

By the way, Hector is not needed for Hadoop and Cassandra: ColumnFamilyInputFormat (and ColumnFamilyOutputFormat) know how to read from (and write to) Cassandra by themselves. That is why you have to configure the RpcPort, InitialAddress and Partitioner (which you did).

Last note: ColumnFamilyOutputFormat does not create your output column family; it must already exist, otherwise you will get an error when writing.
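Since you already use Hector, a minimal sketch of creating that column family up front might look like the following; the cluster name, host/port and comparator are assumptions you will need to adapt:

import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.factory.HFactory;

public class CreateOutputColumnFamily {
    public static void main(String[] args) {
        // "Test Cluster" and localhost:9160 are placeholders for your cluster name and a seed node.
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");

        // Define output_post_count in the existing test_forum keyspace and register it with the cluster.
        ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
                "test_forum", "output_post_count", ComparatorType.UTF8TYPE);
        cluster.addColumnFamily(cfDef);
    }
}

Creating it once with cassandra-cli works just as well.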

Hope this helps,

Benoit
