2015-07-03 83 views
3

How do I add a line number to each line? Suppose this is my data:

‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS. 
‘Map’ is responsible to read data from input location. 
it will generate a key value pair. 
that is, an intermediate output in local machine. 
’Reducer’ is responsible to process the intermediate. 
output received from the mapper and generate the final output. 

I want to add a number to each line, as in the output below:

1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS. 
2,‘Map’ is responsible to read data from input location. 
3,it will generate a key value pair. 
4,that is, an intermediate output in local machine. 
5,’Reducer’ is responsible to process the intermediate. 
6,output received from the mapper and generate the final output. 

I then want to save the numbered lines to a file.

I have tried:

object DS_E5 { 
    def main(args: Array[String]): Unit = { 

    var i=0 
    val conf = new SparkConf().setAppName("prep").setMaster("local") 
    val sc = new SparkContext(conf) 
    val sample1 = sc.textFile("data.txt") 
    for(sample<-sample1){ 
     i=i+1 
     val ss=sample.map(l=>(i,sample)) 
     println(ss) 
    } 
} 
} 

But its output looks like this:

Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.)) 
... 

How can I edit my code to produce the output I want?

+1

The question also appears verbatim here: http://bigdataanalyticsnews.com/hadoop-interview-questions-mapreduce/ – Madoc

Answers

5

zipWithIndex is what you need here. It maps an RDD[T] to an RDD[(T, Long)] by adding each element's index as the second element of the pair.

sample1 
    .zipWithIndex() 
    .map { case (line, i) => i.toString + ", " + line } 

Or using string interpolation (see @DanielC.Sobral's comment):

sample1 
    .zipWithIndex() 
    .map { case (line, i) => s"$i, $line" } 
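
To match the output requested in the question (numbering from 1, no space after the comma, saved to a file), the pieces above could be combined roughly as follows. This is a minimal sketch, assuming Spark's Scala API is on the classpath; the object name NumberLines, the helper formatLine and the output path "numbered-output" are hypothetical names chosen for illustration. Note that saveAsTextFile writes a directory of part files rather than a single file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; the input path "data.txt" is taken from the question.
object NumberLines {
  // zipWithIndex starts at 0, so shift by one to get 1-based line numbers.
  def formatLine(line: String, index: Long): String = s"${index + 1},$line"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val numbered = sc.textFile("data.txt")
        .zipWithIndex()                               // RDD[(String, Long)]
        .map { case (line, i) => formatLine(line, i) } // RDD[String], e.g. "1,<first line>"

      // Assumed output location: Spark creates this directory and writes
      // part-* files into it; the directory must not already exist.
      numbered.saveAsTextFile("numbered-output")
    } finally {
      sc.stop()
    }
  }
}
```

Because the result of the map is an RDD[String], the saved lines contain no tuple parentheses.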
+0

You may need 'i + 1' if you want to count from 1. – jwvh

+0

Thanks @zero323, this works, but the output still has parentheses, as in (1,line), and I want to remove them. – AHAD

+0

I'm not sure I understand. Is your output an RDD[String]? – zero323

2

By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.

If you only need to print the output, you can try the following code:

sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))

Basically, this code does the following:

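  1. Calling .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a Tuple, T is the element type of the previous RDD (java.lang.String, I believe), and Long is the index of the element in the RDD.
  2. You have performed a transformation, so now you need an action; foreach fits this case well. It applies a statement to each element of the current RDD, so we simply call println with the formatted string.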
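
The two steps above can be sketched as a complete program. This is an illustration under assumptions, not the answerer's code: the object name PrintNumbered and the helper render are hypothetical, and the index is shifted by one to match the 1-based output in the question. It also collects the results to the driver before printing, since println inside an RDD foreach runs on the executors and, outside local mode, its output does not appear on the driver's console.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; "data.txt" is the input path from the question.
object PrintNumbered {
  // Turns a (line, zero-based index) pair into "n,line" with n starting at 1.
  def render(pair: (String, Long)): String = s"${pair._2 + 1},${pair._1}"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prep").setMaster("local"))
    try {
      // Transformation: lazily describes an RDD[(String, Long)]; nothing runs yet.
      val indexed = sc.textFile("data.txt").zipWithIndex()

      // Action: collect() triggers the job and returns the lines to the driver
      // in RDD order. Only sensible when the data fits in driver memory.
      indexed.map(render).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

For large inputs, writing the mapped RDD out with saveAsTextFile instead of collecting it avoids pulling everything onto the driver.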