We need to compute a distance matrix (e.g. Jaccard) over a large dataset in Spark, and we are facing a few problems. Please give us some guidance on using the map function in Apache Spark for this kind of large-scale operation.
Issue 1
import info.debatty.java.stringsimilarity.Jaccard;

// Sample dataset creation
List<Row> data = Arrays.asList(
    RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
    RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));
StructType schema = new StructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Distance measure object creation
Jaccard jaccard = new Jaccard();

// Work on each element of the dataset and apply the distance measure
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    Encoders.STRING());
sentenceDataFrame1.show();
There are no compile-time errors, but at runtime we get an exception such as:

org.apache.spark.SparkException: Task not serializable
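"Task not serializable" usually means the closure passed to map captured something Spark cannot serialize over the wire — here the jaccard instance created outside the lambda (or, through it, a non-serializable enclosing class). A common fix is to construct such objects inside the lambda, or to make sure everything captured implements java.io.Serializable. The plain-Java sketch below (no Spark; the helper class name is made up for illustration) shows the same mechanism: a serializable lambda that captures a non-serializable object cannot be serialized, while a lambda that constructs the object in its own body can.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureDemo {
    // Stand-in for a dependency like Jaccard that is not marked Serializable.
    static class NotSerializableHelper {
        int length(String s) { return s.length(); }
    }

    // A function type that is also Serializable, like Spark's MapFunction.
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // Attempts Java serialization of an object; false on NotSerializableException.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NotSerializableHelper helper = new NotSerializableHelper();
        // Captures `helper`, so the whole closure fails to serialize -- the Spark situation.
        SerFunction<String, Integer> capturing = s -> helper.length(s);
        // Creates the helper inside the lambda body: nothing non-serializable is captured.
        SerFunction<String, Integer> selfContained = s -> new NotSerializableHelper().length(s);

        System.out.println(canSerialize(capturing));      // false
        System.out.println(canSerialize(selfContained));  // true
    }
}
```

Applied to the question's code, the analogous change is moving `new Jaccard()` into the lambda body (at the cost of one construction per row, or per partition if done in mapPartitions).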
Issue 2
In addition, we need to find which pair has the highest score, which requires declaring some variables, and we also need to perform other calculations; we are facing many difficulties with this. Even when I try to declare a simple variable such as a counter inside the map block, we cannot capture the incremented value. If we declare it outside the map block, we get many compile-time errors.
int counter = 0;
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
        System.out.println("Name: " + row.getString(1));
        //int counter = 0;
        counter++;
        System.out.println("Counter: " + counter);
        return counter + "";
    }, Encoders.STRING());
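Two separate problems meet here: Java itself rejects `counter++` on a local variable referenced from a lambda (it must be effectively final), and even with a mutable workaround the increments would happen on Spark executors, invisible to the driver — Spark's LongAccumulator (created via `sparkContext().longAccumulator()`) is the usual tool for such side counts. The highest-scoring pair, meanwhile, is better expressed as a reduction over the data than as mutable state. The plain-Java sketch below (no Spark; the Scored record and its score values are made-up placeholders, not real Jaccard results) shows both ideas locally: an AtomicLong as the single-JVM analogue of an accumulator, and a max-by-score reduction.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class MaxPairDemo {
    // Hypothetical pair-with-score record, standing in for one row of the dataset.
    record Scored(String left, String right, double score) {}

    public static void main(String[] args) {
        List<Scored> scored = Arrays.asList(
            new Scored("Hi I heard about Spark", "Hi I Know about Spark", 0.55),
            new Scored("I wish Java could use case classes", "I wish C# could use case classes", 0.70),
            new Scored("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat", 1.00));

        // Mutable holder instead of a plain local: the local-JVM analogue of a LongAccumulator.
        AtomicLong counter = new AtomicLong();
        Scored best = scored.stream()
            .peek(s -> counter.incrementAndGet())            // count processed rows as a side effect
            .max(Comparator.comparingDouble(Scored::score))  // reduction: highest-scoring pair
            .orElseThrow();

        System.out.println(counter.get());  // 3
        System.out.println(best.score());   // 1.0
    }
}
```

In Spark the same shape would be an `agg(max(...))` or `reduce` on the scored Dataset for the best pair, with a LongAccumulator incremented inside the map function if a row count is genuinely needed.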
Please give us some pointers. Thank you.