2017-06-13

How to convert a list of DTOs into the Spark ML input Dataset format using Java (Spark MLlib classification input format)

My DTO:

public class MachineLearningDTO implements Serializable { 
    private double label; 
    private double[] features; 

    public MachineLearningDTO() { 
    } 

    public MachineLearningDTO(double label, double[] features) { 
     this.label = label; 
     this.features = features; 
    } 

    public double getLabel() { 
     return label; 
    } 

    public void setLabel(double label) { 
     this.label = label; 
    } 

    public double[] getFeatures() { 
     return features; 
    } 

    public void setFeatures(double[] features) { 
     this.features = features; 
    } 
} 

And the code:

Dataset<MachineLearningDTO> mlInputDataSet = spark.createDataset(mlInputData, Encoders.bean(MachineLearningDTO.class)); 
LogisticRegression logisticRegression = new LogisticRegression(); 
LogisticRegressionModel model = logisticRegression.fit(MLUtils.convertMatrixColumnsToML(mlInputDataSet)); 

After running this code I get:

java.lang.IllegalArgumentException: requirement failed: Column features must be of type [email protected] but was actually ArrayType(DoubleType,false).

If I change the code to use org.apache.spark.ml.linalg.VectorUDT:

VectorUDT vectorUDT = new VectorUDT(); 
vectorUDT.serialize(Vectors.dense(......)); 

then I get:

java.lang.UnsupportedOperationException: Cannot infer type for class org.apache.spark.ml.linalg.VectorUDT because it is not bean-compliant

at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)

Answer

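I figured it out. In case anyone else is still stuck on this: I wrote a simple converter that builds the rows by hand with an explicit schema, and it works: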

// Vectors and VectorUDT here are the org.apache.spark.ml.linalg classes.
private Dataset<Row> convertToMlInputFormat(List<MachineLearningDTO> data) {
    // Wrap each DTO's double[] in a dense ML vector; the label is already a double.
    List<Row> rowData = data.stream()
      .map(dto -> RowFactory.create(dto.getLabel(), Vectors.dense(dto.getFeatures())))
      .collect(Collectors.toList());
    // Declare "features" as the ML vector type instead of ArrayType(DoubleType).
    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("features", new VectorUDT(), false, Metadata.empty())
    });

    return spark.createDataFrame(rowData, schema);
}
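
One thing the converter does not check: as far as I know, Spark ML estimators expect every feature vector in the dataset to have the same dimension, and a null array would fail inside the stream. Below is a minimal sketch of a guard you could run on the raw arrays before converting. It is plain Java with no Spark dependency, and `FeatureArrayCheck` and `commonDimension` are hypothetical names I made up for illustration, not part of Spark.

```java
import java.util.List;

/** Hypothetical helper (not part of Spark): sanity-checks raw feature arrays. */
final class FeatureArrayCheck {

    private FeatureArrayCheck() {
        // static helper only
    }

    /**
     * Returns the length shared by all feature arrays.
     * Throws IllegalArgumentException if the list is null or empty,
     * if any array is null, or if the arrays differ in length.
     */
    static int commonDimension(List<double[]> featureArrays) {
        if (featureArrays == null || featureArrays.isEmpty()) {
            throw new IllegalArgumentException("no feature arrays given");
        }
        int expected = -1;
        for (int i = 0; i < featureArrays.size(); i++) {
            double[] features = featureArrays.get(i);
            if (features == null) {
                throw new IllegalArgumentException("row " + i + ": features is null");
            }
            if (expected < 0) {
                // The first row fixes the expected dimension.
                expected = features.length;
            } else if (features.length != expected) {
                throw new IllegalArgumentException(
                        "row " + i + ": expected " + expected
                                + " features but found " + features.length);
            }
        }
        return expected;
    }
}
```

With the DTO from the question, this could be called as `FeatureArrayCheck.commonDimension(data.stream().map(MachineLearningDTO::getFeatures).collect(Collectors.toList()))` before building the rows, so a malformed input list fails early with a row index instead of somewhere inside `fit`.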