2016-12-05

Mean squared error (MSE) returns a huge number

I am new to Scala and Spark. I am using the following regression code (based on the example on the Spark official site):

import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.regression.LinearRegressionModel 
import org.apache.spark.mllib.regression.LinearRegressionWithSGD 
import org.apache.spark.mllib.linalg.Vectors 

// Load and parse the data 
val data = sc.textFile("Year100") 
val parsedData = data.map { line => 
    val parts = line.split(',') 
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 
}.cache() 

// Building the model 
val numIterations = 100 
val stepSize = 0.00000001 
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error 
val valuesAndPreds = parsedData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE) 

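The dataset I am using can be seen here: Pastebin link

So my question is: why is the MSE equal to 889717.74 (which is a huge number)?

Edit: as the commenters suggested, I tried the following:

1) I changed the step size to the default, and the MSE is now returned as NaN.

2) If I try this call: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept = True), spark-shell returns an error (error: not found: value True).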

Possible duplicate of [pyspark Linear Regression Example from official documentation - Bad results?](http://stackoverflow.com/questions/33842982/pyspark-linear-regression-example-from-official-documentation-bad-results) – 2016-12-05 22:24:25

Answer

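You have passed a tiny step size and limited the number of iterations to 100. Leaving the gradient magnitude aside, the most your parameters can change is on the order of 0.00000001 * 100 = 0.000001. Try using the default step size; I guess that will fix it.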
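The arithmetic above can be illustrated with a minimal sketch in plain Scala (no Spark). It assumes a toy one-feature dataset and a hand-written batch gradient descent loop with a constant step size; the names `StepSizeDemo`, `fit`, and `mse` are hypothetical, and this is not how MLlib implements SGD internally. The point is only to show how the step size and the iteration count together limit how far the weight can travel.

```scala
object StepSizeDemo {
  // Toy data, assumed for illustration only: label = 3 * x for x in 1..10
  val xs: Vector[Double] = (1 to 10).map(_.toDouble).toVector
  val ys: Vector[Double] = xs.map(_ * 3.0)

  // Mean squared error of the one-weight model y = w * x
  def mse(w: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - w * x, 2) }.sum / xs.size

  // Batch gradient descent with a constant step size, starting from w = 0
  def fit(stepSize: Double, numIterations: Int): Double =
    (1 to numIterations).foldLeft(0.0) { (w, _) =>
      // Gradient of the MSE with respect to w
      val grad = xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.size
      w - stepSize * grad
    }

  def main(args: Array[String]): Unit = {
    // Tiny step, moderate step, and a step that is too large for this data
    for (step <- Seq(1e-8, 0.01, 1.0)) {
      val w = fit(step, 100)
      println(s"stepSize=$step: w=$w, MSE=${mse(w)}")
    }
  }
}
```

In this sketch the tiny step leaves the weight almost at its starting value of zero, so the error stays close to what an untrained model gives, while the moderate step recovers a weight very close to 3. The largest step makes the iterates blow up instead of converging, which is one plausible reason the default step size produced NaN on unscaled features in the edit to the question.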