Mean Squared Error (MSE) returns a huge number

I am new to Scala and Spark. I am using the regression code below (based on this link on the Spark official site):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)

I am using the dataset that can be seen here: Pastebin link.

So my question is: why is the MSE equal to 889717.74 (which is a huge number)?

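Edit: As the commenters suggested, I tried the following:

1) I changed the step size to the default, and the MSE is now returned as NaN.

2) If I try this call: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept = True), spark-shell returns an error (error: not found: value True).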

Source

2016-12-05 Ioannis Apomachos

Possible duplicate of [pyspark Linear Regression Example from official documentation - Bad results?](http://stackoverflow.com/questions/33842982/pyspark-linear-regression-example-from-official-documentation-bad-results) – 2016-12-05 22:24:25

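You have passed a tiny step size and capped the number of iterations at 100. The most your parameters can change by is about 0.00000001 * 100 = 0.000001. Try using the default step size; I think that will fix it.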

Source

2016-12-05 22:41:04 Tim
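
As a rough illustration of the step-size argument above, here is a minimal plain-Scala sketch (no Spark) of gradient descent on a made-up one-feature dataset. The object name `StepSizeSketch`, the toy data, and the comparison step sizes 0.01 and 1.0 are assumptions chosen for illustration; none of them come from the question. Each update is the step size multiplied by the gradient, so the 0.000001 figure is only a back-of-the-envelope estimate. As far as I know, MLlib's optimizer also shrinks the step as iterations proceed, which this sketch ignores.

```scala
object StepSizeSketch {
  // Toy data (hypothetical): y = 3 * x, one feature, no intercept.
  val xs: Vector[Double] = (1 to 10).map(_.toDouble).toVector
  val ys: Vector[Double] = xs.map(_ * 3.0)

  // Mean squared error of the one-weight model y ≈ w * x.
  def mse(w: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - w * x, 2) }.sum / xs.size

  // Plain gradient descent with a constant step size, starting from w = 0.
  def train(numIterations: Int, stepSize: Double): Double =
    (1 to numIterations).foldLeft(0.0) { (w, _) =>
      // Gradient of the MSE with respect to w, averaged over all points.
      val gradient =
        xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.size
      w - stepSize * gradient
    }

  def main(args: Array[String]): Unit = {
    // Same iteration count as in the question, three different step sizes.
    for (stepSize <- Seq(0.00000001, 0.01, 1.0)) {
      val w = train(100, stepSize)
      println(s"stepSize = $stepSize: w = $w, MSE = ${mse(w)}")
    }
  }
}
```

On this toy data the tiny step leaves the weight almost at 0, so the MSE stays close to its starting value. A step of 0.01 converges, while 1.0 overshoots on every iteration until the error overflows. Which step size is workable depends on the scale of the features, so a step that works on one dataset can diverge on another with larger feature values.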
