Empty coefficients in Spark logistic regression

I want to apply some machine learning algorithms to a dataset in Spark (Java). When I tried the Spark logistic regression example, the coefficient matrix looked like this:

3 x 4 CSCMatrix
(1,2) -0.7889290490451877
(0,3) 0.2989598305580243
(1,3) -0.36583869680195286
Intercept: [0.07898530675801645,-0.14799468898820128,0.06900938223018485]

If I'm not mistaken, the entries (1,2) -0.7889290490451877, (0,3) 0.2989598305580243 and (1,3) -0.36583869680195286 represent the "best fit" model for each class.

Now, when I try my own dataset, which has 4 different classes and 8192 features, the coefficients are:

4 x 8192 CSCMatrix
Intercept: [1.3629726436521425,0.7373644161565249,-1.0762606057817274,-1.0240764540269398]

I'm not familiar with the logistic regression algorithm, so I don't understand why there is no "best fit" here.
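One way to read these printouts, assuming the listing shows only the stored (nonzero) entries as (row, column) value, with rows as classes and columns as features: anything not listed is zero, so a matrix printed with no entries at all has every coefficient equal to zero. The sketch below uses a hypothetical helper, `CscSketch.toDense`, which is not part of Spark, to expand compressed-sparse-column arrays into a dense matrix.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark.
class CscSketch {
    /** Expands compressed-sparse-column (CSC) arrays into a dense row-major matrix. */
    static double[][] toDense(int numRows, int numCols,
                              int[] colPtrs, int[] rowIndices, double[] values) {
        double[][] dense = new double[numRows][numCols]; // Java initializes to 0.0
        for (int col = 0; col < numCols; col++) {
            // Stored entries of this column sit at positions [colPtrs[col], colPtrs[col + 1]).
            for (int k = colPtrs[col]; k < colPtrs[col + 1]; k++) {
                dense[rowIndices[k]][col] = values[k];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        // The 3 x 4 example from the question: entries (1,2), (0,3) and (1,3).
        int[] colPtrs = {0, 0, 0, 1, 3};
        int[] rowIndices = {1, 0, 1};
        double[] values = {-0.7889290490451877, 0.2989598305580243, -0.36583869680195286};
        for (double[] row : toDense(3, 4, colPtrs, rowIndices, values)) {
            System.out.println(Arrays.toString(row));
        }

        // A 4 x 8192 matrix with no stored entries: every coefficient is zero.
        double[][] empty = toDense(4, 8192, new int[8193], new int[0], new double[0]);
        boolean allZero = Arrays.stream(empty)
                .allMatch(row -> Arrays.stream(row).allMatch(v -> v == 0.0));
        System.out.println("all coefficients zero: " + allZero);
    }
}
```

Under that reading, the 4 x 8192 result is not a missing fit: the model was fitted, but every feature weight ended up at zero and only the intercepts remain.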

My code

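import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hash the token lists into 8192-dimensional term-frequency vectors.
HashingTF hashingTF = new HashingTF()
    .setInputCol("listT")
    .setOutputCol("rawFeatures")
    .setNumFeatures(8192);
Dataset<Row> featurizedData = hashingTF.transform(ReviewRawData);
featurizedData.show();

// Rescale the term frequencies with IDF.
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);
Dataset<Row> rescaledData = idfModel.transform(featurizedData);

// Add the label column based on some conditions.
Dataset<Row> lebeldata = rescaledData.withColumn("label", newCol);
lebeldata.groupBy("label").count().show();

// 70/30 train/test split.
Dataset<Row>[] splits = lebeldata.select("label", "features").randomSplit(new double[]{0.7, 0.3});
Dataset<Row> train = splits[0];
Dataset<Row> test = splits[1];

LogisticRegression lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.3)          // overall regularization strength
    .setElasticNetParam(0.8)   // L1/L2 mix: 0 = pure L2, 1 = pure L1
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setFamily("multinomial");

LogisticRegressionModel lrModel = lr.fit(train);
System.out.println("Coefficients: \n"
    + lrModel.coefficientMatrix() + " \nIntercept: "
    + lrModel.interceptVector());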

My dataset

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 6455|
|  1.0| 3360|
|  3.0|  599|
|  2.0|  560|
+-----+-----+

And when I evaluate the classifier, only the first class is ever predicted.

Class 0.000000 precision = 0.599511 
Class 0.000000 recall = 1.000000 
Class 0.000000 F1 score = 0.749618 
Class 1.000000 precision = 0.000000 
Class 1.000000 recall = 0.000000 
Class 1.000000 F1 score = 0.000000 
Class 2.000000 precision = 0.000000 
Class 2.000000 recall = 0.000000 
Class 2.000000 F1 score = 0.000000 
Class 3.000000 precision = 0.000000 
Class 3.000000 recall = 0.000000 
Class 3.000000 F1 score = 0.000000
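These metrics are what one would expect if every coefficient is zero. A minimal sketch, assuming the usual multinomial (softmax) form where each class score is coefficients · features + intercept: with all-zero coefficients the scores equal the intercepts for every input, so the class with the largest intercept (class 0 here) always wins. The class name, the helper methods and the feature vectors below are made up for illustration, and the matrix is shrunk to 8 columns to keep the example small.

```java
// Hypothetical illustration class; not part of Spark.
class ZeroCoefficientSoftmax {
    /** Softmax probabilities for margins = coefficients * features + intercepts. */
    static double[] probabilities(double[][] coefficients, double[] intercepts, double[] features) {
        int numClasses = intercepts.length;
        double[] margins = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            double margin = intercepts[c];
            for (int j = 0; j < features.length; j++) {
                margin += coefficients[c][j] * features[j];
            }
            margins[c] = margin;
            max = Math.max(max, margin);
        }
        double[] probs = new double[numClasses];
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            probs[c] = Math.exp(margins[c] - max); // subtract max for numerical stability
            sum += probs[c];
        }
        for (int c = 0; c < numClasses; c++) {
            probs[c] /= sum;
        }
        return probs;
    }

    /** Index of the largest value, i.e. the predicted class. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Intercepts reported in the question; coefficients are all zero.
        double[] intercepts = {1.3629726436521425, 0.7373644161565249,
                               -1.0762606057817274, -1.0240764540269398};
        double[][] coefficients = new double[4][8]; // 8 columns instead of 8192, for brevity

        // Two arbitrary, made-up feature vectors.
        double[] a = {0, 3.2, 0, 0, 1.1, 0, 0, 0};
        double[] b = {5.0, 0, 0, 2.4, 0, 0, 7.7, 0};
        System.out.println("prediction for a: " + argMax(probabilities(coefficients, intercepts, a)));
        System.out.println("prediction for b: " + argMax(probabilities(coefficients, intercepts, b)));
    }
}
```

Both inputs get the same prediction because the features never enter the score, which matches a recall of 1.0 for class 0 and 0.0 for the other classes.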

By the way, I applied another machine learning algorithm in Spark to the same dataset, following the same steps as above, and it works fine!

Source

2017-09-09 Mahmoud Murad

I had a similar problem with LogisticRegression from spark.ml in Spark 2.1.1, and removing .setElasticNetParam(0.8) worked for me.

Another possibility is that your dataset contains high-leverage points (outliers in the feature range), which skew the predictions.

Source

2017-09-09 16:16:04 Chang

Thanks. Can you explain what this parameter does, and what happens when we remove it? Thanks again.

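My guess is that setElasticNetParam(0.8) forces logistic regression to find a balance between the L1 and L2 penalties, and in most cases the L1 penalty pushes the regression coefficients to 0 and breaks the classifier. – Chang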
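To make that guess concrete, here is a minimal sketch of how an elastic-net penalty can zero out a coefficient. It assumes the penalty form given in the Spark documentation, regParam * (elasticNetParam * |w| + (1 - elasticNetParam) / 2 * w^2), and applies it to a toy one-dimensional problem, minimizing 0.5 * (w - z)^2 plus the penalty, where z stands for the value the coefficient would take without any penalty. This is not how Spark optimizes the model; the class name and the values of z are made up for illustration.

```java
// Hypothetical illustration class; not part of Spark and not Spark's optimizer.
class ElasticNetSketch {
    /** L1 weight under the assumed penalty form: regParam * elasticNetParam. */
    static double l1Weight(double regParam, double elasticNetParam) {
        return regParam * elasticNetParam;
    }

    /** L2 weight under the assumed penalty form: regParam * (1 - elasticNetParam). */
    static double l2Weight(double regParam, double elasticNetParam) {
        return regParam * (1.0 - elasticNetParam);
    }

    /**
     * Minimizer of 0.5 * (w - z)^2 + l1 * |w| + 0.5 * l2 * w^2.
     * The L1 term soft-thresholds z: anything with |z| <= l1 becomes exactly 0.
     * The L2 term only scales the result and never zeroes it.
     */
    static double shrink(double z, double l1, double l2) {
        double magnitude = Math.max(Math.abs(z) - l1, 0.0);
        return Math.signum(z) * magnitude / (1.0 + l2);
    }

    public static void main(String[] args) {
        double regParam = 0.3;            // as in the question
        double[] unpenalized = {0.05, 0.2, 1.0}; // made-up values of z

        for (double elasticNetParam : new double[]{0.8, 0.0}) {
            double l1 = l1Weight(regParam, elasticNetParam);
            double l2 = l2Weight(regParam, elasticNetParam);
            System.out.println("elasticNetParam = " + elasticNetParam);
            for (double z : unpenalized) {
                System.out.println("  z = " + z + " -> w = " + shrink(z, l1, l2));
            }
        }
    }
}
```

In this toy setting, regParam = 0.3 with elasticNetParam = 0.8 gives an L1 weight of 0.24, so any coefficient whose unpenalized value is smaller than that in magnitude is set exactly to zero. With the parameter left out (assuming its default of 0.0, which is pure L2), the same coefficients are only shrunk, which is consistent with removing .setElasticNetParam(0.8) helping.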
