Pyspark ML - How to save a pipeline and RandomForestClassificationModel

I am unable to save a random forest model generated with the ml package of python/spark.

>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features") 
>>> pipeline = Pipeline(stages=early_stages + [rf]) 
>>> model = pipeline.fit(trainingData) 
>>> model.save("fittedpipeline")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'

>>> rfModel = model.stages[8] 
>>> print(rfModel)

RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees

>>> rfModel.save("rfmodel")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'

I also tried passing 'sc' as the first parameter to the save method.


2017-07-08 Nasir Mahmood

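What version of Spark are you using? – eliasah

I am using Spark 1.6.0. Unfortunately, for some reasons, I cannot upgrade to a higher version. Is there a workaround for saving the model in 1.6.0?

There is nothing out of the box for pyspark < 2.0.0. – eliasah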

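The main issue with your code, I believe, is that you are using a version of Apache Spark prior to 2.0.0. In those versions, save is not yet available for the Pipeline API in Python.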
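To make the version cutoff concrete, here is a minimal sketch of a check you could run before calling save. The helper name supports_pipeline_save is hypothetical (it is not part of PySpark); it only compares a version string against the 2.0.0 threshold mentioned above.

```python
def supports_pipeline_save(version):
    """Return True if a Spark version string is 2.0.0 or later.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Keep only the leading numeric components,
    # e.g. "2.1.0.dev0" -> (2, 1, 0)
    parts = []
    for token in version.split("."):
        if not token.isdigit():
            break
        parts.append(int(token))
    return tuple(parts) >= (2, 0)


print(supports_pipeline_save("1.6.0"))  # False
print(supports_pipeline_save("2.0.0"))  # True
```

In a running session you would typically pass sc.version (the version string exposed by the SparkContext) as the argument.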

Here is a complete example, put together from the official documentation. Let's create our pipeline first:

from pyspark.ml import Pipeline 
from pyspark.ml.classification import RandomForestClassifier 
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer 

# Load and parse the data file, converting it to a DataFrame.
# `spark` is the SparkSession that the pyspark shell (Spark >= 2.0) creates on startup.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column. 
# Fit on whole dataset to include all labels in index. 
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel") 
labels = labelIndexer.fit(data).labels 

# Automatically identify categorical features, and index them. 
# Set maxCategories so features with > 4 distinct values are treated as continuous. 
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4) 

early_stages = [labelIndexer, featureIndexer] 

# Split the data into training and test sets (30% held out for testing) 
(trainingData, testData) = data.randomSplit([0.7, 0.3]) 

# Train a RandomForest model. 
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10) 

# Convert indexed labels back to original labels. 
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels) 

# Chain indexers and forest in a Pipeline 
pipeline = Pipeline(stages=early_stages + [rf, labelConverter])

# Train model. This also runs the indexers. 
model = pipeline.fit(trainingData)

Now you can save your pipeline:

>>> model.save("/tmp/rf") 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
SLF4J: Defaulting to no-operation (NOP) logger implementation 
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

You can also save the RF model on its own:

>>> rf_model = model.stages[2] 
>>> print(rf_model) 
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees 
>>> rf_model.save("/tmp/rf_2")
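For completeness, here is a hedged sketch of reading both artifacts back, assuming Spark >= 2.0 with an active session and the two paths used above. The function name load_saved_models is hypothetical; PipelineModel.load and RandomForestClassificationModel.load are the counterparts of save in the Python API.

```python
def load_saved_models(pipeline_path="/tmp/rf", rf_path="/tmp/rf_2"):
    """Load the fitted pipeline and the standalone RF model from disk.

    Hypothetical helper for illustration; requires Spark >= 2.0 and an
    active SparkSession when it is called.
    """
    # Imported inside the function so the sketch can be defined
    # even on a machine where Spark is not installed.
    from pyspark.ml import PipelineModel
    from pyspark.ml.classification import RandomForestClassificationModel

    pipeline_model = PipelineModel.load(pipeline_path)
    rf_model = RandomForestClassificationModel.load(rf_path)
    return pipeline_model, rf_model
```

A call such as restored, rf_only = load_saved_models() followed by restored.transform(testData) should then reproduce the predictions of the original fitted pipeline. Note that save refuses to write to a path that already exists; model.write().overwrite().save("/tmp/rf") replaces an earlier copy.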


2017-07-08 08:11:28 eliasah
