2017-07-08 400 views

Pyspark ML - How to save a Pipeline and a RandomForestClassificationModel

I am unable to save a random forest model trained with the ml package in Python/Spark.

>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features") 
>>> pipeline = Pipeline(stages=early_stages + [rf]) 
>>> model = pipeline.fit(trainingData) 
>>> model.save("fittedpipeline") 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'

>>> rfModel = model.stages[8] 
>>> print(rfModel) 

RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees

>>> rfModel.save("rfmodel")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'

I also tried passing 'sc' as the first argument to the save method.

Which version of Spark are you using? – eliasah

I am using Spark 1.6.0. Unfortunately, for various reasons I cannot upgrade to a later version. Is there a workaround for saving the model in 1.6.0? –

There is nothing out of the box for this in pyspark < 2.0.0. – eliasah

Answer

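The main problem with your code is, I believe, that you are using a version of Apache Spark prior to 2.0.0. In those versions, save is not yet available for the Pipeline API in PySpark, which is why both PipelineModel and RandomForestClassificationModel raise an AttributeError.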
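If you want to guard against this in a script, a small version check is one option. The helper below is only a sketch: the name supports_pipeline_save is made up for illustration, and the 2.0.0 cutoff is simply the one stated above.

```python
def supports_pipeline_save(version):
    """Return True if a Spark version string such as "1.6.0" is 2.0.0 or later."""
    # Only the major component matters for the 2.0.0 cutoff.
    major = int(version.split(".")[0])
    return major >= 2


# In the pyspark shell you could pass the running version, e.g.:
#     supports_pipeline_save(sc.version)
print(supports_pipeline_save("1.6.0"))
print(supports_pipeline_save("2.0.0"))
```

For the version in the question (1.6.0) this returns False, which matches the AttributeError you are seeing.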

Here is a complete example on Spark 2.0+, adapted from the official documentation. Let's first create our pipeline:

from pyspark.ml import Pipeline 
from pyspark.ml.classification import RandomForestClassifier 
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer 

# Load and parse the data file, converting it to a DataFrame.
# `spark` is the SparkSession that the pyspark shell (2.0+) creates for you.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column. 
# Fit on whole dataset to include all labels in index. 
# Keeping the fitted indexer in the pipeline keeps `labels` in step with the trained indices.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
labels = labelIndexer.labels

# Automatically identify categorical features, and index them. 
# Set maxCategories so features with > 4 distinct values are treated as continuous. 
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4) 

early_stages = [labelIndexer, featureIndexer] 

# Split the data into training and test sets (30% held out for testing) 
(trainingData, testData) = data.randomSplit([0.7, 0.3]) 

# Train a RandomForest model. 
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10) 

# Convert indexed labels back to original labels. 
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels) 

# Chain indexers and forest in a Pipeline 
pipeline = Pipeline(stages=early_stages + [rf, labelConverter])

# Train model. This also runs the indexers. 
model = pipeline.fit(trainingData) 

Now you can save your fitted pipeline:

>>> model.save("/tmp/rf") 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
SLF4J: Defaulting to no-operation (NOP) logger implementation 
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 

You can also save the RF model on its own:

>>> rf_model = model.stages[2] 
>>> print(rf_model) 
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees 
>>> rf_model.save("/tmp/rf_2") 
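
To read both back later, each model class has a matching load method in Spark 2.0+. The helper below is a sketch rather than part of the original example: the function name load_saved_models is hypothetical, the default paths are the ones used in the save calls above, and it assumes a Spark session is already running.

```python
def load_saved_models(pipeline_path="/tmp/rf", rf_path="/tmp/rf_2"):
    """Reload the fitted pipeline and the standalone forest from disk."""
    # Imported inside the function so that defining the helper
    # does not itself require Spark to be importable.
    from pyspark.ml import PipelineModel
    from pyspark.ml.classification import RandomForestClassificationModel

    pipeline_model = PipelineModel.load(pipeline_path)
    forest_model = RandomForestClassificationModel.load(rf_path)
    return pipeline_model, forest_model
```

In the shell you would call same_model, same_rf = load_saved_models() and then, for instance, same_model.transform(testData) to score the held-out data with the reloaded pipeline.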