Spark ML Pipeline with Random Forest takes too long on a 20MB dataset

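I am using Spark ML to run some ML experiments. On a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. The equivalent run with scikit-learn takes much, much less.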

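In terms of environment, I was testing with 2 worker nodes, each with 15GB of memory and 24 cores. I assume it should not take that long, and I am wondering whether the problem lies in my code, since I am fairly new to Spark.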

Here it is:

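import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# The UCI file has no header row, so column names must be supplied. These names
# are an assumption; the code below only requires the class column to be 'label'.
columns = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'label']
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data", header=None, names=columns)
dataframe = sqlContext.createDataFrame(df)  # sqlContext as provided by the PySpark shell

train, test = dataframe.randomSplit([0.7, 0.3])

# Separate string (categorical) columns from numeric ones, leaving out the label
categoricalCols = []
numericCols = []
for name, dtype in dataframe.dtypes:
    if name == 'label':
        continue
    if dtype == 'string':
        categoricalCols.append(name)
    else:
        numericCols.append(name)

stages = []

# One StringIndexer per categorical column
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# A list comprehension instead of map(...) + list, which fails on Python 3
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel', handleInvalid='skip')
stages += [labelIndexer]

estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [estimator]

parameters = {"maxDepth": [3, 5, 10, 15], "maxBins": [6, 12, 24, 32], "numTrees": [3, 5, 10]}

paramGrid = ParamGridBuilder()
for key, value in parameters.items():  # items() works on Python 2 and 3
    paramGrid.addGrid(estimator.getParam(key), value)
estimatorParamMaps = paramGrid.build()

pipeline = Pipeline(stages=stages)

crossValidator = CrossValidator(estimator=pipeline, estimatorParamMaps=estimatorParamMaps, evaluator=MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1'), numFolds=3)

pipelineModel = crossValidator.fit(train)  # a CrossValidatorModel wrapping the best pipeline

predictions = pipelineModel.transform(test)

# Pipeline has no getEvaluator(); the evaluator belongs to the CrossValidator
f1 = crossValidator.getEvaluator().evaluate(predictions)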

Thanks in advance, any comments/suggestions are highly appreciated :)

Source

2017-07-02 Larissa Leite

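Cross-validation is a heavy and long task, as its cost is proportional to the number of combinations of your 3 hyperparameters times the time spent training each model. You might want to cache your data as a start, but that still won't save you much time. I believe Spark is overkill for this amount of data. You might want to use scikit-learn instead, and maybe use https://github.com/databricks/spark-sklearn for distributed training of local models – eliasah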

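Hi @eliasah, thanks for your comment. In fact, I am already doing that with spark-sklearn and getting good results. However, I just wanted to compare the execution time of sklearn and Spark, and the numbers look very strange to me: one takes a few seconds while the other takes hours –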

That is because Spark learns each model separately and sequentially, on the assumption that the data is distributed and big. – eliasah

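The following might not solve your problem completely, but it should give you some pointers to start with.

The first problem you are facing is the mismatch between the amount of data and the resources.

This means that, since you are parallelizing a local collection (a pandas DataFrame), Spark will use the default parallelism configuration, which is most likely 48 partitions of less than 0.5 MB each.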
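As a rough check of those figures, the partition arithmetic can be sketched in plain Python. It assumes that the default parallelism equals the total number of worker cores (2 workers with 24 cores each, as in the question); the real value depends on the cluster manager and on `spark.default.parallelism`. The function name `partition_stats` is hypothetical.

```python
def partition_stats(dataset_mb, workers, cores_per_worker):
    """Return (partition count, average MB per partition).

    Assumption: Spark's default parallelism equals the total core count.
    """
    partitions = workers * cores_per_worker
    return partitions, dataset_mb / float(partitions)


partitions, mb_each = partition_stats(dataset_mb=20, workers=2, cores_per_worker=24)
print(partitions, round(mb_each, 2))
```

On a live session, `dataframe.rdd.getNumPartitions()` reports the actual count, and `dataframe.repartition(n)` with a small `n` is one way to try fewer, larger partitions.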

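The second problem is related to the costly optimization/approximation techniques that tree models use in Spark (Spark does not do well with small files or with small partitions).

Spark tree models use some tricks to bucket continuous variables efficiently. With small data it is much cheaper to just compute the exact splits, but Spark mainly uses approximate quantiles in this case.

Usually, in a single-machine framework such as scikit-learn, the tree model uses the unique values of each continuous feature as split candidates when searching for the best fit. In Apache Spark, the tree model uses quantiles of each feature as split candidates.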
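The difference between the two strategies can be illustrated with a small pure-Python sketch. This is not the actual implementation of scikit-learn or Spark; the helper names `exact_candidates` and `quantile_candidates` are hypothetical, and the quantile rule here is a simplified stand-in for Spark's approximate, sample-based one.

```python
def exact_candidates(values):
    """One candidate threshold between each pair of adjacent unique values."""
    uniq = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(uniq, uniq[1:])]


def quantile_candidates(values, max_bins):
    """At most max_bins - 1 thresholds, taken at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    picks = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(picks)


feature = list(range(1, 101))  # 100 distinct values of a continuous feature
print(len(exact_candidates(feature)))           # one candidate per gap
print(quantile_candidates(feature, max_bins=4))  # a handful of thresholds
```

With 100 distinct values, the exact strategy evaluates 99 candidates while the quantile strategy with 4 bins evaluates only 3. The quantile approach scales to large distributed data, but computing the quantiles is overhead that a 20MB dataset does not need.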

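Also, don't forget that cross-validation is a heavy and long task, as its cost is proportional to the number of combinations of your 3 hyperparameters, times the number of folds, times the time spent training each model (the grid search approach). You might want to cache your data as a start, but that still won't save you much time. I believe Spark is overkill for this amount of data. You might want to use scikit-learn instead, and maybe use spark-sklearn for distributed training of local models.

Spark will learn each model separately and sequentially, on the assumption that the data is distributed and big.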
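To make the cost of the grid concrete, the number of fits can be counted directly from the grid in the question. The per-fit time below is only a rough average: it assumes the whole 1 hour 30 minutes was spent on these fits, run one after another.

```python
from itertools import product

# Grid and fold count copied from the question
parameters = {"maxDepth": [3, 5, 10, 15], "maxBins": [6, 12, 24, 32], "numTrees": [3, 5, 10]}
num_folds = 3

combinations = list(product(*parameters.values()))
cv_fits = len(combinations) * num_folds

# Rough average only: assumes all 90 minutes went into sequential fits
seconds_per_fit = 90 * 60 / float(cv_fits)

print(len(combinations), cv_fits, seconds_per_fit)
```

That is 48 parameter combinations and 144 pipeline fits, not counting the final refit of the best model on the full training set, or about 37.5 seconds per fit under the assumption above.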

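You can, of course, optimize performance by using a columnar file format such as Parquet, by tuning Spark itself, and so on, but that is too broad to discuss here.

You can read more about the scalability of tree models in spark-mllib in the following blog post: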

Scalable Decision Trees in MLlib

Source

2017-07-04 12:11:29 eliasah
