K-means clustering using Mahout

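I am trying to run the k-means algorithm on my data. One of the options that must be passed at run time requires a path to the initial clusters. Can anyone tell me how we can have initial clusters before the algorithm has even started?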

bin/mahout kmeans \
    -i <input vectors directory> \
    -c <input clusters directory> \
    -o <output working directory> \
    -k <optional number of initial clusters to sample from input vectors> \
    -dm <DistanceMeasure> \
    -x <maximum number of iterations> \
    -cd <optional convergence delta. Default is 0.5> \
    -ow <overwrite output directory if present> \
    -cl <run input vector clustering after computing Canopies> \
    -xm <execution method: sequential or mapreduce>

Source

2014-12-08 user3527975

[Here](http://unmeshasreeveni.blogspot.in/2014/11/how-to-run-k-means-clustering-in-mahout.html) is an example of running it on the synthetic control data. – 2014-12-08 03:38:20

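A) Mahout is slooooow. If your data fits into main memory, use other tools such as ELKI. They outperform Mahout by far. If your data does not fit into main memory: are you sure k-means makes any sense on your data? There is no point in doing a computation that does not solve your problem. Start with a sample, check first whether the result is usable, then scale up. Mahout is the last resort: if you absolutely need to compute this on all of your data, and everything else has failed, then use Mahout.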

B) Read all of the documentation... The next line in the Mahout k-means documentation says:

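Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.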

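In other words: if you know the initial cluster centers, supply them via -c and do not set -k. Otherwise an empty -c folder is fine, as long as you supply -k, the number of cluster centers to sample.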
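To make the difference concrete, here is a minimal Python sketch of the two ways to obtain initial centers. It is not Mahout's code: the function names (`initial_centers`, `kmeans`), the toy data, and the exact convergence test are assumptions made for illustration. Passing `k` stands in for `-k` (random input vectors become the centers), while passing `supplied` stands in for centers you already placed in the `-c` directory.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def initial_centers(vectors, k=None, supplied=None, seed=0):
    """Illustrative stand-in for the -k / -c choice (hypothetical helper)."""
    if k is not None:
        # k given: ignore any supplied centers and sample k input vectors.
        return random.Random(seed).sample(vectors, k)
    if not supplied:
        raise ValueError("need either k or a non-empty list of supplied centers")
    return [list(c) for c in supplied]

def kmeans(vectors, centers, max_iterations=10, convergence_delta=0.5):
    """Plain Lloyd iterations; stops once no center moves more than the delta."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        # Assign each vector to its nearest center.
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(v, centers[i]))
            groups[nearest].append(v)
        # Recompute each center as the mean of its group; an empty group keeps its center.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(center)
        moved = max(squared_distance(a, b) ** 0.5
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return centers

# Toy data: two well-separated groups of points.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]

sampled = initial_centers(vectors, k=2)                                 # like passing -k 2
known = initial_centers(vectors, supplied=[[0.0, 0.0], [10.0, 10.0]])   # like filling -c yourself
final = kmeans(vectors, known)
```

Either way the algorithm starts from some set of centers; the only question is whether you chose them or let them be sampled from the input.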

Source

2014-12-08 13:14:01
