2017-09-05

I have the following DataFrame, from which I want to extract features with CountVectorizer:

+------------------------------------------------+ 
|filtered          | 
+------------------------------------------------+ 
|[human, interface, computer]     | 
|[survey, user, computer, system, response, time]| 
|[eps, user, interface, system]     | 
|[system, human, system, eps]     | 
|[user, response, time]       | 
|[trees]           | 
|[graph, trees]         | 
|[graph, minors, trees]       | 
|[graph, minors, survey]       | 
+------------------------------------------------+ 

Running CountVectorizer on the column above gives me the following output:

+------------------------------------------------+---------------------------------------------+
|filtered          |features          | 
+------------------------------------------------+---------------------------------------------+ 
|[human, interface, computer]     |(12,[4,7,9],[1.0,1.0,1.0])     | 
|[survey, user, computer, system, response, time]|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0])| 
|[eps, user, interface, system]     |(12,[0,2,4,10],[1.0,1.0,1.0,1.0])   | 
|[system, human, system, eps]     |(12,[0,9,10],[2.0,1.0,1.0])     | 
|[user, response, time]       |(12,[2,8,11],[1.0,1.0,1.0])     | 
|[trees]           |(12,[1],[1.0])        | 
|[graph, trees]         |(12,[1,3],[1.0,1.0])       | 
|[graph, minors, trees]       |(12,[1,3,5],[1.0,1.0,1.0])     | 
|[graph, minors, survey]       |(12,[3,5,6],[1.0,1.0,1.0])     | 
+------------------------------------------------+---------------------------------------------+ 
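
For readers unfamiliar with the notation, each features cell is a sparse vector printed as (size, [indices], [values]). The following is a minimal pure-Python sketch (not Spark code) of how such a triple arises from a token list. The vocab mapping is an assumption inferred from the rows above, and to_sparse is a hypothetical helper used only for illustration.

```python
from collections import Counter

# Assumed vocabulary indices, inferred from the rows shown above
# (only the terms needed for this example are listed).
vocab = {"system": 0, "human": 9, "eps": 10}
VOCAB_SIZE = 12  # the vector size reported in the features column


def to_sparse(tokens, vocab, size):
    """Return (size, indices, values) with term counts as floats."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return size, indices, values


print(to_sparse(["system", "human", "system", "eps"], vocab, VOCAB_SIZE))
# (12, [0, 9, 10], [2.0, 1.0, 1.0])
```

The repeated token "system" is why index 0 carries the value 2.0 in the fourth row of the table.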

Now I want to run a map function over the features column and transform it into something like this:

+------------------------------------------------+--------------------------------------------------------+ 
|features          |transformed            | 
+------------------------------------------------+--------------------------------------------------------+ 
|(12,[4,7,9],[1.0,1.0,1.0])      |["1 4 1", "1 7 1", "1 9 1"]        | 
|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0]) |["2 0 1", "2 2 1", "2 6 1", "2 7 1", "2 8 1", "2 11 1"] | 
|(12,[0,2,4,10],[1.0,1.0,1.0,1.0])    |["3 0 1", "3 2 1", "3 4 1", "3 10 1"]     | 
[TRUNCATED] 

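The transformed column is built by extracting the middle array from features and then creating sub-arrays from it. For example, in row 1 of the features column we have: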

(12,[4,7,9],[1.0,1.0,1.0]) 

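Now take its middle array, [4,7,9], and pair each element with the corresponding frequency in the third array, [1.0,1.0,1.0], prepending "1" because this is row 1, to get the following output: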

["1 4 1", "1 7 1", "1 9 1"] 

In general it looks like this:

["RowNumber MiddleFeatEl CorrespondingFreq", ....] 
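
The format described above can be sketched in plain Python, independent of Spark. The helper name to_triples is hypothetical, and casting the frequency to int is an assumption made to match the "1 4 1" style shown in the desired output.

```python
def to_triples(row_number, indices, values):
    """Build "RowNumber FeatureIndex Frequency" strings for one row."""
    return ["{} {} {}".format(row_number, j, int(v))
            for j, v in zip(indices, values)]


print(to_triples(1, [4, 7, 9], [1.0, 1.0, 1.0]))
# ['1 4 1', '1 7 1', '1 9 1']
```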

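I am not able to extract the middle (index) list and the last (frequency) list separately by applying a map function to the features column generated by CountVectorizer.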

Here is the map code:

def corpus_create(feats): 
    return feats[1] # Here i want to get [4,7,9] instead of 1 single feat score. 

corpus_udf = udf(lambda feats: corpus_create(feats), StringType()) 
df3 = df.withColumn("corpus", corpus_udf("features")) 
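
A likely reason the code above returns a single score is that indexing a vector with feats[1] looks up the value stored at position 1, not the second component of the printed triple. A Spark SparseVector exposes its parts as the attributes size, indices and values. The sketch below uses a stand-in object so that it runs without Spark; the stand-in is purely illustrative and is not the real class.

```python
from types import SimpleNamespace


def corpus_create(feats):
    """Return the index list and the frequency list of a sparse vector."""
    # int()/float() turn NumPy scalars (as held by the real class)
    # into plain Python numbers.
    indices = [int(j) for j in feats.indices]
    values = [float(v) for v in feats.values]
    return indices, values


# Hypothetical stand-in for a SparseVector; only the attribute names
# mirror the real class.
feats = SimpleNamespace(size=12, indices=[4, 7, 9], values=[1.0, 1.0, 1.0])
print(corpus_create(feats))
# ([4, 7, 9], [1.0, 1.0, 1.0])
```

If a function like this is wrapped in a udf, the declared return type would also need to describe a list rather than StringType(), for example an ArrayType.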

Answer

Row numbers are mostly meaningless in Spark SQL, but if you don't mind that:

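def f(x):
    row, i = x  # zipWithIndex yields (row, 0-based index) pairs
    jvs = (
        # SparseVector: pair each stored index with its value
        zip(row.features.indices, row.features.values)
        if hasattr(row.features, "indices")
        # DenseVector: enumerate every position
        else enumerate(row.features.toArray()))

    # i + 1 gives the 1-based row numbers used in the question;
    # int(v) prints counts as "1" rather than "1.0"; zero counts are skipped
    s = ["{} {} {}".format(i + 1, j, int(v))
         for j, v in jvs if v]
    return row + (s,)


df.rdd.zipWithIndex().map(f).toDF(df.columns + ["transformed"])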
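
The function above handles both vector types because filtering out zero values makes the dense path produce the same pairs as the sparse one. A small pure-Python sketch of that equivalence follows; pairs_sparse and pairs_dense are hypothetical names, and plain lists stand in for the Spark vectors.

```python
def pairs_sparse(indices, values):
    """(index, value) pairs as stored by a sparse vector."""
    return list(zip(indices, values))


def pairs_dense(array):
    """(index, value) pairs for a dense array, dropping zero entries."""
    return [(j, v) for j, v in enumerate(array) if v]


# Dense form of (12,[0,9,10],[2.0,1.0,1.0])
dense = [0.0] * 12
dense[0], dense[9], dense[10] = 2.0, 1.0, 1.0

print(pairs_dense(dense))
# [(0, 2.0), (9, 1.0), (10, 1.0)]
print(pairs_dense(dense) == pairs_sparse([0, 9, 10], [2.0, 1.0, 1.0]))
# True
```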