2016-09-29 105 views

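R data frame - one-hot encoding of a column containing multiple terms

I have a data frame with a column that holds multiple comma-separated values: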

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), 
         Info = c("good, bad, sad", "nice, happy, joy", "NULL", "okay, nice, fun, wild, go"), 
         Target = c("Boy", "Girl", "Boy", "Boy")), 
        .Names = c("Age", "Info", "Target"), 
        row.names = c(NA, 4L), 
        class = "data.frame") 

> mydf
  Age                      Info Target
1  99            good, bad, sad    Boy
2  10          nice, happy, joy   Girl
3  40                      NULL    Boy
4  15 okay, nice, fun, wild, go    Boy

I want to split the Info column into one-hot encoded columns and append the result after the Target column, for example:

  Age                      Info Target good bad sad nice ... NULL ...
1  99            good, bad, sad    Boy    1   1   1    0 ...    0
2  10          nice, happy, joy   Girl    0   0   0    1 ...    0
3  40                      NULL    Boy    0   0   0    0 ...    1
4  15 okay, nice, fun, wild, go    Boy    0   0   0    1 ...    0

In Python I can do something like the following to get a dictionary, and then use it to assign the columns.

In [1]: import itertools 

In [2]: values = ["good, bad, sad", "nice, happy, joy", "NULL", "okay, nice, fun, wild, go"] 

In [3]: terms = list(itertools.chain(*[v.split(", ") for v in values])) 

In [4]: dictionary = {v:k for k,v in enumerate(terms)} 

In [6]: dictionary 
Out[6]: 
{'NULL': 6, 'bad': 1, 
'fun': 9, 'go': 11, 'good': 0, 'happy': 4, 
'joy': 5, 'nice': 8, 'okay': 7, 'sad': 2, 'wild': 10} 
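
As a rough sketch of that second step (it is not part of the session above), the dictionary could be used to fill 0/1 columns roughly like this. The names `rows`, `vocabulary`, and `one_hot` are made up for illustration. Unlike the dictionary above, this version removes duplicate terms first, which is an assumption on my part so that the column indices have no gaps.

```python
values = ["good, bad, sad", "nice, happy, joy", "NULL", "okay, nice, fun, wild, go"]

# Split each row into its individual terms.
rows = [v.split(", ") for v in values]

# Map each distinct term to a column index, in order of first appearance.
# dict.fromkeys keeps insertion order and drops the repeated "nice".
distinct_terms = dict.fromkeys(term for row in rows for term in row)
vocabulary = {term: index for index, term in enumerate(distinct_terms)}

# Build one 0/1 vector per row, with a 1 in the column of every term present.
one_hot = []
for row in rows:
    vector = [0] * len(vocabulary)
    for term in row:
        vector[vocabulary[term]] = 1
    one_hot.append(vector)

for value, vector in zip(values, one_hot):
    print(value, vector)
```

Each inner list in `one_hot` corresponds to one row of the data frame, and `vocabulary` gives the column position of each term.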

So far, in R I can do:

> lapply(mydf["Info"], function(x) { strsplit(x, ", ") }) 
$Info 
$Info[[1]] 
[1] "good" "bad" "sad" 

$Info[[2]] 
[1] "nice" "happy" "joy" 

$Info[[3]] 
[1] "NULL" 

$Info[[4]] 
[1] "okay" "nice" "fun" "wild" "go" 

I don't see how to turn this into a dictionary in R and use it to convert the column into one-hot encoded columns.

How can I solve this?

Answer


One option is mtabulate from qdapTools, applied after splitting the 'Info' column:

library(qdapTools)
# split Info on ", ", tabulate the terms per row, and bind the counts to mydf
cbind(mydf, mtabulate(strsplit(mydf$Info, ", ")))
#  Age                      Info Target bad fun go good happy joy nice NULL okay sad wild
#1  99            good, bad, sad    Boy   1   0  0    1     0   0    0    0    0   1    0
#2  10          nice, happy, joy   Girl   0   0  0    0     1   1    1    0    0   0    0
#3  40                      NULL    Boy   0   0  0    0     0   0    0    1    0   0    0
#4  15 okay, nice, fun, wild, go    Boy   0   1  1    0     0   0    1    0    1   0    1