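Generating synthetic data for unsupervised learning

I want to prepare data for unsupervised learning with random forests. The procedure is as follows: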

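Take the data and add the attribute 'class' with value 1 to all instances
Generate synthetic data from the original data:
- while you do not yet have as many examples as in the original data, build an example:
  - sample a new attribute value from all values of that attribute in the original data
  - do this for all attributes and combine the values into a new example
Assign value 2 to the attribute 'class' in the synthetic data
Bind the two data sets together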

In the end, the 'class' attribute looks like this:

 ...  Class 
       |1 
    Original |1 
    Data  |1 
       |1 
    -------------- 
       |2 
    Synthetic |2 
    Data  |2 
       |2

My R code looks like this:

library(gtools) #for smartbind() 

sample1 <- function(X) { sample(X, replace=T) } 
g1  <- function(dat) { apply(dat,2,sample1) } 

data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1 

synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data 
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2 
colnames(synthData) <- colnames(data) 
newData <- smartbind(data, synthData) #bind the data together

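It is probably obvious that I am really new to R, but it works, with one problem: the types of the attributes in the synthetic data differ from those in the original data. If they were numeric originally, they are now factors. How can I keep the same types when generating the synthetic data?

Thanks!

Data 1 (nums become factors):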

structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131), V3 = c(13.21, 12.89, 13.44, 14.15, 13.69), V4 = c(3.48,3.62,0.3.61,0.3.2), V5 = c(1.41, 1.57, 1.54, 2.09, 1.81), V6 = c(72.64, 72.96, 72.39, 72.74, 72.81), V7 = c(0.59, 0.61, 0.66, 0, 1.76), V8 = c(8.43, 8.11, 8.03, 10.88, 5.43), V9 = c(0, 0, 0, 0, 1.19), V10 = c(0, 0, 0, 0), realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass", 183L, 186L), class = "data.frame")

Data 2 (factors become chrs):

structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"), V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", "", "c", "f", "k", "s", "x"), class = "factor"), V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"), V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r"(1L, 1L, 1L, 2L, 1L), ..., .Label = c("f", "t"), class = "factor"), V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"), V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"), V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"), V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n")), V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"), V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"), V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"), V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor" structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p" V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"), V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"), V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"), V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"), 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"), V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"), structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", ""), .Names = c("realClass", "V2", "V3", "V4", "m", "p", "u" V5, V6, V7, V8, V9, V10, V11, "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23"), row.names = c(4105L, 6207L, 6696L, 2736L, 3756L), class = "data.frame")

2012-08-05 Uros K

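Since you don't show your data, it isn't obvious why you get factors in place of numerics, but you can do 'numcol <- as.numeric(as.character(factcol))' – dickoa 2012-08-05 22:28:42

Yes, that works. Is there a more general solution, so that the attributes keep their types after the process, whatever those types are? – 2012-08-05 22:31:21

It is easier to find an answer with a reproducible example. In this case we don't know much about the data ('str(data)' or, better, 'dput(data)' would help). – dickoa 2012-08-05 22:36:50

You can always use this trick to get numeric columns back: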

numcol <- as.numeric(as.character(factcol))
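As a small illustration of why the inner as.character() is needed, the sketch below uses a made-up factor (factcol here is hypothetical data, not taken from the question). Calling as.numeric() directly on a factor returns the internal level codes rather than the values the labels spell out.

```r
# Hypothetical factor column whose labels happen to be numbers
factcol <- factor(c("13.21", "12.89", "13.44"))

# Direct conversion yields the internal level codes, not the values
codes <- as.numeric(factcol)
print(codes)

# Converting to character first recovers the numbers written in the labels
numcol <- as.numeric(as.character(factcol))
print(numcol)
```

The first print shows positions in the sorted level set, the second shows the actual measurements.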

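But I suspect there are factor variables in your data.frame. apply first converts the data frame to a matrix, and a matrix can hold only one type, so a single factor column turns every numeric column into character as well. data.frame() then converts those character columns to factors.

Here is an example using a toy dataset: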

set.seed(123) 
toydat <- data.frame(A = 1:10, B = rnorm(10), C = LETTERS[1:10]) 
str(toydat) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 1 2 3 4 5 6 7 8 9 10 
## $ B: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 

set.seed(1) 
str(data.frame(apply(toydat[,1:2], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 2 variables: 
## $ A: num 3 4 6 10 3 9 10 7 7 1 
## $ B: num 1.5587 -0.2302 0.4609 0.0705 -1.2651 ... 

# with the factor column C  
set.seed(2) 
str(data.frame(apply(toydat[,1:3], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: Factor w/ 6 levels "10"," 2"," 5",..: 2 5 4 2 1 1 2 6 3 4 
## $ B: Factor w/ 8 levels " 0.129288","-0.230177",..: 8 7 6 2 1 5 3 7 1 4 
## $ C: Factor w/ 6 levels "B","D","E","G",..: 4 2 5 1 2 3 1 2 6 1
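The intermediate step can be inspected directly. The sketch below assumes a small made-up data frame (mixed is a hypothetical name) and looks at the matrix that a data frame is converted to before apply works on it. Note that since R 4.0.0 data.frame() no longer converts strings to factors by default, so on a recent R the columns would come back as character rather than factor; the types are lost either way.

```r
# Hypothetical data frame with integer, double and factor columns
mixed <- data.frame(A = 1:3,
                    B = c(0.5, 1.5, 2.5),
                    C = factor(c("x", "y", "z")))

# With the factor column included, the whole matrix becomes character
m_all <- as.matrix(mixed)
print(typeof(m_all))

# With only the numeric columns, the matrix stays numeric
m_num <- as.matrix(mixed[, 1:2])
print(typeof(m_num))
```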

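This is where the plyr package becomes useful, because you can control the output type (using the **ply functions). In this case, however, the colwise function is enough: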

require(plyr) 
set.seed(2) 
mysamplingfun <- colwise(function(x) sample(x, replace = TRUE)) 
str(mysamplingfun(toydat[,1:3])) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 2 8 6 2 10 10 2 9 5 6 
## $ B: num 1.715 1.559 -1.265 -0.23 0.129 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 7 4 9 2 4 5 2 4 10 2
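For readers who prefer to avoid an extra package, a base R sketch of the same idea follows. It is not part of the original answer: resample_columns and the toy data orig are hypothetical names introduced for illustration. It relies on lapply visiting each column separately, so no matrix conversion takes place and each column keeps its type. It then adds the 'class' attribute and binds the two sets with rbind, as described in the question.

```r
set.seed(42)

# Hypothetical toy data with mixed column types
orig <- data.frame(num = c(1.5, 2.5, 3.5, 4.5),
                   int = 1:4,
                   fac = factor(c("a", "b", "a", "c")))

# Resample every column independently, with replacement.
# Indexing by sampled positions keeps the type (and factor levels) of x.
resample_columns <- function(dat) {
  out <- lapply(dat, function(x) x[sample.int(length(x), replace = TRUE)])
  as.data.frame(out)
}

synth <- resample_columns(orig)

orig$class  <- 1   # original data
synth$class <- 2   # synthetic data

newData <- rbind(orig, synth)
str(newData)
```

Sampling positions with sample.int, rather than calling sample on the column itself, also sidesteps the special case where sample treats a single number n as the range 1:n.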

2012-08-05 22:55:35 dickoa

Yes, colwise does what I need. Thank you, I really appreciate the effort you put into helping me. – 2012-08-06 14:57:48
