2017-06-02 96 views
1

我正在爲學校進行任務。我需要使用for循環和cars包中的bcPower函數來轉換數據幀中的列。我的數據框命名爲bb2.df,其中包含337列球員的13列棒球統計數據。的數據是從: http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txtFor R Loop中的每個列使用Box Cox從汽車軟件包中變換出來

我在使用讀取數據:

bb.df <- read.fwf("baseball.dat.txt",widths=c(4,6,6,4,4,3,3,3,4,4,4,3,3,2,2,2,2,19)) 

然後我創建只是使用數字統計的第二數據幀:

bb2.df <- bb.df[,1:13] 

library(cars) 

然後我不成功試圖構建for循環。

> bb2.df[[i]] <- bcPower(bb2.df[[i]],c) 
> for (i in 1:ncol(bb2.df)) { 
+ c <- coef(powerTransform(bb2.df[[i]])) 
+ bb2.df[[i]] <- bcPower(bb2.df[[i]],c) 
+ } 
Error in bc1(out[, j], lambda[j]) : 
    First argument must be strictly positive. 

循環似乎轉換前三列但停止。 我在做什麼錯?

+0

困難沒有數據 – HubertL

+1

說你真的必須使用一個for循環?申請家庭的職能往往是更好的選擇。 –

+0

是的,我必須使用for循環。我試圖查找如何去做,每個人似乎都想使用apply函數。 – Tim

回答

1

該溶液的柱

  • 測試是否似乎含有邏輯值和從變換
  • 替換載體零個值用少量省略它們中,實際值的範圍之外
  • 將轉換值存儲在新的數據框中,保留列和行的名稱

我還測試了所有正常性befo在改造之後。我試圖找到一個有趣的變量,即變換後的變量對於Shapiro測試具有很大的p值,但是在p值中也有很大的變化。最後,感興趣的變量在原始版本和變換版本中都被縮放,並且兩個版本被覆蓋在密度圖上。

library(car); library(ggplot2); library(reshape2) 

# see this link for column names and type hints 
# http://ww2.amstat.org/publications/jse/datasets/baseball.txt 

# add placeholder column for opening quotation mark 
bb.df <- 
    read.fwf(
    "http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt", 
    widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 2, 17) 
) 
# remove placeholder column 
bb.df <- bb.df[,-(ncol(bb.df) - 1)] 

names(bb.df) <- make.names(
    c(
    'Salary', 'Batting average', 'OBP', 'runs', 'hits', 'doubles', 'triples', 
    'home runs', 'RBI', 'walks', 'strike-outs', 'stolen bases', 'errors', 
    "free agency eligibility", "free agent in 1991/2" , 
    "arbitration eligibility", "arbitration in 1991/2", 'name' 
) 
) 

# test for boolean/logical values... don't try to transform them 
logicals.test <- apply(
    bb.df, 
    MARGIN = 2, 
    FUN = function(one.col) { 
    asnumeric <- as.numeric(one.col) 
    aslogical <- as.logical(asnumeric) 
    renumeric <- as.numeric(aslogical) 
    matchflags <- renumeric == asnumeric 
    cant.be.logical <- any(!matchflags) 
    print(cant.be.logical) 
    } 
) 

logicals.test[is.na(logicals.test)] <- FALSE 

probably.numeric <- bb.df[, logicals.test] 

result <- apply(probably.numeric, MARGIN = 2, function(one.col) 
{ 
    # can't transform vectors containing non-positive values 
    # replace zeros with something small 
    non.zero <- one.col[one.col > 0] 
    small <- min(non.zero)/max(non.zero) 
    zeroless <- one.col 
    zeroless[zeroless == 0] <- small 
    c <- coef(powerTransform(zeroless)) 
    transformation <- bcPower(zeroless, c) 
    return(transformation) 
}) 

result <- as.data.frame(result) 

row.names(result) <- bb.df$name 

cols2test <- names(result) 

normal.before <- sapply(cols2test, function(one.col) { 
    print(one.col) 
    temp <- shapiro.test(bb.df[, one.col]) 
    return(temp$p.value) 
}) 

normal.after <- sapply(cols2test, function(one.col) { 
    print(one.col) 
    temp <- shapiro.test(result[, one.col]) 
    return(temp$p.value) 
}) 

more.normal <- cbind.data.frame(normal.before, normal.after) 
more.normal$more.normal <- 
    more.normal$normal.after/more.normal$normal.before 

more.normal$interest <- 
    more.normal$normal.after * more.normal$more.normal 

interesting <- 
    rownames(more.normal)[which.max(more.normal$interest)] 

data2plot <- 
    cbind.data.frame(bb.df[, interesting], result[, interesting]) 
names(data2plot) <- c("original", "transformed") 
data2plot <- scale(data2plot) 
data2plot <- melt(data2plot) 
names(data2plot) <- c("Var1", "dataset", interesting) 

ggplot(data2plot, aes(x = data2plot[, 3], fill = dataset)) + 
    geom_density(alpha = 0.25) + xlab(interesting) 

**Previous, incomplete response:**

原始的,不完整的答案:

我相信你正在試圖做非法的功率轉換(矢量包括非正值,特別是零;無差異向量)

事實上,你是複製bb.df到bb2.df然後覆蓋是一個肯定的跡象,你應該真的使用apply。

這不會創建一個有用的數據幀,但它應該讓你開始,

library(car) 

bb.df <- 
    read.fwf(
    "baseball.dat.txt", 
    widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 19) 
) 

bb.df[bb.df == 0] <- NA 

# skip last (text) col 
for (i in 1:(ncol(bb.df) - 1)) { 
    print(i) 
    # use comma to indicate indexing by column 
    temp <- bb.df[, i] 
    temp[temp == 0] <- NA 
    temp <- temp[complete.cases(temp)] 
    if (length(unique(temp)) > 1) { 
    c <- coef(powerTransform(bb.df[, i])) 
    print(bcPower(bb.df[i], c)) 
    } else { 
    print(paste0("column ", i, " is invariant")) 
    } 

} 

# apply solution 

result <- apply(bb.df[,-ncol(bb.df)], MARGIN = 2, function(one.col) 
{ 
    temp <- one.col 
    temp[temp == 0] <- NA 
    temp <- temp[complete.cases(temp)] 
    if (length(unique(temp)) > 1) { 
    c <- coef(powerTransform(temp)) 
    transformation <- bcPower(temp, c) 
    return(transformation) 
    } else 
    { 
    print("skipping invariant column") 
    return(NULL) 
    } 
})