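R: Combine identical identifiers in a data frame

I have a data frame with 2 columns, one with identifiers (ID) and one with names (Names). Each identifier occurs several times in the ID column (see below).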

ID   Names 
uc001aag.1 DKFZp686C24272 
uc001aag.1 DQ786314 
uc001aag.1 uc001aag.1 
uc001aah.2 AK056232 
uc001aah.2 FLJ00038 
uc001aah.2 uc001aah.1 
uc001aah.2 uc001aah.2 
uc001aai.1 AY217347

Now I want to create a data frame like this:

ID   Names 
uc001aag.1 DKFZp686C24272 | DQ786314 | uc001aag.1 
uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2 
uc001aai.1 AY217347

Can anyone help me?


2011-05-10 Lisann

aggregate is quite fast, but you can use an sapply solution to parallelize the code. This is easy to do on Windows using snowfall:

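library(snowfall)                  # parallel apply wrappers
sfInit(parallel=TRUE,cpus=2)       # start 2 worker processes
sfExport("Data")                   # copy Data to every worker

ID <- unique(Data$ID)
# paste together all Names that belong to each ID
CombNames <- sfSapply(ID,function(i){
    paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
sfStop()                           # shut the workers down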

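The parallel version gives you an extra speedup, but the single sapply solution is actually slower than aggregate. tapply is a bit faster, but cannot be parallelized using snowfall. On my computer: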

n <- 3000 
m <- 3 
Data <- data.frame(ID = rep(1:n,m), 
        Names=rep(LETTERS[1:m],each=n)) 
# using snowfall for parallel sapply  
system.time({
    ID <- unique(Data$ID)
    CombNames <- sfSapply(ID,function(i){
        paste(Data$Names[Data$ID==i],collapse=" | ")
    })
    data.frame(ID,CombNames)
})
    user system elapsed 
    0.02 0.00 0.33 

# using tapply 
system.time({ 
    CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ") 
    data.frame(ID=names(CombNames),CombNames) 
}) 
    user system elapsed 
    0.44 0.00 0.44 

# using aggregate 
system.time(
    aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ") 
) 
    user system elapsed 
    0.47 0.00 0.47 

# using the normal sapply 
system.time({
    ID <- unique(Data$ID)
    CombNames <- sapply(ID,function(i){
        paste(Data$Names[Data$ID==i],collapse=" | ")
    })
    data.frame(ID,CombNames)
})
    user system elapsed 
    0.75 0.00 0.75
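
Before comparing timings, it is worth checking that the three serial approaches really return the same strings. The sketch below does that on a smaller copy of the benchmark data. The reduced n and the names by_tapply, by_aggregate, by_sapply and all_equal are my own choices for illustration and are not part of the original benchmark.

```r
# Smaller version of the benchmark data (n is reduced only to keep this quick).
n <- 300
m <- 3
Data <- data.frame(ID = rep(1:n, m),
                   Names = rep(LETTERS[1:m], each = n),
                   stringsAsFactors = FALSE)

# 1. tapply: one pasted string per ID, returned as a named 1-d array
by_tapply <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# 2. aggregate: returns a data frame with columns ID and Names
by_aggregate <- aggregate(Names ~ ID, data = Data, FUN = paste, collapse = " | ")

# 3. plain sapply over the unique IDs
ID <- unique(Data$ID)
by_sapply <- sapply(ID, function(i) {
  paste(Data$Names[Data$ID == i], collapse = " | ")
})

# Compare the bare character vectors (names and dims stripped)
all_equal <- identical(as.vector(by_tapply), by_aggregate$Names) &&
  identical(by_aggregate$Names, unname(by_sapply))
print(all_equal)
```

The comparison relies on the IDs here already being in sorted order. With unsorted IDs, tapply and aggregate return the groups sorted while the sapply version follows unique(Data$ID), so the results would need to be matched by ID first.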

Note:

For the record, a better sapply solution would be:

CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ") 
data.frame(ID=names(CombNames),CombNames)

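This is equivalent to tapply. But parallelizing this one is actually slower, because you have to move more data around within sfSapply. The speed of the parallel version comes from copying the dataset to every cpu. Keep this in mind when your dataset is huge: you pay for the speed with more memory usage.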
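
A small sketch of both points, under stated assumptions: it checks that the split/sapply version matches tapply, and it makes a rough estimate of the extra memory when every worker holds its own copy of Data. The variable names and the cpus value are illustrative only, and the estimate ignores the overhead of the worker processes themselves.

```r
Data <- data.frame(ID = rep(1:1000, 3),
                   Names = rep(LETTERS[1:3], each = 1000),
                   stringsAsFactors = FALSE)

# Same grouping done two ways
via_split  <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")
via_tapply <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# Identical values and identical group names
same_result <- identical(unname(via_split), as.vector(via_tapply)) &&
  identical(names(via_split), names(via_tapply))
print(same_result)

# Rough memory estimate, assuming one full copy of Data per worker
# (as happens with sfExport("Data")); cpus = 2 mirrors the sfInit() call above.
cpus <- 2
approx_bytes <- as.numeric(object.size(Data)) * cpus
print(approx_bytes)
```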


2011-05-10 12:14:32

You can use aggregate:

R> aggregate(Names ~ ID, data=tmp, FUN=paste, collapse=" | ") 
      ID           Names 
1 uc001aag.1  DKFZp686C24272 | DQ786314 | uc001aag.1 
2 uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2 
3 uc001aai.1          AY217347
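
To make this reproducible, here is a minimal sketch that rebuilds the data from the question and runs the same call. The object name tmp follows the call above; stringsAsFactors = FALSE and the name res are my additions.

```r
# Rebuild the example data from the question.
tmp <- data.frame(
  ID = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
         "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
         "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# Arguments after FUN (here collapse) are passed on to paste(),
# so each group of Names becomes a single " | "-separated string.
res <- aggregate(Names ~ ID, data = tmp, FUN = paste, collapse = " | ")
print(res)
```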


2011-05-10 07:47:30 rcs

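@rcs, this approach works well, but I have a very large dataset. Is there a way to speed up the analysis? Thanks – Lisann 2011-05-10 08:18:13

Maybe parallelize with 'ddply' from the 'plyr' package: 'ddply(tmp, .(ID), function(x) paste(x$Names, collapse = "|"), .parallel = TRUE)' – rcs 2011-05-10 08:31:14

That code from the plyr package gives me this error: Loading required package: foreach Error: foreach package required for parallel plyr operation In addition: Warning message: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, : there is no package called 'foreach' – Lisann 2011-05-10 08:42:20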
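
The error message says the foreach package is not installed. A minimal sketch for checking this before trying .parallel = TRUE follows. The variable names are illustrative. As far as I know, plyr also expects a foreach parallel backend to be registered, so installing foreach alone may not be enough to get an actual speedup.

```r
# Packages the parallel ddply call depends on
needed <- c("plyr", "foreach")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "),
          ". Install them with install.packages() before using .parallel = TRUE.")
} else {
  message("plyr and foreach are both available.")
}
```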
