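Merging data by rules using a list of data frames
I have a data frame that needs to be collapsed according to defined groups. The data consists of hundreds of groups, and each group can have anywhere from 2 to 5 rows. For simplicity, my example shows 3 groups with 2-4 rows each.
I want to collapse the replicates within each group. For each column in a group, I want to return the most frequently occurring value that is not NA. The problem is what to do when there is a tie. For ties, I need to set custom rules based on the type of the tied values. One possible last-resort option is to paste the tied values together, separated by commas, so that I can handle them afterwards with find/replace.
To get the highest count, I can use the max function. Any suggestions on how to handle ties?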
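The comma-separated fallback described above could look roughly like the sketch below. The helper name `collapse_ties` is an assumption made for illustration and does not appear in the original post.

```r
# Hypothetical helper (the name is an assumption): return the most frequent
# non-NA value; if several values tie, paste them together with commas.
collapse_ties = function(col) {
  z = table(as.character(col))               # table() drops NA by default
  if (length(z) == 0) return(NA_character_)  # every value was NA
  top = names(z)[z == max(z)]                # all values sharing the highest count
  paste(top, collapse = ",")
}

collapse_ties(c("A/B", "A/B", "A/A", "A/A"))  # "A/A,A/B"
collapse_ties(c("B/B", "B/B", NA))            # "B/B"
```

The pasted string keeps every tied value visible, so the custom rules can be applied later in a separate step.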
#Input Data Example
> data
Group Loc1 Loc2 Loc3 Loc4
1 Group1 A/B A/A B/B NA
2 Group1 A/B A/A B/B A/A
3 Group1 A/A A/A A/A NA
4 Group1 A/A A/A A/A NA
5 Group2 A/A NA C/C B/B
6 Group2 B/B A/A C/C B/B
7 Group2 B/B A/A C/C B/B
8 Group3 B/B B/B NA B/B
9 Group3 B/B B/B NA A/A
#Desired Collapsed Output
> data.collapsed
Group Loc1 Loc2 Loc3 Loc4
1 Group1 NA A/A A/B A/A
2 Group2 B/B A/A C/C B/B
3 Group3 B/B B/B NA A/B
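To see where the ties in the desired output come from, the counts for two Group1 columns can be tallied by hand. This is only a sketch; the vector names `loc1`, `loc4`, `top1` and `top4` are introduced here for illustration.

```r
# Group1 values copied from the input example
loc1 = c("A/B", "A/B", "A/A", "A/A")
loc4 = c(NA, "A/A", NA, NA)

z1 = table(loc1)   # A/A and A/B each occur twice: a tie
z4 = table(loc4)   # only A/A is counted; table() ignores NA by default

# Values that share the highest count in each column
top1 = names(z1)[z1 == max(z1)]   # "A/A" "A/B"
top4 = names(z4)[z4 == max(z4)]   # "A/A"
```

Loc1 therefore needs a tie rule, while Loc4 has a single winner even though three of its four values are NA.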
Final code (updated January 27, 2015)
library(data.table)
#Data Frame
#Each group has replicates of data that need to be collapsed to make a consensus data replicate
data = rbind(c("Group1","A/B", "A/A","B/B",NA), c("Group1","A/B", "A/A","B/B","A/A"), c("Group1","A/A", "A/A","A/A",NA),
c("Group1","A/A", "A/A","A/A",NA), c("Group2","A/A", NA,"C/C","B/B"), c("Group2","B/B", "A/A","C/C","B/B"),
c("Group2","B/B", "A/A","C/C","B/B"), c("Group3","B/B", "B/B",NA,"B/B"), c("Group3","B/B", "B/B",NA,"A/A"))
colnames(data) = c("Group", "Loc1", "Loc2", "Loc3", "Loc4")
data = as.data.frame(data)
data
#Define acceptable value types; these could be used to define what to do in the case of a tie
same.letter = c("A/A","B/B","C/C")
diff.letter = c("A/B","A/C","B/C")
#Function for collapsing data with rules
RepMerge = function(col) {
  z = table(as.character(col))                 # counts of non-NA values; as.character() drops unused factor levels
  if (length(z) == 0) return(NA_character_)    # every value in this group/column is NA
  top = names(z)[z == max(z)]                  # value(s) with the highest count
  if (length(top) == 1) return(top)            # one max value: report that value
  if (length(top) > 2) return(NA_character_)   # tied between more than 2 different values: report NA
  if (all(top %in% same.letter))               # both tied values are in 'same.letter': report a combination
    return(paste(substring(top[1], 1, 1), substring(top[2], 1, 1), sep = "/"))
  if (any(top %in% diff.letter)) return(NA_character_)  # one of the tied values is in 'diff.letter': report NA
  "Check Code"                                 # no rule matched
}
setDT(data)[,lapply(.SD,RepMerge),Group] # run function to collapse the data
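The two-way tie rule inside `RepMerge` can be tried in isolation. The sketch below assumes exactly two tied values that are both in `same.letter`; the helper name `combine_same` is hypothetical and not part of the code above.

```r
same.letter = c("A/A", "B/B", "C/C")

# Hypothetical helper: merge two tied same-letter values into one mixed value
# by taking the first letter of each.
combine_same = function(tied) {
  stopifnot(length(tied) == 2, all(tied %in% same.letter))
  paste(substring(tied, 1, 1), collapse = "/")
}

combine_same(c("A/A", "B/B"))  # "A/B"
combine_same(c("B/B", "C/C"))  # "B/C"
```

Because `table()` returns its names in sorted order, the combined value comes out in the same letter order as the entries of `diff.letter`.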
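Thanks, SC2
Regarding the "NA": yes, that occurred to me when I woke up this morning :-). I will correct it in my post. Thanks for the suggestion; I will try it and let you know how it works. Thanks! – SC2 2015-01-27 13:39:24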