2016-02-05

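Is there an easy way to subset for unique values and their associated data?

I have a large dataset with ~300,000 rows and 60 columns. Currently, when I want a subset based on the unique values of one of my variables, I use the unique() function to build a data.frame listing all the unique values in that variable. I then match that against the main data frame to pull the associated data from my main file.

However, this process is a bit cumbersome, so I am wondering whether there is a faster way to do the same thing. For example, is there a function that selects the unique values of a field together with the data associated with those values?

For example: I want to create a new data frame that contains only the unique SurveyID_Block IDs and their associated island code and abundance.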

df1 <- structure(list(SurveyID_Block = c("62003713_2", "62003087_2", 
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003713_1", 
"62003713_2", "62003713_2", "62003087_1", "62003713_1", "62003713_1", 
"62003713_2", "62003713_2", "62003713_1", "62003087_1", "62003087_2", 
"62003713_2", "62003713_2", "62003713_2", "62003087_2", "62003713_2", 
"62003713_1", "62003713_1", "62003713_1", "62003713_1", "62003713_2", 
"62003713_1", "62003713_2", "62003087_1", "62003713_2", "62003087_1", 
"62003713_1", "62003087_2", "62003087_2", "62003713_2", "62003713_1", 
"62003087_1", "62003713_1", "62003713_1", "62003713_1", "62003087_2", 
"62003087_2", "62003713_2", "62003713_2", "62003713_2", "62003713_1", 
"62003087_1", "62003713_2", "62003087_2", "62003713_1", "62003713_1", 
"62003713_2", "62003713_1", "62003713_2", "62003087_2", "62003087_2", 
"62003087_1", "62003087_1", "62003713_1", "62003087_1", "62003087_1", 
"62003087_2", "62003087_2", "62003713_2", "62003713_1", "62003713_2", 
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003087_1", 
"62003713_1", "62003713_1", "62003087_1", "62003087_1", "62003713_1", 
"62003087_2", "62003087_1", "62003087_2", "62003087_2", "62003087_1", 
"62003087_1", "62003087_1", "62003713_2", "62003087_2", "62003713_2", 
"62003087_2", "62003713_1", "62003713_1", "62003087_2", "62003087_1", 
"62003087_1", "62003087_1", "62003713_2", "62003713_2", "62003087_1", 
"62003713_1", "62003087_1", "62003087_2"), IslandCode = c(1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L 
), totalAbun = c(667L, 174L, 667L, 667L, 715L, 667L, 715L, 667L, 
667L, 1365L, 715L, 715L, 667L, 667L, 715L, 1365L, 174L, 667L, 
667L, 667L, 174L, 667L, 715L, 715L, 715L, 715L, 667L, 715L, 667L, 
1365L, 667L, 1365L, 715L, 174L, 174L, 667L, 715L, 1365L, 715L, 
715L, 715L, 174L, 174L, 667L, 667L, 667L, 715L, 1365L, 667L, 
174L, 715L, 715L, 667L, 715L, 667L, 174L, 174L, 1365L, 1365L, 
715L, 1365L, 1365L, 174L, 174L, 667L, 715L, 667L, 667L, 667L, 
715L, 667L, 1365L, 715L, 715L, 1365L, 1365L, 715L, 174L, 1365L, 
174L, 174L, 1365L, 1365L, 1365L, 667L, 174L, 667L, 174L, 715L, 
715L, 174L, 1365L, 1365L, 1365L, 667L, 667L, 1365L, 715L, 1365L, 
174L)), .Names = c("SurveyID_Block", "IslandCode", "totalAbun" 
), row.names = c(NA, 100L), class = "data.frame") 

What is the expected output? If 'SurveyID_Block' always has the same other attributes, wouldn't 'unique(df)' work? That gives me 4 rows. – Ananta

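Hi Ananta. That does work! So my question is: how does the unique function know which column variable to select on? For example, why didn't it do this for IslandCode? – pr1g11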

@pr1g11 Did you try the `split` solution posted below, or the one in the dupe link? – akrun
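
To illustrate the question raised in the comments, here is a minimal sketch in R. It assumes a small hypothetical stand-in for the question's data with the same column names; the values are a hand-picked subset, not the full dataset. Called on a data frame, unique() compares whole rows, so it does not pick a column at all: a row is dropped only when every column matches an earlier row. IslandCode is constant here, so it adds no extra distinct combinations. To deduplicate on a single key column regardless of the others, duplicated() on that column is one option.

```r
# Hypothetical stand-in for the question's data (same column names, fewer rows).
df1 <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode     = rep(1391L, 6),
  totalAbun      = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

# unique() on a data frame keeps the first occurrence of each distinct row,
# where "distinct" means the combination of values across all columns.
res_all <- unique(df1)

# Deduplicate on one key column only: keep the first row for each key,
# even if the other columns differ between repeats of that key.
res_key <- df1[!duplicated(df1$SurveyID_Block), ]
```

The two results agree only because every SurveyID_Block in this sample always carries the same IslandCode and totalAbun. If a key could appear with different values in other columns, unique(df1) would keep all of those rows, while the duplicated() version would keep just the first.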

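Answers

We can split the dataset by 'SurveyID_Block' to create a list of data.frames. It is better to keep the datasets in a list than to create individual data frame objects in the global environment.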

lst <- split(df1, df1$SurveyID_Block) 
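
As a rough sketch of what the split() call returns, assuming a small hypothetical stand-in for df1 with the question's column names: the list has one element per distinct SurveyID_Block, named after that value. The second step, collapsing each piece with unique() and binding the pieces back together, is an illustrative addition and not part of the answer above.

```r
# Hypothetical stand-in for the question's data (same column names, fewer rows).
df1 <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode     = rep(1391L, 6),
  totalAbun      = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

# One data frame per distinct SurveyID_Block; list names are the block IDs.
lst <- split(df1, df1$SurveyID_Block)
names(lst)

# Illustrative extra step: reduce each piece to its distinct rows,
# then stack the pieces into a single data frame again.
uniq <- do.call(rbind, lapply(lst, unique))
```

Keeping the pieces in a list means the same operation can be applied to every block with lapply(), which is the main practical reason to prefer it over separate objects.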

However, if we do need to create individual datasets, that can also be done with list2env:

list2env(setNames(lst, paste0('dfN', seq_along(lst))), 
      envir=.GlobalEnv) 
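
A minimal sketch of what this produces, again assuming a small hypothetical stand-in for df1. To keep the example from cluttering the workspace, it writes into a scratch environment (`env`, an assumption of this sketch) instead of `.GlobalEnv`; the list2env() and setNames() calls are otherwise the same as above.

```r
# Hypothetical stand-in for the question's data (same column names, fewer rows).
df1 <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode     = rep(1391L, 6),
  totalAbun      = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

lst <- split(df1, df1$SurveyID_Block)

# Scratch environment used here in place of .GlobalEnv (sketch assumption).
env <- new.env()

# Rename the list elements dfN1, dfN2, ... and copy each into the environment.
list2env(setNames(lst, paste0("dfN", seq_along(lst))), envir = env)

# Names of the objects that were created.
ls(env)
```

The numbering follows the order of the list, which split() arranges by the levels of the grouping variable, so dfN1 is not necessarily the first block that appears in the data.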