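Is there a simple way to subset unique values together with their associated data?
I have a large data set with ~300,000 rows and 60 columns. Currently, when I want a subset of the unique values in one of my variables, I use the unique() function to create a data.frame listing all the unique values of that variable. I then match it against the main data frame to pull the associated data from my main file.
This process is a bit cumbersome, though, so I'm wondering whether there is a quicker way to do the same thing. For example, is there a function that selects the unique fields along with the data associated with those values?
For example: I want to create a new data frame that contains only the unique SurveyID_Block IDs and their associated island code and abundance.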
structure(list(SurveyID_Block = c("62003713_2", "62003087_2",
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003713_1",
"62003713_2", "62003713_2", "62003087_1", "62003713_1", "62003713_1",
"62003713_2", "62003713_2", "62003713_1", "62003087_1", "62003087_2",
"62003713_2", "62003713_2", "62003713_2", "62003087_2", "62003713_2",
"62003713_1", "62003713_1", "62003713_1", "62003713_1", "62003713_2",
"62003713_1", "62003713_2", "62003087_1", "62003713_2", "62003087_1",
"62003713_1", "62003087_2", "62003087_2", "62003713_2", "62003713_1",
"62003087_1", "62003713_1", "62003713_1", "62003713_1", "62003087_2",
"62003087_2", "62003713_2", "62003713_2", "62003713_2", "62003713_1",
"62003087_1", "62003713_2", "62003087_2", "62003713_1", "62003713_1",
"62003713_2", "62003713_1", "62003713_2", "62003087_2", "62003087_2",
"62003087_1", "62003087_1", "62003713_1", "62003087_1", "62003087_1",
"62003087_2", "62003087_2", "62003713_2", "62003713_1", "62003713_2",
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003087_1",
"62003713_1", "62003713_1", "62003087_1", "62003087_1", "62003713_1",
"62003087_2", "62003087_1", "62003087_2", "62003087_2", "62003087_1",
"62003087_1", "62003087_1", "62003713_2", "62003087_2", "62003713_2",
"62003087_2", "62003713_1", "62003713_1", "62003087_2", "62003087_1",
"62003087_1", "62003087_1", "62003713_2", "62003713_2", "62003087_1",
"62003713_1", "62003087_1", "62003087_2"), IslandCode = c(1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L
), totalAbun = c(667L, 174L, 667L, 667L, 715L, 667L, 715L, 667L,
667L, 1365L, 715L, 715L, 667L, 667L, 715L, 1365L, 174L, 667L,
667L, 667L, 174L, 667L, 715L, 715L, 715L, 715L, 667L, 715L, 667L,
1365L, 667L, 1365L, 715L, 174L, 174L, 667L, 715L, 1365L, 715L,
715L, 715L, 174L, 174L, 667L, 667L, 667L, 715L, 1365L, 667L,
174L, 715L, 715L, 667L, 715L, 667L, 174L, 174L, 1365L, 1365L,
715L, 1365L, 1365L, 174L, 174L, 667L, 715L, 667L, 667L, 667L,
715L, 667L, 1365L, 715L, 715L, 1365L, 1365L, 715L, 174L, 1365L,
174L, 174L, 1365L, 1365L, 1365L, 667L, 174L, 667L, 174L, 715L,
715L, 174L, 1365L, 1365L, 1365L, 667L, 667L, 1365L, 715L, 1365L,
174L)), .Names = c("SurveyID_Block", "IslandCode", "totalAbun"
), row.names = c(NA, 100L), class = "data.frame")
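What is the expected output? If 'SurveyID_Block' always has the same other attributes, wouldn't 'unique(df)' work? That gives me 4 rows. – Ananta
Hi Ananta. That does work! So my question is: how does the unique function know which column variable to select on? For example, why didn't it do this for IslandCode? – pr1g11
@pr1g11 Have you tried the "split" solution posted below, or the one in the dupe link? – akrun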
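On the question of how unique() picks a column: when it is given a data frame it does not pick one. It compares whole rows and drops any row that repeats an earlier row across every column. The sketch below is an illustration added for clarity, not code from any of the posters. It uses a six-row stand-in built from values in the sample data, and the names by_row, by_key and by_cols are made up for the example. It also shows duplicated() as one way to deduplicate on a single key column when the other columns are not guaranteed to match.

```r
# Small stand-in for the full data set (values taken from the sample above).
df <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode     = rep(1391L, 6),
  totalAbun      = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

# unique() on a data frame compares entire rows, so no column is named:
# a row is dropped only if every column matches an earlier row.
by_row <- unique(df)

# To deduplicate on one key column only, keep the first row seen per key.
by_key <- df[!duplicated(df$SurveyID_Block), ]

# To restrict the result to certain columns, select them before unique().
by_cols <- unique(df[, c("SurveyID_Block", "totalAbun")])
```

In this sample every SurveyID_Block always carries the same IslandCode and totalAbun, so by_row and by_key contain the same rows. They would differ if a block ID appeared with more than one abundance value: unique() would keep each distinct combination, while the duplicated() version would keep only the first row per ID.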