Create a mapping table of duplicated IDs / keys

I have a statistical routine that does not like exact row duplicates (ignoring the ID), because they result in null distances.

So I first detect the duplicates and remove them, apply my routine, and then merge the removed records back in.

For simplicity, consider the rownames to be the ID / key.

I found the following way to achieve my result in base R:

data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3)) 

# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows 
dup1 <- duplicated(data) 
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]] 

# keep only those records that do have duplicates, to prevent running the following steps on all rows
datadup <- data[dupID,] 

# "hash" row 
rowhash <- apply(datadup, 1, paste, collapse="_") 

idmaps <- split(rownames(datadup),rowhash) 
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))

This gives me what I want, i.e. the deduplicated data (easy) and a mapping table.

> (data <- data[!dup1,])
  x y
1 1 1
4 2 4
6 3 3
> idmaptable
      mappedid otherids
1_1.1        1        2
1_1.2        1        3
2_4          4        5
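
The "merge the records back" step described above is not shown in the question. A minimal base R sketch of one way to do it follows, assuming the routine returns one value per deduplicated row. The row-sum routine and the names `dedup`, `keepid` and `full` are hypothetical placeholders, not part of the original code.

```r
# Rebuild the example data and the mapping table from the question
data <- data.frame(x = c(1, 1, 1, 2, 2, 3), y = c(1, 1, 1, 4, 4, 3))
dup1 <- duplicated(data)
datadup <- data[dup1 | duplicated(data, fromLast = TRUE), ]
rowhash <- apply(datadup, 1, paste, collapse = "_")
idmaps <- split(rownames(datadup), rowhash)
idmaptable <- do.call("rbind", lapply(idmaps, function(vec)
  data.frame(mappedid = vec[1], otherids = vec[-1], stringsAsFactors = FALSE)))

# Deduplicate, then apply a placeholder routine (hypothetical: row sums)
dedup <- data[!dup1, ]
dedup$res <- unname(rowSums(dedup[, c("x", "y")]))

# Each original row maps to itself unless it is listed in otherids,
# in which case it maps to the retained row given by mappedid
keepid <- rownames(data)
m <- match(keepid, idmaptable$otherids)
keepid[!is.na(m)] <- idmaptable$mappedid[m[!is.na(m)]]

# Look up each row's result through its representative row
full <- data
full$res <- dedup[keepid, "res"]
```

Here `full` has the same six rows as the original data, and every duplicate carries the result computed for the row that was kept.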

I wonder whether there is a simpler or more efficient method (data.table / dplyr accepted). Any alternative suggestions?

Source

2017-08-03 Eric Lecoutre

With data.table ...

library(data.table) 
setDT(data) 

# tag groups of dupes 
data[, g := .GRP, by=x:y] 

# do whatever analysis 
f = function(DT) Reduce(`+`, DT) 
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][] 

# "update join" the results back to the main table if needed 
data[resDT, on=.(g), res := i.res ]

The OP skipped the core part of the example (the actual use of the deduplicated data), so I just made up f.

Source

2017-08-03 14:30:40 Frank

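Thanks! Impressively concise. I am going to validate this one and rewrite part of my code to use `data.table`. What if I want to specify the "by" columns in another way? I will have a global ID column (set as key) that I will first have to exclude from the process, since my duplicate-mapping process obviously has to work without this ID column. –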

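@Eric Sure. You can do `cols = setdiff(names(data), "ID")` and then pass `by = cols` and `.SDcols = cols`. `data.table` has various options for passing these arguments. There are a lot. I also have a list in my notes at http://franknarf1.github.io/r-tutorial/_book/tables.html#program-tables under "Specifying columns". – Frank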
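
The suggestion in the comment above can be sketched as follows. This is only an illustration under assumptions: the `ID` column with letter values, the table name `DT` and the placeholder function `f` are hypothetical, and the rest mirrors the answer's code with `by = cols` and `.SDcols = cols` in place of `x:y`.

```r
library(data.table)

# Example data with a hypothetical global ID column set as key
DT <- data.table(ID = letters[1:6],
                 x = c(1, 1, 1, 2, 2, 3),
                 y = c(1, 1, 1, 4, 4, 3))
setkey(DT, ID)

# Every column except the ID takes part in duplicate detection
cols <- setdiff(names(DT), "ID")

# Tag groups of duplicates, passing the columns as a character vector
DT[, g := .GRP, by = cols]

# One representative row per group, with a placeholder analysis f
f <- function(d) Reduce(`+`, d)
resDT <- unique(DT, by = "g")[, res := f(.SD), .SDcols = cols][]

# Update join: copy the results back to all rows of the full table
DT[resDT, on = .(g), res := i.res]
```

Because `cols` is computed before `g` is added, the group tag itself never ends up among the columns used for grouping or analysis.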

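A solution using the tidyverse. I usually do not store information in row names, so I created ID and ID2 to hold that information. Of course, you can change this according to your needs.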

library(tidyverse) 

idmaptable <- data %>% 
    rowid_to_column() %>% 
    group_by(x, y) %>% 
    filter(n() > 1) %>% 
    unite(ID, x, y) %>% 
    mutate(ID2 = 1:n()) %>% 
    group_by(ID) %>% 
    mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>% 
    spread(ID_type, rowid) %>% 
    fill(mappedid) %>% 
    drop_na(otherids) %>% 
    mutate(ID2 = 1:n()) 

idmaptable 
# A tibble: 3 x 4
# Groups:   ID [2]
  ID      ID2 mappedid otherids
  <chr> <int>    <int>    <int>
1 1_1       1        1        2
2 1_1       2        1        3
3 2_4       1        4        5

Source

2017-08-03 13:58:41 www

Thanks. Works well! I will validate the data.table option, since I finally intend to use that package. –

Note that the operations are tricky, in the sense that there are many steps and the logic is not easy to read / decompose / understand! –

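Thanks for your comment. Whether it is tricky or not depends on the user. In my solution, each step is a function that does one thing and one thing only. If you know what each function stands for, you can "read it aloud". To me, those concise approaches are sometimes too "compact". – www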
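
For readers who find the pipeline above long, a shorter dplyr variant is possible. The sketch below is not the answerer's method. It assumes the first row of each duplicate group is the one to keep, and uses `first()` within groups to produce the same `mappedid` / `otherids` pairs.

```r
library(dplyr)
library(tibble)

data <- data.frame(x = c(1, 1, 1, 2, 2, 3), y = c(1, 1, 1, 4, 4, 3))

idmaptable <- data %>%
  rowid_to_column("rowid") %>%            # explicit row ID instead of rownames
  group_by(x, y) %>%                      # one group per distinct row content
  mutate(mappedid = first(rowid)) %>%     # first row of the group is retained
  ungroup() %>%
  filter(rowid != mappedid) %>%           # keep only the removed duplicates
  select(mappedid, otherids = rowid)
```

Rows without duplicates drop out automatically, because their `rowid` equals their `mappedid`.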

Some improvements to your base R solution,

df <- data[duplicated(data) | duplicated(data, fromLast = TRUE), ]

do.call(rbind, lapply(split(rownames(df), do.call(paste, c(df, sep = '_'))),
                      function(i) data.frame(mapped = i[1],
                                             others = i[-1],
                                             stringsAsFactors = FALSE)))

which gives,

      mapped others
1_1.1      1      2
1_1.2      1      3
2_4        4      5

and of course,

unique(data) 

  x y
1 1 1 
4 2 4 
6 3 3

Source

2017-08-03 14:54:42 Sotos

Shorter indeed. –
