tidyr::gather with missing data and na.rm

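Suppose I have multiple columns in a data frame that measure the same concept using different methods (for example, there are several kinds of IQ test, and a student may have taken any one of them, or none at all). I want to combine the various methods into a single column (an obvious use case for tidyr).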

If the data looks like this:

mydata <- data.frame(ID = 55:64, 
       age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17), 
       Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA), 
       Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA), 
       Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

I would naturally run something like this (note that I use na.rm = TRUE so that the many, many NAs in my data set do not each get their own row):

library(tidyr) 
tests <- gather(mydata, key=IQSource, value=IQValue, c(Test1, Test2, Test3), na.rm = TRUE) 
tests

Which gives me:

   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
15 59  20    Test2     100
16 60  10    Test2     120
27 61  13    Test3     110
29 63  18    Test3      85
30 64  17    Test3     150

The problem is that I have one student (ID = 62) who has no IQ score from any of the tests, and I don't want to lose her other data (the ID and age columns).
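A quick way to confirm which students disappear is to compare the IDs before and after gathering. This is only a minimal sketch that rebuilds `mydata` from above; `dropped_ids` is a name introduced here for illustration.

```r
library(tidyr)

# Same data as in the question
mydata <- data.frame(ID = 55:64,
       age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
       Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
       Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
       Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Gather with na.rm = TRUE, exactly as in the question
tests <- gather(mydata, key = IQSource, value = IQValue, c(Test1, Test2, Test3), na.rm = TRUE)

# IDs present in the original data but absent after gathering
dropped_ids <- setdiff(mydata$ID, tests$ID)
dropped_ids
# expected: 62
```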

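Is there a way in tidyr to say: yes, I want to drop the NAs for anyone who has data in at least one of the gathered columns, but at the same time keep the rows of anyone for whom all of the gathered columns are NA?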


2017-05-25 Joy

If each student has only one IQ test...

library(tidyverse) 

mydata %>% 
    gather(key=IQSource, value=IQValue, Test1:Test3) %>% 
    group_by(ID) %>% 
    arrange(IQValue) %>% 
    slice(1)

   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
5  59  20    Test2     100
6  60  10    Test2     120
7  61  13    Test3     110
8  62  15    Test1      NA
9  63  18    Test3      85
10 64  17    Test3     150

If students can each have multiple IQ tests...

mydata %>% 
    # Add an ID with multiple IQ tests 
    bind_rows(data.frame(ID=65, age=13, Test1=100, Test2=100, Test3=NA)) %>% 
    gather(key=IQSource, value=IQValue, Test1:Test3) %>% 
    group_by(ID) %>% 
    filter(!is.na(IQValue) | all(is.na(IQValue))) %>% 
    filter(all(!is.na(IQValue)) | !duplicated(IQValue)) %>% 
    arrange(ID, IQSource)

   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
5  59  20    Test2     100
6  60  10    Test2     120
7  61  13    Test3     110
8  62  15    Test1      NA
9  63  18    Test3      85
10 64  17    Test3     150
11 65  13    Test1     100
12 65  13    Test2     100
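For what it's worth, the two filter steps could probably be collapsed into one condition: keep every non-NA score, and for a student whose scores are all NA keep only the first row. The sketch below is a variation and not part of the original answer; `extra` and `result` are names introduced here, and `row_number()` is used on the assumption that any single placeholder row per all-NA student is acceptable.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
mydata <- data.frame(ID = 55:64,
       age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
       Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
       Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
       Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Hypothetical extra student with two IQ tests, as in the answer above
extra <- data.frame(ID = 65, age = 13, Test1 = 100, Test2 = 100, Test3 = NA)

result <- mydata %>%
    bind_rows(extra) %>%
    gather(key = IQSource, value = IQValue, Test1:Test3) %>%
    group_by(ID) %>%
    # Keep real scores; for an all-NA student keep just the first row
    filter(!is.na(IQValue) | (all(is.na(IQValue)) & row_number() == 1)) %>%
    ungroup() %>%
    arrange(ID, IQSource)

result
```

With the example data this should keep both scores for ID 65 and a single NA row for ID 62.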


2017-05-25 23:09:06 eipi10

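I chose this as the accepted answer because it is simple, sticks to the tidyverse, and extends beyond the original request. All of the answers given were great and very helpful, though! Thanks, everyone! – Joy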

I think this will do the trick for you:

library(dplyr)
library(tidyr)

# Make another data frame which has just ID and whether or not they missed all 3 tests
missing <- mydata %>%
    mutate(allNA = is.na(Test1) & is.na(Test2) & is.na(Test3)) %>%
    select(ID, allNA)

# Gather and keep NAs
tests <- gather(mydata, key = IQSource, value = IQValue, c(Test1, Test2, Test3), na.rm = FALSE)

# Keep the rows that have an IQValue or belong to someone who missed all tests
tests <- left_join(tests, missing, by = "ID") %>%
    filter(!is.na(IQValue) | allNA)

# Remove duplicated rows of individuals who missed all exams
tests <- tests[!is.na(tests$IQValue) | !duplicated(tests[["ID"]]), ]


2017-05-25 21:26:36 svenhalvorson

I didn't find a direct solution, but you can right_join back to the original data.frame and then deselect all the columns you don't need.

library(tidyr) 
library(dplyr) 

mydata %>% 
    gather(key, val, Test1:Test3, na.rm = T) %>% 
    right_join(mydata) %>% 
    select(-contains("Test")) 
#> Joining, by = c("ID", "age") 
#> ID age key val 
#> 1 55 12 Test1 100 
#> 2 56 12 Test1 90 
#> 3 57 14 Test1 88 
#> 4 58 11 Test1 115 
#> 5 59 20 Test2 100 
#> 6 60 10 Test2 120 
#> 7 61 13 Test3 110 
#> 8 62 15 <NA> NA 
#> 9 63 18 Test3 85 
#> 10 64 17 Test3 150

Alternatively, you can of course first create a data.frame with all the variables you want to keep, and then join to that:

id_data <- select(mydata, ID, age) 

mydata %>% 
    gather(key, val, Test1:Test3, na.rm = T) %>% 
    right_join(id_data)
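As a side note that is not part of the original answer: in tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), whose values_drop_na argument plays the role of na.rm. The same join idea might then look roughly like the sketch below; `long` and `result` are names introduced here for illustration, and a left_join from the ID/age columns is used in place of the right_join.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
mydata <- data.frame(ID = 55:64,
       age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
       Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
       Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
       Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Long format with the NA scores dropped (ID 62 disappears here)
long <- mydata %>%
    pivot_longer(Test1:Test3, names_to = "IQSource", values_to = "IQValue",
                 values_drop_na = TRUE)

# Join back onto every ID/age pair so students without any score are kept
result <- mydata %>%
    select(ID, age) %>%
    left_join(long, by = c("ID", "age"))

result
```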


2017-05-25 21:26:57

Welp, this is a lot better than mine. Thanks! – svenhalvorson
