Complex reshaping of a data frame in R

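I've got the results of some DNA sequencing procedures, which were not supplied in a particularly useful form. Currently, the first column of my data frame holds the species names, and the remaining columns contain all of the plate wells in which DNA from that species was detected. Here is a sample of the data table: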

Species     V1  V2  V3  V4  V5 
Eupeodes corollae  1-G3 1-F1 1-E11 1-C10 1-A3 
Diptera     1-A10 1-B2 1-C1 1-G7 1-E11 
Episyrphus balteatus 2-C3 2-A10 1-C11 1-A10 2-B4 
Aphidie     1-B9 1-D7 2-A3 1-C8 2-C11 
Ericaphis    1-B9 1-D7 2-A3 1-C8 2-C11 
Hemiptera    1-B9 1-D7 2-A3 1-C8 2-C11

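Ultimately, I'd like to end up with a data frame in which the first column contains all of the plate wells and the remaining columns contain all of the species identified in each well, like this: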

Well Species1    Species2    Species3 
1-A1 Eupeodes corollae  Ericaphis 
1-A2 Episyrphus balteatus 
1-A3 Aphidie   
1-A4 Hemiptera    Episyrphus balteatus Aphidie 
1-A5 Diptera

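I'm guessing this will be a two-step process. First, the table is reshaped to long format, with one record for each well-species match. Then, in a second stage, the records are merged so that each well appears only once in the first column, with all of the species found in that well listed next to the well name. However, I'm afraid this kind of complex reshaping is beyond my abilities in R. Can anyone offer some advice on how I might go about it?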

Source

2017-10-13 D.Hodgkiss

Your idea is sound, and there are plenty of packages that can do this quickly.

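In the tidyverse packages, the two operations you describe are encapsulated in the functions gather and spread, respectively. There is a really cool cheatsheet produced by R Studio that covers these kinds of data wrangling activities.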

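The trick with your data is that spread normally expects a unique set of columns: each value of the key column becomes exactly one output column, and your long table has no such key yet. The good news is that you can get around this in two ways: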

1. Create a placeholder variable for the new unique columns, and spread using the placeholder as the key

library(tidyr)
library(dplyr)

output <-
  input %>%
  # bring all of the data into a long table
  gather(Plate, Well, V1:V5) %>%
  # remove the column with the old column names,
  # this column will cause problems in spread if not removed
  select(-Plate) %>%
  # create the placeholder variable:
  # number the species 1, 2, 3, ... within each well
  group_by(Well) %>%
  mutate(NewColumn = seq(1, n())) %>%
  # spread the data out based on the new column header
  spread(NewColumn, Species)

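Depending on how you use the result, and whether you need to at all, you can rename the header columns either before or after the spread call.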
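For readers on tidyr 1.0.0 or later, where gather and spread are superseded by pivot_longer and pivot_wider, the placeholder approach can be sketched as follows. This is only an illustration: the three-row table is a made-up miniature loosely based on the question's sample (not the real data), and the names Slot and wide are chosen here for the example. The names_prefix argument takes care of the renaming step, producing Species1, Species2, and so on.

```r
library(tidyr)
library(dplyr)

# Illustrative toy data (an assumption, not the asker's real table):
# one row per species, one column per detection.
input <- tibble(
  Species = c("Diptera", "Episyrphus balteatus", "Aphidie"),
  V1 = c("1-A10", "2-C3", "1-B9"),
  V2 = c("1-B2", "1-A10", "1-A10")
)

wide <- input %>%
  # long format: one row per species-well match
  pivot_longer(V1:V2, names_to = "Plate", values_to = "Well") %>%
  # the old column names are not needed in the output
  select(-Plate) %>%
  # placeholder: number the species 1, 2, 3, ... within each well
  group_by(Well) %>%
  mutate(Slot = row_number()) %>%
  ungroup() %>%
  # one column per slot, named Species1, Species2, ...
  pivot_wider(names_from = Slot, values_from = Species,
              names_prefix = "Species") %>%
  arrange(Well)

print(wide)
```

Wells with fewer species than the busiest well get NA in the unused columns.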

OR:

2. Change the desired output slightly, so that you get one column per species

library(tidyr)
library(dplyr)

output <-
  input %>%
  # bring all of the data into a long table
  gather(Plate, Well, V1:V5) %>%
  # remove the column with the old column names,
  # this column will cause problems in spread if not removed
  select(-Plate) %>%
  # count the number of unique combinations of well and species
  count(Well, Species) %>%
  # spread out the counts
  # fill = 0 sets the values where no combinations exist to 0
  spread(Species, n, fill = 0)

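This gives you a different output, but I mention it because it makes it easier to see whether there are multiple instances of the same record (for example, the same species detected twice in one well), and it sets the data up nicely for future analysis.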
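The same count table can be sketched with pivot_longer and pivot_wider, assuming tidyr 1.0.0 or later. The small table below is again invented for illustration, and counts is a name chosen for this example. The fill value is passed as a named list, which is the form accepted since pivot_wider was introduced.

```r
library(tidyr)
library(dplyr)

# Illustrative toy data (an assumption, not the asker's real table)
input <- tibble(
  Species = c("Eupeodes corollae", "Diptera", "Episyrphus balteatus"),
  V1 = c("1-G3", "1-A10", "2-C3"),
  V2 = c("1-F1", "1-B2", "1-A10")
)

counts <- input %>%
  # long format: one row per species-well match
  pivot_longer(V1:V2, names_to = "Plate", values_to = "Well") %>%
  # number of times each species was detected in each well
  count(Well, Species) %>%
  # one column per species; combinations that never occur become 0
  pivot_wider(names_from = Species, values_from = n,
              values_fill = list(n = 0))

print(counts)
```

Any count above 1 flags a species that was recorded more than once for the same well.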

Data to reproduce the example, for reference:

input <- tibble(
  Species = c(
    "Eupeodes corollae",
    "Diptera",
    "Episyrphus balteatus",
    "Aphidie",
    "Ericaphis",
    "Hemiptera"
  ),
  # well IDs must match exactly (no stray whitespace) to group correctly
  V1 = c("1-G3", "1-A10", "2-C3", "1-B9", "1-B9", "1-B9"),
  V2 = c("1-F1", "1-B2", "2-A10", "1-D7", "1-D7", "1-D7"),
  V3 = c("1-E11", "1-C1", "1-C11", "2-A3", "2-A3", "2-A3"),
  V4 = c("1-C10", "1-G7", "1-A10", "1-C8", "1-C8", "1-C8"),
  V5 = c("1-A3", "1-E11", "2-B4", "2-C11", "2-C11", "2-C11")
)

Source

2017-10-13 16:01:44

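The cheatsheet is really useful for simpler problems, and your comprehensive answer is very helpful. I particularly appreciate the explanation of the first method. I've now been able to produce the table I was after.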
