2016-08-12 54 views
3

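Using R - Condensing multiple columns into a new column without repeating contents

I'm a botanist and a beginner R user, and I wonder if you could help me find a solution for a script I'm writing. I've been using R to streamline the process of creating text from spreadsheets. For that I use the MonographaR package, and I'm fine with it. The problem is in handling the data.frame. My spreadsheet (a CSV file) basically consists of species columns, character rows, and the cells where they intersect. I'd like a final script that lets me merge two or more columns into a new column of the original spreadsheet. When the cells have different contents, the new cell must hold the separate contents divided by a comma plus a space (", "). When the cells have the same content, the new cell must hold that content only once, without repeating it. The scripts I tried to write with concatenation, cbind and so on repeated the cell contents, and I'm not happy with that.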

My initial CSV looks like this:

 cattleya.minor cattleya.maxima cattleya.pumila 
colour red   red    red 
surface sharp   smooth   sharp 
leaves 1    3    4 

and I would like the final result to look like this:

 cattleya  cattleya.minor cattleya.maxima cattleya.pumila 
colour red   red   red    red 
surface sharp, smooth sharp   smooth   sharp 
leaves 1, 3, 4  1    3    4 

Thank you very much indeed.

+3

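Your data isn't [tidy](http://vita.had.co.nz/papers/tidy-data.pdf), because you've got different types of data (strings, integers) within the same column. It would be better to transpose the data, so that each column is a variable and each row is an observation. – alistaire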
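
A minimal base R sketch of that transposition, assuming the spreadsheet has already been read into a data frame with the characters as row names. The object names `dat` and `tidy` are only illustrative, and the data is rebuilt by hand here rather than read from the CSV:

```r
# Example data rebuilt by hand (an assumption: every cell is stored as text)
dat <- data.frame(
  cattleya.minor  = c("red", "sharp", "1"),
  cattleya.maxima = c("red", "smooth", "3"),
  cattleya.pumila = c("red", "sharp", "4"),
  row.names = c("colour", "surface", "leaves"),
  stringsAsFactors = FALSE
)

# Transpose so that each row is a species and each column is a character
tidy <- as.data.frame(t(dat), stringsAsFactors = FALSE)

# After transposing, each column holds a single type, so it can be converted
tidy$leaves <- as.integer(tidy$leaves)

str(tidy)
```

Once every column holds one kind of value, numeric characters such as the leaf count can be treated as numbers instead of text.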

Answers

1

As @alistaire commented, things will be easier if you start with "tidy" data.

# Starting data (which I've called "dat") 
dat 
 cattleya.minor cattleya.maxima cattleya.pumila 
colour    red    red    red 
surface   sharp   smooth   sharp 
leaves    1    3    4

library(reshape2)
library(tibble) 
library(dplyr) 

# Make data tidy 
dat.tidy = dat %>% 
    rownames_to_column(var="Characteristic") %>%    # Turn rownames into a data column 
    melt(id.var="Characteristic", variable.name="Species") %>% # Reshape to "long" format 
    dcast(Species ~ Characteristic)        # Cast back to wide so that each characteristic gets its own column 

dat.tidy  
  Species colour leaves surface 
1 cattleya.minor red  1 sharp 
2 cattleya.maxima red  3 smooth 
3 cattleya.pumila red  4 sharp

# Summarize by genus
dat.tidy %>% 
    group_by(Genus=gsub("(.*)\\..*","\\1",Species)) %>%  # Collapse to genus (remove species designation) 
    summarise_all(funs(paste(unique(.), collapse=", "))) %>% # For each characteristic, paste together each unique value for a given genus
    select(-Species)

 Genus colour leaves  surface
1 cattleya red 1, 3, 4 sharp, smooth 
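
If the goal is the exact layout from the question (a new combined column placed in front of the original species columns), a base R sketch along these lines may also work. It is not part of the approach above: `collapse_unique` is a hypothetical helper written for this illustration, and the data is rebuilt by hand on the assumption that every cell is text.

```r
# Example data rebuilt by hand (an assumption: every cell is stored as text)
dat <- data.frame(
  cattleya.minor  = c("red", "sharp", "1"),
  cattleya.maxima = c("red", "smooth", "3"),
  cattleya.pumila = c("red", "sharp", "4"),
  row.names = c("colour", "surface", "leaves"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, keep every distinct value once
# (in order of first appearance) and join the values with ", "
collapse_unique <- function(df, cols, sep = ", ") {
  apply(df[, cols, drop = FALSE], 1, function(row) {
    paste(unique(trimws(row)), collapse = sep)
  })
}

# Pick the columns that belong to the genus and prepend the combined column
genus_cols <- grep("^cattleya\\.", names(dat), value = TRUE)
result <- data.frame(
  cattleya = collapse_unique(dat, genus_cols),
  dat,
  stringsAsFactors = FALSE
)

result
```

Because `unique()` is applied row by row before `paste()`, a row where all species share a value (such as colour) ends up with that value only once.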
0

Thank you, @alistaire & @eipi10!

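Eipi10, I'm glad to be getting close to my goal. I ran the script exactly as you suggested, with the same dataset. It worked very well, but there seems to be a small problem in the last block of commands, on the select(-Species) line. Would you check it? R returns the following: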

> dat <- read.csv("dat.csv") 
> dat 
     cattleya.minor cattleya.maxima cattleya.pumila 
color    red    red    red 
surface   sharp   smooth   sharp 
leaves    1    3    4 
> 
> # Make data tidy 
> dat.tidy = dat %>% 
+ rownames_to_column(var="Characteristic") %>%    # Turn  rownames into a data column 
+ melt(id.var="Characteristic", variable.name="Species") %>% # Reshape to "long" format 
+ dcast(Species ~ Characteristic)        # Cast back to wide so that each characteristic gets its own column 
Warning message: 
attributes are not identical across measure variables; they will be dropped 
> 
> dat.tidy 
      Species color leaves surface 
1 cattleya.minor red  1 sharp 
2 cattleya.maxima red  3 smooth 
3 cattleya.pumila red  4 sharp 
> 
> # Summarize by genus 
> dat.tidy %>% 
+ group_by(Genus=gsub("(.*)\\..*","\\1",Species)) %>% # Collapse to genus (remove species designation) 
+ summarise_all(funs(paste(unique(.), collapse=", "))) # For each charactreristic, paste together each unique value for a given genus 
# A tibble: 1 x 5 
    Genus           Species color leaves   surface 
    <chr>           <chr> <chr> <chr>   <chr> 
1 cattleya cattleya.minor, cattleya.maxima, cattleya.pumila red 1, 3, 4 sharp, smooth 
> select(-Species) 
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) : 
    objeto 'Species' não encontrado (my free translation: object 'Species' not found) 
> 
+0

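That's because, while editing my answer, I accidentally deleted the '%>%' at the end of the line before 'select(-Species)'. Sorry about that. I've fixed it now. Without the '%>%' at the end of the previous line, R treats 'select(-Species)' as a separate statement, which causes the error. 'select(-Species)' just removes the 'Species' column, so if you want to keep the 'Species' column in the summarised output, you can delete that line. – eipi10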
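
To illustrate the point about the pipe, here is a small sketch of the complete chain. It assumes a recent dplyr (1.0 or later), where `across()` is available as a replacement for the older `summarise_all(funs(...))` form, and it rebuilds `dat.tidy` by hand rather than through `melt()` and `dcast()`:

```r
library(dplyr)

# Tidy example data, rebuilt by hand to match dat.tidy (values kept as text)
dat.tidy <- data.frame(
  Species = c("cattleya.minor", "cattleya.maxima", "cattleya.pumila"),
  colour  = c("red", "red", "red"),
  leaves  = c("1", "3", "4"),
  surface = c("sharp", "smooth", "sharp"),
  stringsAsFactors = FALSE
)

by_genus <- dat.tidy %>%
  group_by(Genus = sub("\\..*$", "", Species)) %>%  # drop the species part of the name
  summarise(across(everything(),
                   ~ paste(unique(.x), collapse = ", "))) %>%  # this trailing %>% keeps the chain going
  select(-Species)  # receives the summarised table from the pipe

by_genus
```

Each line except the last ends in `%>%`, so R reads the whole block as one statement. If the `%>%` before `select(-Species)` is missing, the statement ends at `summarise()`, and `select(-Species)` is evaluated on its own with no data frame to work on.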

+0

Fantastic solution! Thank you very much. –