2017-10-09

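R - Adding group/condition variables to a time series

I have some biological time-series waveform data from different people, and I have been using the zoo package to store it. A toy example: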

library(zoo) 
w1 <- sin(seq(0,20,0.25)) 
w2 <- cos(seq(0,20,0.25)) 
df <- data.frame(w1,w1,w1,w2,w2,w2) 
names(df) <- paste("waves", 1:6, sep="") 
waves <- zoo(df) 

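But I also have a bunch of extra group/condition variables for each person (for example, their age, sex, and health status). So imagine that I now need to do something with the waveforms of the healthy people only.

As far as I know, neither zoo nor xts objects accept extra variables. So my plan is to maintain a lookup data frame for these extra variables. For example: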

lookup <- data.frame(index = paste("waves", 1:6, sep=""), 
        group = c("healthy", "unhealthy")) 

So now, if I need to subset the healthy people, I can do this:

select <- waves[, lookup$index[lookup$group=="healthy"]] 
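One caveat with this lookup approach: in R versions before 4.0, data.frame() converts strings to factors by default, and base R indexing treats a factor index as its integer codes rather than its labels. The toy example happens to work either way because the codes line up with the column order. A minimal sketch, assuming the same toy data, that keeps the lookup columns as character and selects columns by name (the names healthy_ids and healthy_waves are my own):

```r
library(zoo)

# Same toy data as above
w1 <- sin(seq(0, 20, 0.25))
w2 <- cos(seq(0, 20, 0.25))
df <- data.frame(w1, w1, w1, w2, w2, w2)
names(df) <- paste0("waves", 1:6)
waves <- zoo(as.matrix(df))

# Keep the lookup columns as character so they index by name, not by factor code
lookup <- data.frame(index = paste0("waves", 1:6),
                     group = c("healthy", "unhealthy"),  # recycled to 6 rows
                     stringsAsFactors = FALSE)

# Column names of the healthy entities, then the matching zoo columns
healthy_ids <- lookup$index[lookup$group == "healthy"]
healthy_waves <- waves[, healthy_ids]
```

Here healthy_waves is still a zoo object, so the usual zoo methods continue to apply to the subset.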

Is there a better approach or data structure for managing time series plus extra variables?

You could use 'data.table' here. – agstudy

Answer

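What you are looking for is a panel data structure. Panel data (also known as cross-sectional time-series data) is data that varies across time as well as across entities. In your case, the value of each wave varies over time within each entity, while group varies across entities. A simple gather followed by a join gets us to a typical panel data format.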

library(tidyr) 
library(dplyr) 
panel_df = df %>% 
    gather(index, value) %>% 
    inner_join(lookup, by = "index") %>% 
    group_by(index) %>% 
    mutate(time = 1:n()) 

#    index      value   group  time
#    <chr>      <dbl>   <chr> <int>
#  1 waves1 0.0000000 healthy     1
#  2 waves1 0.2474040 healthy     2
#  3 waves1 0.4794255 healthy     3
#  4 waves1 0.6816388 healthy     4
#  5 waves1 0.8414710 healthy     5
#  6 waves1 0.9489846 healthy     6
#  7 waves1 0.9974950 healthy     7
#  8 waves1 0.9839859 healthy     8
#  9 waves1 0.9092974 healthy     9
# 10 waves1 0.7780732 healthy    10
# # ... with 476 more rows

Here, index represents the entity dimension, and I have manually created a time variable to indicate the time dimension of the panel data.
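Once the data are in this long format, group-level operations become short. As a hedged sketch, the average waveform per group at each time point could be computed as below. The panel is rebuilt in base R so that the block runs on its own, using the modified data given at the end of this answer, and the names group_means and mean_value are my own.

```r
# Toy data: six waves, alternating healthy / unhealthy
w1 <- sin(seq(0, 20, 0.25))
w2 <- cos(seq(0, 20, 0.25))
df <- data.frame(w1, w2, w1 * 2, w2 * 0.5, w1 * w2, w2^2)
names(df) <- paste0("waves", 1:6)
lookup <- data.frame(index = paste0("waves", 1:6),
                     group = c("healthy", "unhealthy"),  # recycled to 6 rows
                     stringsAsFactors = FALSE)

# Long (panel) format: one row per entity and time point
panel_df <- data.frame(
  index = rep(names(df), each = nrow(df)),
  time  = rep(seq_len(nrow(df)), times = ncol(df)),
  value = unlist(df, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Attach the group variable to every observation
panel_df <- merge(panel_df, lookup, by = "index")

# Mean of value within each group at each time point
group_means <- aggregate(value ~ group + time, data = panel_df, FUN = mean)
names(group_means)[names(group_means) == "value"] <- "mean_value"
```

With dplyr, the same summary would be a group_by(group, time) followed by summarise(); base R is used here only to keep the sketch free of dependencies.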

To visualize the panel data, you can do something like the following with ggplot2:

library(ggplot2) 
# Visualize all waves, grouped by health status 
ggplot(panel_df, aes(x = time, y = value, group = index)) + 
    geom_line(aes(color = group)) 

# Only Healthy people 
panel_df %>% 
    filter(group == "healthy") %>% 
    ggplot(aes(x = time, y = value, color = index)) + 
    geom_line() 

# Compare healthy and unhealthy people's waves 
panel_df %>% 
    ggplot(aes(x = time, y = value, color = index)) + 
    geom_line() + 
    facet_grid(. ~ group) 

Working with the time dimension:

# plot acf for each entity `value` time series 
par(mfrow = c(3, 2)) 
by(panel_df$value, panel_df$index, function(x) acf(x)) 

library(forecast) 
panel_df %>% 
    filter(index == "waves1") %>% 
    {autoplot(acf(.$value))} 

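Finally, the plm package is excellent for working with panel data. Various panel regression models from econometrics are implemented in it, but to keep this answer from getting any longer, I will just leave some links for your own research. pdim tells you the entity and time dimensions of your panel data, and whether it is balanced: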

library(plm) 
# Check dimension of Panel 
pdim(panel_df, index = c("index", "time")) 
# Balanced Panel: n=6, T=81, N=486 

  1. What is Panel Data?
  2. Getting Started in Fixed/Random Effects Models using R
  3. Regressions with Panel Data
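As a cross-check on the pdim result above, the same dimensions can be computed by hand. A panel is balanced when every entity is observed at every time point exactly once. The following is only a sketch in base R, not how plm does it internally, and the helper is_balanced is a hypothetical name of my own:

```r
# Rebuild the index columns of the toy panel: 6 entities x 81 time points
panel_df <- expand.grid(time = 1:81,
                        index = paste0("waves", 1:6),
                        stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if every entity-time pair occurs exactly once
is_balanced <- function(d) {
  counts <- table(d$index, d$time)
  all(counts == 1)
}

n_entities <- length(unique(panel_df$index))  # n in the pdim output
n_periods  <- length(unique(panel_df$time))   # T in the pdim output
n_obs      <- nrow(panel_df)                  # N in the pdim output

balanced_full    <- is_balanced(panel_df)         # complete panel
balanced_dropped <- is_balanced(panel_df[-1, ])   # one observation removed
```

Dropping a single observation is enough to make the panel unbalanced, which is the situation pdim would flag.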

I have modified your data for a better demonstration.

Data:

library(zoo) 
w1 <- sin(seq(0,20,0.25)) 
w2 <- cos(seq(0,20,0.25)) 
w3 = w1*2 
w4 = w2*0.5 
w5 = w1*w2 
w6 = w2^2 

df <- data.frame(w1,w2,w3,w4,w5,w6, stringsAsFactors = FALSE) 
names(df) <- paste("waves", 1:6, sep="") 
waves <- zoo(df) 

lookup <- data.frame(index = paste("waves", 1:6, sep=""), 
        group = c("healthy", "unhealthy"), 
        stringsAsFactors = FALSE) 

Wow. Thanks for the super detailed and helpful answer. I really appreciate it! – Runic