2017-04-14
1

Add a column with the corresponding season to a data frame

Below is a sample of my data frame. I am working in R.

date   name  count 
2016-11-12 Joe   5 
2016-11-15 Bob   5 
2016-06-15 Nick  12 
2016-10-16 Cate  6 

I want to add a column to my data frame that tells me the season corresponding to each date. I would like it to look like this:

date   name  count  Season 
2016-11-12 Joe   5   Winter 
2016-11-15 Bob   5   Winter 
2017-06-15 Nick  12   Summer 
2017-10-16 Cate  6   Fall 

I have started on some code:

startWinter <- c(month.name[1], month.name[12], month.name[11]) 
startSummer <- c(month.name[5], month.name[6], month.name[7]) 
startSpring <- c(month.name[2], month.name[3], month.name[4]) 

# create a function to find the correct season based on the month 
MonthSeason <- function(Month) { 
    # !is.na() ignores values with NA
    # match() returns a vector of the positions of matches
    # If the starting month matches a spring season, print "Spring". If the starting month matches a summer season, print "Summer" etc.
    ifelse(!is.na(match(Month, startSpring)), 
     return("spring"), 
     return(ifelse(!is.na(match(Month, startWinter)), 
         "winter", 
         ifelse(!is.na(match(Month, startSummer)), 
           "summer","fall")))) 
} 

This code gives me the season for a single month. I don't know whether I am approaching this the right way. Can anyone help me? Thanks!

Answers

2

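There are a couple of hacks, and their usability depends on whether you want to use meteorological or astronomical seasons. I'll offer both; I think they provide enough flexibility.

I'm going to use the second data set you provided, since it contains more than just "Winter".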

txt <- "date   name  count 
2016-11-12 Joe   5 
2016-11-15 Bob   5 
2017-06-15 Nick  12 
2017-10-16 Cate  6" 
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE) 
dat$date <- as.Date(dat$date) 

Meteorological seasons. When seasons are defined strictly by month, the quickest method works well.

metseasons <- c(
    "01" = "Winter", "02" = "Winter", 
    "03" = "Spring", "04" = "Spring", "05" = "Spring", 
    "06" = "Summer", "07" = "Summer", "08" = "Summer", 
    "09" = "Fall", "10" = "Fall", "11" = "Fall", 
    "12" = "Winter" 
) 
metseasons[format(dat$date, "%m")] 
#       11       11       06       10 
#   "Fall"   "Fall" "Summer"   "Fall" 

Astronomical seasons. If your seasons are defined by date ranges that do not start and stop on month boundaries, here is another "hack":

astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232")) 
astroseasons_labels <- c("Winter", "Spring", "Summer", "Fall", "Winter") 

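If you use proper Date or POSIX types, then you are including the year, which makes things a little less general. One might consider using Julian dates (day of year), but those produce anomalies in leap years. So, on the assumption that Feb 28 is not a seasonal boundary, I am "numericizing" the month and day. Even though R does character comparisons just fine, cut needs numbers, so we convert them to integers.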
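To make the leap-year point concrete, here is a small sketch comparing day-of-year with the month-day number for the same calendar day in two different years. The variable names are only for illustration.

```r
# Day of year ("%j") for March 20 in a leap year and in a common year
doy_leap   <- as.integer(format(as.Date("2016-03-20"), "%j"))
doy_common <- as.integer(format(as.Date("2017-03-20"), "%j"))
c(doy_leap, doy_common)   # the two values differ by one

# The month-day number ("%m%d") is the same in both years
md_leap   <- as.integer(format(as.Date("2016-03-20"), "%m%d"))
md_common <- as.integer(format(as.Date("2017-03-20"), "%m%d"))
c(md_leap, md_common)
```

Because the month-day number does not shift after February in leap years, one set of break points can be reused for every year.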

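Two safeguards: because cut is either right-open (left-closed) or right-closed (left-open), our two book-ends need to extend beyond the legal dates, ergo "0000" and "1232". There are other techniques that could work equally well here (e.g., using -Inf and Inf after converting to integers).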

astroseasons_labels[ cut(as.integer(format(dat$date, "%m%d")), astroseasons, labels = FALSE) ] 
# [1] "Fall" "Fall" "Spring" "Fall" 
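For reference, the -Inf/Inf variant could look roughly like this. It is only a sketch: it assumes the same boundary days as the breaks used here (Mar 20, Jun 20, Sep 22, Dec 21), and the names breaks_inf, labels_inf and season_inf are made up for the example.

```r
# Same boundaries as before, but with infinite book-ends instead of 0 and 1232
breaks_inf <- c(-Inf, 320, 620, 922, 1221, Inf)
labels_inf <- c("Winter", "Spring", "Summer", "Fall", "Winter")

dates <- as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16"))
md <- as.integer(format(dates, "%m%d"))   # e.g. 1112, 615

# cut() defaults to right-closed intervals, so 320 (Mar 20) still counts as Winter
season_inf <- labels_inf[cut(md, breaks_inf, labels = FALSE)]
season_inf
```

With infinite book-ends there is no need to pick artificial values such as 0 or 1232, at the cost of the breaks no longer being a plain integer vector.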

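Note that the third date falls in Spring when using astronomical seasons and in Summer otherwise.

This solution can easily be adjusted to account for the Southern Hemisphere or other seasonal preferences/beliefs.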
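As one possible adjustment, a Southern Hemisphere version of the month lookup might simply swap the labels. This is a sketch under the assumption that the month boundaries stay the same and only the seasons flip; metseasons_south is a hypothetical name.

```r
# Hypothetical lookup: same month groups, opposite seasons
metseasons_south <- c(
  "Summer", "Summer",
  "Fall", "Fall", "Fall",
  "Winter", "Winter", "Winter",
  "Spring", "Spring", "Spring",
  "Summer"
)
# Name the entries "01" .. "12" so they can be indexed by format(x, "%m")
names(metseasons_south) <- sprintf("%02d", 1:12)

dates <- as.Date(c("2016-11-12", "2017-06-15", "2017-01-03"))
south <- metseasons_south[format(dates, "%m")]
south
```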

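Edit: prompted by @Kristofersen's answer (thanks), I looked into benchmarks. lubridate::month uses a POSIXct-to-POSIXlt conversion to extract the month, which can be more than 10x faster than my format(x, "%m") method. So: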

metseasons2 <- c(
    "Winter", "Winter", 
    "Spring", "Spring", "Spring", 
    "Summer", "Summer", "Summer", 
    "Fall", "Fall", "Fall", 
    "Winter" 
) 

Noting that as.POSIXlt returns 0-based months, we add 1:

metseasons2[ 1 + as.POSIXlt(dat$date)$mon ] 
# [1] "Fall" "Fall" "Summer" "Fall" 
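To answer the original question directly, the lookup result can be assigned to a new column. The sketch below assumes the month groupings from the question (Nov-Jan Winter, Feb-Apr Spring, May-Jul Summer, Aug-Oct Fall) rather than meteorological seasons; season_by_month is a name invented for the example.

```r
dat <- data.frame(
  date  = as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16")),
  name  = c("Joe", "Bob", "Nick", "Cate"),
  count = c(5, 5, 12, 6),
  stringsAsFactors = FALSE
)

# Assumed lookup, index 1 = January, following the question's own groupings
season_by_month <- c(
  "Winter",
  "Spring", "Spring", "Spring",
  "Summer", "Summer", "Summer",
  "Fall", "Fall", "Fall",
  "Winter", "Winter"
)

# $mon is 0-based, so add 1 before indexing
dat$Season <- season_by_month[1 + as.POSIXlt(dat$date)$mon]
dat
```

Switching to a different definition of the seasons only requires changing the twelve entries of the lookup vector.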

Comparison (seasons is the function defined in @Kristofersen's answer):

library(lubridate) 
library(microbenchmark) 
set.seed(42) 
x <- Sys.Date() + sample(1e3) 
xlt <- as.POSIXlt(x) 

microbenchmark(
    metfmt = metseasons[ format(x, "%m") ], 
    metlt = metseasons2[ 1 + xlt$mon ], 
    astrofmt = astroseasons_labels[ cut(as.integer(format(x, "%m%d")), astroseasons, labels = FALSE) ], 
    astrolt = astroseasons_labels[ cut(100*(1+xlt$mon) + xlt$mday, astroseasons, labels = FALSE) ], 
    lubridate = sapply(month(x), seasons) 
) 
# Unit: microseconds
#       expr      min       lq       mean    median        uq       max neval
#     metfmt 1952.091 2135.157 2289.63943 2212.1025 2308.1945  3748.832   100
#      metlt   14.223   16.411   22.51550   20.0575   24.7980    68.924   100
#   astrofmt 2240.547 2454.245 2622.73109 2507.8520 2674.5080  3923.874   100
#    astrolt   42.303   54.702   72.98619   66.1885   89.7095   163.373   100
#  lubridate 5906.963 6473.298 7018.11535 6783.2700 7508.0565 11474.050   100

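So the methods that use as.POSIXlt(...)$mon are significantly faster. (@Kristofersen's answer could be improved by vectorizing it, perhaps with ifelse, but it still would not compare to the speed of the vector lookups, with or without cut.)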

1

If your data is in df:

# create dataframe for month and corresponding season 
dfSeason <- data.frame(season = c(rep("Winter", 3), rep("Summer", 3), 
                                  rep("Spring", 3), rep("Fall", 3)), 
                       month = month.name[c(11, 12, 1, 5:7, 2:4, 8:10)], 
                       stringsAsFactors = FALSE) 

# make date as date 
df$data <- as.Date(df$date) 

# match the month of the date in df (format %B) with month in season 
# then use it to index the season of dfSeason 
df$season <- dfSeason$season[match(format(df$data, "%B"), dfSeason$month)] 

Thanks. When I try this, the season column is all NA. Do you know why that might be? – Amanda

My bad, the 'dfSeason' in 'match' should be 'dfSeason$month' – din
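Another possible cause of NA values with this approach is the locale: format(x, "%B") returns month names in the current locale, while month.name is always English, so the match can fail on a non-English system. A locale-independent sketch, using the month index instead of the formatted name (month_en and season are names chosen for the example):

```r
dfSeason <- data.frame(season = c(rep("Winter", 3), rep("Summer", 3),
                                  rep("Spring", 3), rep("Fall", 3)),
                       month = month.name[c(11, 12, 1, 5:7, 2:4, 8:10)],
                       stringsAsFactors = FALSE)

dates <- as.Date(c("2016-11-12", "2016-11-15", "2016-06-15", "2016-10-16"))

# Look up the English month name by index rather than via format(x, "%B")
month_en <- month.name[as.POSIXlt(dates)$mon + 1]
season <- dfSeason$season[match(month_en, dfSeason$month)]
season
```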

1

You can use lubridate and a function that converts the month number to a season.

library(lubridate) 

seasons = function(x){ 
    if(x %in% 2:4) return("Spring") 
    if(x %in% 5:7) return("Summer") 
    if(x %in% 8:10) return("Fall") 
    if(x %in% c(11,12,1)) return("Winter") 

} 

dat$Season = sapply(month(dat$date), seasons) 

> dat 
        date name count Season
1 2016-11-12  Joe     5 Winter
2 2016-11-15  Bob     5 Winter
3 2016-06-15 Nick    12 Summer
4 2016-10-16 Cate     6   Fall
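The seasons function above handles one month at a time, which is why sapply is needed. A vectorized variant could be sketched with nested ifelse calls. The name seasons_vec is hypothetical, and base R is used to extract the month so the example runs without lubridate.

```r
# Vectorized sketch: takes a vector of month numbers (1-12)
seasons_vec <- function(m) {
  # Anything not in 2:10 (including NA) falls through to "Winter"
  ifelse(m %in% 2:4, "Spring",
         ifelse(m %in% 5:7, "Summer",
                ifelse(m %in% 8:10, "Fall", "Winter")))
}

dates <- as.Date(c("2016-11-12", "2016-11-15", "2016-06-15", "2016-10-16"))
months <- as.POSIXlt(dates)$mon + 1   # 0-based months, so add 1
result <- seasons_vec(months)
result
```

With lubridate loaded, seasons_vec(month(dat$date)) should work the same way, without the sapply call.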