2016-07-14 61 views

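Extract dates from a data frame and plot a time series with R

I want to plot a time series showing the number of log entries per hour. As a first step, I tried to split the date out of each log line in the dataframe so that I can count the log entries per hour.

I have the following dataframe: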

[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk 
[Fri Jun 1 16:29:29 1995] httpd: send aborted for ansc86024.usask.ca 
[Fri Jun 1 16:31:42 1995] httpd: send aborted for 194.20.24.70 
[Fri Jun 1 16:34:11 1995] httpd: send aborted for sw24-70.iol.it 
[Fri Jun 1 16:41:02 1995] httpd: send aborted for educ026.usask.ca 
[Fri Jun 1 16:41:13 1995] httpd: send aborted for educ026.usask.ca 
[Fri Jun 1 16:41:13 1995] httpd: send aborted for sw24-70.iol.it 
[Fri Jun 1 16:45:07 1995] httpd: send aborted for 128.233.18.38 
[Fri Jun 1 17:26:50 1995] httpd: send aborted for pc117c.nwrel.org 
[Fri Jun 1 17:46:53 1995] httpd: send aborted for geoff.usask.ca 
[Fri Jun 2 17:57:09 1995] httpd: send aborted for piweba3y.prodigy.com 
[Fri Jun 2 17:57:50 1995] httpd: send aborted for piweba3y.prodigy.com 
[Fri Jun 2 18:10:15 1995] httpd: send aborted for 193.74.92.109 
[Fri Jun 2 20:14:30 1995] httpd: send aborted for 128.233.13.41 
[Fri Jun 2 20:15:59 1995] httpd: send aborted for peter.net4.io.org 
[Fri Jun 2 21:11:54 1995] httpd: send aborted for ped374.usask.ca 

I want to plot the number of log entries per hour.

I tried to add the date using the gsub function:

df$date <- gsub(".+[(.*)]","",df[0]) 

So you want to extract the date and then group by hour? Please show us the code you have already tried. – eipi10


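@eipi10 Exactly, I want to extract the date and group by hour, but I don't have a dedicated 'date' column. I need to extract the date from each line and convert it to a timestamp or some other format. –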


@eipi10 I tried to extract the date using the following regular expression: 'df$date <- gsub(".+[(.*)]","",df[0])' –

Answer

How about:

# Data in form of a string vector 
dat = c("[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk", 
     "[Fri Jun 1 16:29:29 1995] httpd: send aborted for ansc86024.usask.ca", 
     "[Fri Jun 1 16:31:42 1995] httpd: send aborted for 194.20.24.70", 
     "[Fri Jun 1 16:34:11 1995] httpd: send aborted for sw24-70.iol.it", 
     "[Fri Jun 1 16:41:02 1995] httpd: send aborted for educ026.usask.ca", 
     "[Fri Jun 1 16:41:13 1995] httpd: send aborted for educ026.usask.ca", 
     "[Fri Jun 1 16:41:13 1995] httpd: send aborted for sw24-70.iol.it", 
     "[Fri Jun 1 16:45:07 1995] httpd: send aborted for 128.233.18.38", 
     "[Fri Jun 1 17:26:50 1995] httpd: send aborted for pc117c.nwrel.org", 
     "[Fri Jun 1 17:46:53 1995] httpd: send aborted for geoff.usask.ca", 
     "[Fri Jun 2 17:57:09 1995] httpd: send aborted for piweba3y.prodigy.com", 
     "[Fri Jun 2 17:57:50 1995] httpd: send aborted for piweba3y.prodigy.com", 
     "[Fri Jun 2 18:10:15 1995] httpd: send aborted for 193.74.92.109", 
     "[Fri Jun 2 20:14:30 1995] httpd: send aborted for 128.233.13.41", 
     "[Fri Jun 2 20:15:59 1995] httpd: send aborted for peter.net4.io.org", 
     "[Fri Jun 2 21:11:54 1995] httpd: send aborted for ped374.usask.ca") 

library(dplyr)
library(lubridate)
library(ggplot2)  # needed for the ggplot() call further down

Extract the date string:

dat = data.frame(date.string = gsub(".{5}(.*)\\].*", "\\1", dat)) 
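
As a side note on the regular expression: in the attempt from the question, the unescaped `[` starts a character class rather than matching a literal bracket. The following is a minimal base R sketch of an alternative pattern that escapes the brackets and then drops the weekday; the variable names `line`, `stamp`, and `date_string` are made up for illustration.

```r
# One log line taken from the question
line <- "[Fri Jun 1 15:56:37 1995] httpd: send aborted for disarray.demon.co.uk"

# Escape the brackets so they match literally and capture the text between them
stamp <- sub("^\\[([^]]*)\\].*$", "\\1", line)

# Drop the leading weekday ("Fri ") so only month, day, time, and year remain
date_string <- sub("^\\S+\\s+", "", stamp)

print(stamp)
print(date_string)
```

Anchoring on the brackets instead of skipping a fixed five characters should also tolerate weekday names of a different length, if the log ever contains them.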

Convert the date string to POSIXct date-time format:

dat$date = as.POSIXct(dat$date.string, format= "%b %e %H:%M:%S %Y") 
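
Before converting a whole column, it can help to test the format on a couple of strings. The sketch below assumes an English locale (the `%b` month abbreviation is locale-dependent) and sets `tz = "UTC"` explicitly, which the answer does not do. The `nchar()` check is only a sanity check that the extraction step really produced short timestamps rather than whole log lines; `stamps` and `parsed` are illustrative names.

```r
# Two extracted timestamps in the same layout as dat$date.string
stamps <- c("Jun 1 15:56:37 1995", "Jun 2 21:11:54 1995")

# Sanity check: extracted timestamps should be short strings
print(nchar(stamps))

# Parse with an explicit format and time zone
parsed <- as.POSIXct(stamps, format = "%b %d %H:%M:%S %Y", tz = "UTC")

# Any NA here means the format string did not match the input
print(anyNA(parsed))
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```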

Now summarize by hour. We throw away the minutes and seconds so that we can group by date and hour only and get a count for each hour:

datByHour = dat %>% 
    mutate(date = as.POSIXct(paste0(paste(year(date),month(date),day(date),sep="-"), 
            " ", 
            paste(hour(date),"00:00", sep=":")))) %>% 
    group_by(date) %>% 
    tally 

datByHour 
    date  n 
1 1995-06-01 15:00:00  1 
2 1995-06-01 16:00:00  7 
3 1995-06-01 17:00:00  2 
4 1995-06-02 17:00:00  2 
5 1995-06-02 18:00:00  1 
6 1995-06-02 20:00:00  2 
7 1995-06-02 21:00:00  1 
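
The same hourly truncation can also be sketched in base R, without rebuilding the timestamp from pasted strings. This is an alternative to the dplyr pipeline above, not the answer's method: `trunc()` with `units = "hours"` zeroes the minutes and seconds, and `table()` does the counting. The names `times`, `hours`, and `counts`, and the use of UTC, are assumptions for illustration.

```r
# Three parsed timestamps from the sample data
times <- as.POSIXct(c("1995-06-01 15:56:37",
                      "1995-06-01 16:29:29",
                      "1995-06-01 16:31:42"), tz = "UTC")

# Truncate to the start of each hour (minutes and seconds become zero)
hours <- as.POSIXct(trunc(times, units = "hours"))

# Count rows per hour; format() gives stable labels for the table
counts <- as.data.frame(table(date = format(hours, "%Y-%m-%d %H:%M:%S")),
                        responseName = "n")
print(counts)
```

Note that `counts$date` is a factor of labels here, so it would need converting back with `as.POSIXct()` before being passed to a date-time axis.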

Plot the hourly counts:

ggplot(datByHour, aes(date, n)) + 
    geom_line(aes(group=1)) + 
    scale_x_datetime(date_labels="%b %e, %Y: %H") 

I get the following error message: '> df$date = as.POSIXct(df$date.string, format="%b %e %H:%M:%S %Y") Error in strptime(x, format, tz = tz) : input string is too long' –


I haven't seen that error before. Not sure. – eipi10


Great, it works fine –