2016-07-16

How can I build a weekly profile using R?

I have a travel transaction data set like this (about 560,000 trips), data frame 1:

ID  START TIME   DATE   ORIGIN DESTINATION  DAY 
1005   9.10   2012-01-02   A  B   Monday 
1005   18.15   2012-01-02   B  A   Monday 
1005   9.05   2012-01-08   A  B   Sunday 
1005   17.05   2012-01-08   B  A   Sunday 
1010   8.00   2012-01-09   A  C   Monday 
1010   12.00   2012-01-09   C  A   Monday 
1013   13.15   2012-01-10   D  E   Tuesday 
1013   15.30   2012-01-10   E  G   Tuesday 
1013   9.06   2012-01-12   D  E   Thursday 
...   ...   2012-..-..   .  .   ... 

and an ID index like this (about 1,986 IDs), data frame 2:

ID 
1005 
1010 
1013 
1015 
1030 
1034 
1036 
1031 
1040 
... 

I want to create a weekly travel profile based on these two data frames. I am not sure whether this is right, but I tried this code:

weekday = c("Sunday", "Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
br = seq(0,23,by=1)
ranges = paste(head(br,-1), br[-1], sep="_")

for (i in dataframe2$ID) {
  for (n in weekday) {
    x = filter(dataframe1, dataframe1$ID %in% i & dataframe1$DAY %in% n)
    freq = hist(as.numeric(x), br, include.lowest=TRUE, plot=FALSE)
    df = as.data.frame(t(data.frame(frequency = freq$counts)))
    df$i = i
    df$n = n
    colnames(df) = c(as.character(ranges),"ID","Day")
    write.table(head(df),file="testdata1.csv", append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
  }
}

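I would like to end up with a CSV table containing each ID's weekly trip frequencies. I would also like to ask whether there is an easier way to do this task.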

ID  0_1 1_2 2_3 3_4 4_5 5_6 6_7 7_8 8_9 9_10 10_11 11_12 12_13 13_14 14_15 15_16 16_17 17_18 18_19 19_20 20_21 21_22 22_23 Day 
1005 0 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  1  0  0  0  0  0 Sunday 
1005 0 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  1  0  0  0  0  0 Monday 
1005 0 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0 Tuesday 
1005                               Wednesday 
1005                               Thursday 
1005                               Friday 
1005                              Saturday 
1010                               Sunday 
1010 
1010 
1010 
1010 
1010 
1010 
and so on to the end.

I would like to make a chart like this: [image]

It would be better if you shared your data with `dput` and explained how the data are summarized in your chart.

Answer

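This can be done in base R using the `xtabs` function, but it may be a bit clearer with the dplyr and tidyr packages. In this approach, `weekday` is created as an R factor variable. The dplyr function `mutate` is then used to convert `DAY` to a factor and to truncate `START_TIME` to a whole hour. Next, `complete` from the tidyr package creates a new, expanded data frame with one row for every combination of `ID`, `DAY`, and `START_TIME` over their full ranges of values (that is, a row for each ID, each start hour in 0:23, and each day of the week). Where trips exist, the values of `DATE`, `ORIGIN`, and `DESTINATION` are kept; otherwise those columns contain `NA`. For each `ID`, `DAY`, and `START_TIME`, the number of trips is computed as the number of rows with a non-`NA` value of `DATE` and stored in `Freq`. The `spread` function from tidyr then turns each distinct value of `START_TIME` into a separate column holding the corresponding `Freq`. Finally, appropriate column names are assigned, the columns are arranged in the requested order, and the data frame is written to a file as CSV.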

library(dplyr)
library(tidyr)
#
# input data is in df
# convert column name START TIME to the syntactically valid name START_TIME
#
colnames(df)[2] <- "START_TIME"
#
# define weekday as a factor with the days of the week
#
weekday <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weekday <- factor(weekday, levels = weekday)
#
# count trips by ID, DAY, and START_TIME (truncated to the hour)
#
trip_freq <- df %>%
  mutate(DAY = factor(DAY, levels = levels(weekday)),
         START_TIME = floor(START_TIME)) %>%
  complete(ID, DAY = weekday, START_TIME = 0:23) %>%
  group_by(ID, DAY, START_TIME) %>%
  summarise(Freq = sum(!is.na(DATE)))
trip_freq_tbl <- trip_freq %>% spread(key = START_TIME, value = Freq)
#
# name and re-arrange columns so that Day comes last
#
colnames(trip_freq_tbl) <- c("ID", "Day", paste(0:23, 1:24, sep = "_"))
trip_freq_tbl <- cbind(trip_freq_tbl[, -2], Day = trip_freq_tbl[, "Day"])
#
# write trip_freq_tbl as a csv file
#
write.table(trip_freq_tbl, file = "testdata1.csv", sep = ",", row.names = FALSE)
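
The base R route with `xtabs` could look roughly like the sketch below. It is only an illustration under stated assumptions, not a tested replacement for the code above: the sample `df` holds just the nine rows shown in the question, and the `HOUR` column and the names `counts`, `long`, and `wide` are introduced here for the example.

```r
# Sample rows copied from the question's data frame 1 (DATE, ORIGIN and
# DESTINATION are omitted because the counts do not need them)
df <- data.frame(
  ID         = c(1005, 1005, 1005, 1005, 1010, 1010, 1013, 1013, 1013),
  START_TIME = c(9.10, 18.15, 9.05, 17.05, 8.00, 12.00, 13.15, 15.30, 9.06),
  DAY        = c("Monday", "Monday", "Sunday", "Sunday", "Monday", "Monday",
                 "Tuesday", "Tuesday", "Thursday"),
  stringsAsFactors = FALSE
)

weekday <- c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")

# Factors with explicit levels make xtabs() keep days and hours with no trips
df$ID   <- factor(df$ID)
df$DAY  <- factor(df$DAY, levels = weekday)
df$HOUR <- factor(floor(df$START_TIME), levels = 0:23)  # hypothetical helper column

# Three-way contingency table of trip counts: ID x DAY x HOUR
counts <- xtabs(~ ID + DAY + HOUR, data = df)

# Long form (ID, DAY, HOUR, Freq), then one column per start hour
long <- as.data.frame(counts)
wide <- reshape(long, idvar = c("ID", "DAY"), timevar = "HOUR",
                direction = "wide")
names(wide)[-(1:2)] <- paste(0:23, 1:24, sep = "_")
wide <- wide[order(wide$ID, wide$DAY), ]
```

Here `wide` has one row per ID and day, with `ID` and `DAY` first and the 24 hourly counts after them; it can be written out with `write.table` in the same way as `trip_freq_tbl`.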

The `trip_freq` data frame can be summarized further for the chart with:

#
# summarize the data for the plot
#
trip_freq_plot <- trip_freq %>%
  group_by(DAY, START_TIME) %>%
  summarize(Cnt = sum(Freq))
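
The chart from the question is not reproduced here, so its exact form is unknown. As one possibility, the totals per day and start hour (the same quantity as `Cnt`) could be drawn as one line per weekday. The sketch below uses base R only and the nine sample rows from the question; `HOUR`, `cnt`, and `m` are names introduced for illustration, and the line-chart layout is an assumption.

```r
# Sample rows copied from the question's data frame 1
df <- data.frame(
  START_TIME = c(9.10, 18.15, 9.05, 17.05, 8.00, 12.00, 13.15, 15.30, 9.06),
  DAY        = c("Monday", "Monday", "Sunday", "Sunday", "Monday", "Monday",
                 "Tuesday", "Tuesday", "Thursday"),
  stringsAsFactors = FALSE
)

weekday <- c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
df$DAY  <- factor(df$DAY, levels = weekday)
df$HOUR <- factor(floor(df$START_TIME), levels = 0:23)  # hypothetical helper column

# Trips per day and start hour, summed over all IDs
cnt <- xtabs(~ DAY + HOUR, data = df)

# Plain 24 x 7 matrix: one row per start hour, one column per weekday
m <- unclass(t(cnt))

# One line per weekday (assumed chart type)
matplot(0:23, m, type = "l", lty = 1, col = seq_along(weekday),
        xlab = "Start hour", ylab = "Number of trips")
legend("topright", legend = weekday, col = seq_along(weekday), lty = 1)
```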