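Count the number of times a value appears in either of two columns

I have a large dataset of about 32 million rows. It holds information about telephone numbers and the origin and destination of each call.

For each telephone number, I want to count the number of times it appears either as origin or as destination.

An example data table is as follows: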

library(data.table) 
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1)) 

   Tel Origin Destination
1:   1      1           3
2:   2      2           4
3:   3      3           5
4:   4      4           6
5:   5      5           7

I have working code, but it takes too long on my data because it involves a for loop. How can I optimize it?

Here it is:

for (i in unique(dt$Tel)){ 
    index <- (dt$Origin == i | dt$Destination == i) 
    dt[dt$Tel ==i, "N"] <- sum(index) 
}

Result:

   Tel Origin Destination N
1:   1      1           3 1
2:   2      2           4 1
3:   3      3           5 2
4:   4      4           6 2
5:   5      5           7 2

where N shows that Tel = 1 appears once, Tel = 2 appears once, and Tel = 3, 4 and 5 each appear twice.

Source

2017-02-22 Raluca Gui

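Note that it is not the `for` loop *per se* that is the problem, but how you are performing the operation. – lmo

Maybe you should consider using graph theory here, with the igraph package (telephone numbers as nodes, calls as directed edges). – Frank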
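The graph idea in the comment above could look roughly like the following. This is only a sketch on the example data, assuming the igraph package is installed; it uses `graph_from_data_frame` and `degree`, and the names `g` and `deg` are chosen here for illustration.

```r
library(data.table)
library(igraph)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Each call becomes a directed edge Origin -> Destination;
# vertex names are the telephone numbers converted to character.
g <- graph_from_data_frame(as.data.frame(dt[, .(Origin, Destination)]), directed = TRUE)

# Total degree = calls made + calls received for each number.
deg <- degree(g, mode = "all")

# Look up each Tel by vertex name; a Tel absent from the graph would give NA.
dt[, N := unname(deg[as.character(Tel)])]
```

Note that a call whose origin and destination are the same number adds 2 to that number's degree, whereas the loop in the question counts such a row once.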

We can use melt and match:

dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
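To see what the one-liner does, it can be unrolled into separate steps. The sketch below assumes each Tel appears only once in `dt` (as in the example), and the intermediate names `long`, `pos` and `counts` are introduced here for illustration. Passing `nbins` explicitly is an addition that keeps the result the same length as `dt` even when the last Tel never appears in either column.

```r
library(data.table)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Stack Origin and Destination into one long "value" column (10 rows).
long <- melt(dt, id.vars = "Tel")

# For every stacked value, find the row of dt whose Tel equals it
# (NA for numbers such as 6 and 7 that are not in Tel).
pos <- match(long$value, dt$Tel)

# Count how often each row index occurs; NAs are ignored by tabulate().
counts <- tabulate(pos, nbins = nrow(dt))

dt[, N := counts]
```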

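Another option is to loop through columns 2 and 3, use %in% to check whether each value of 'Tel' is present, then use Reduce with + to sum the logical vectors for each 'Tel', and assign (:=) the result to 'N':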

dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3] 
dt 
#    Tel Origin Destination N
# 1:   1      1           3 1
# 2:   2      2           4 1
# 3:   3      3           5 2
# 4:   4      4           6 2
# 5:   5      5           7 2
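One point that may matter on real data: `%in%` only tests whether a number is present in a column, so the Reduce version yields at most 2 per number, while the melt/match version counts every occurrence. The two agree on the example only because no number repeats within a column. A small sketch with hypothetical data (`dt2` is made up here, not taken from the question) shows the difference:

```r
library(data.table)

# Hypothetical data: number 1 makes two calls, number 3 receives two.
dt2 <- data.table(Tel = c(1, 2, 3), Origin = c(1, 1, 2), Destination = c(2, 3, 3))

# Presence per column (0 or 1 each), summed over the two columns.
presence <- dt2[, Reduce(`+`, lapply(.SD, function(x) Tel %in% x)),
                .SDcols = c("Origin", "Destination")]

# Actual number of occurrences across both columns.
occurrences <- dt2[, tabulate(match(c(Origin, Destination), Tel), nbins = .N)]

presence     # 1 2 1
occurrences  # 2 2 2
```

Here `occurrences` matches what the loop in the question would return for this data.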

Source

2017-02-22 12:57:19 akrun

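But your code produced the following results, which are incorrect. I got N = 3 and N = 4 for the first two telephone numbers, rather than 1 as it should be. But I will check again. –

@RalucaGui My code gives the expected output shown in your post. – akrun

Correct, my mistake! The code runs smoothly. Thanks! –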

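A second method constructs a temporary data.table, which is then joined to the original table. This is more verbose than @akrun's approach and probably less efficient, but it can be useful to see.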

# get temporary data.table as the sum of origin and destination frequencies 
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))), 
       c("Tel", "N")) 
# turn the variables into integers (the Tel values come from the names of the table above, and are thus character) 
temp <- temp[, lapply(temp, as.integer)]

Now, join it onto the original table:

dt <- temp[dt, on="Tel"] 
dt 
   Tel N Origin Destination
1:   1 1      1           3
2:   2 1      2           4
3:   3 2      3           5
4:   4 2      4           6
5:   5 2      5           7

To get the desired column order, you can use setcolorder:

setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
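A possible variation on the same idea is to build the counts with data.table itself and add N to `dt` by reference in an update join, which leaves the column order untouched. This is a sketch rather than part of the original answer; the name `counts` and the zero-filling step are assumptions made here.

```r
library(data.table)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Stack both columns and count occurrences of each number.
counts <- data.table(Tel = c(dt$Origin, dt$Destination))[, .N, by = Tel]

# Update join: for rows of dt matching counts on Tel, copy counts' N into dt.
dt[counts, on = "Tel", N := i.N]

# Numbers that never appear in either column get 0 instead of NA.
dt[is.na(N), N := 0L]
```

Because `:=` adds N as the last column, no call to setcolorder is needed afterwards.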

Source

2017-02-22 13:21:08 lmo

Perfect. Thanks. And it is fast too! –
