2017-03-17

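Discretizing floating-point coordinates to the nearest coordinates in R

I am trying to snap coordinates to the nearest coordinates on a fixed grid. In a sense, I am doing a single iteration of k-means clustering with 1222 centroids. Below is a function that does this, but it is imperfect and too slow. I am looking for help improving it: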

discretizeCourt <- function(x_loc, y_loc) { 

    # create the data frame of grid points that I want to round coordinates to 
    y <- seq(0, 50, by = 2) 
    x1 <- seq(1, 93, by = 2) 
    x2 <- seq(2, 94, by = 2) 
    x <- c(x1, x2) 

    coordinates <- data.frame(
        x = rep(x, 13), 
        y = rep(y, each = length(x1)), 
        count = 0 
    ) 

    # loop over each point in x_loc and y_loc 
    # increment the count column whenever a point is 'near' that grid point 
    for (i in seq_along(x_loc)) { 
        this_x <- x_loc[i] 
        this_y <- y_loc[i] 

        # grid points inside the open 2 x 2 box centred on this point 
        near <- coordinates$x > this_x - 1 & 
                coordinates$x < this_x + 1 & 
                coordinates$y > this_y - 1 & 
                coordinates$y < this_y + 1 

        coordinates$count[near] <- coordinates$count[near] + 1 
    } 

    # return the grid together with its counts 
    coordinates 
} 

Here is some test data I am working with:

> dput(head(x_loc, n = 50)) 
c(13.57165, 13.61702, 13.66478, 13.70833, 13.75272, 13.7946, 
13.83851, 13.86792, 13.8973, 13.93906, 13.98099, 14.02396, 14.06338, 
14.10872, 14.15412, 14.2015, 14.26116, 14.30871, 14.35056, 14.39536, 
14.43964, 14.48442, 14.5324, 14.57675, 14.62267, 14.66972, 14.71443, 
14.75383, 14.79012, 14.82455, 14.85587, 14.87557, 14.90737, 14.9446, 
14.97763, 15.01079, 15.04086, 15.06752, 15.09516, 15.12394, 15.15191, 
15.18061, 15.20413, 15.22896, 15.25411, 15.28108, 15.3077, 15.33578, 
15.36507, 15.39272) 

> dput(head(y_loc, n = 50)) 
c(25.18298, 25.17431, 25.17784, 25.18865, 25.20188, 25.22865, 
25.26254, 25.22778, 25.20162, 25.25191, 25.3044, 25.35787, 25.40347, 
25.46049, 25.5199, 25.57132, 25.6773, 25.69842, 25.73877, 25.78383, 
25.82168, 25.86067, 25.89984, 25.93067, 25.96943, 26.01083, 26.05861, 
26.11965, 26.18428, 26.25347, 26.3352, 26.35756, 26.4682, 26.55412, 
26.63745, 26.72157, 26.80021, 26.8691, 26.93522, 26.98879, 27.03783, 
27.07818, 27.03786, 26.9909, 26.93697, 26.87916, 26.81606, 26.74908, 
26.67815, 26.60898) 

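My actual x_loc and y_loc each hold ~60,000 coordinates, and I have thousands of files with ~60,000 coordinates each, so this is a lot of work. I am fairly sure the function is slow because of the way I am indexing and incrementing.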

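The counting is not perfect either. A technically better approach would be to loop over all 60,000 points (only the 50 points above in this example) and, for each point, compute the distance between it and every point in the coordinates data frame (1222 points). However, that is 60000 * 1222 distance calculations for just this one set of points, which is too many.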
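For reference, the distance-based assignment described above could be sketched as follows. This is only an illustration, not a tuned solution: the helper name nearest_grid_point and its chunk_size argument are made up here, the points are processed in chunks so the full 60000 x 1222 distance matrix is never held in memory at once, and the inputs are assumed to contain no NA values.

```r
# Grid of 1222 points, built the same way as in discretizeCourt()
y <- seq(0, 50, by = 2)
x <- c(seq(1, 93, by = 2), seq(2, 94, by = 2))
coordinates <- data.frame(x = rep(x, 13), y = rep(y, each = 47))

# Hypothetical helper: row index of the nearest grid point for each (x, y) pair.
# Assumes x_loc and y_loc have equal length and contain no NA values.
nearest_grid_point <- function(x_loc, y_loc, coordinates, chunk_size = 2000L) {
  n <- length(x_loc)
  nearest <- integer(n)
  if (n == 0L) return(nearest)
  for (start in seq(1L, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, n)
    # squared Euclidean distance from every point in the chunk to every grid point
    d2 <- outer(x_loc[idx], coordinates$x, "-")^2 +
          outer(y_loc[idx], coordinates$y, "-")^2
    # column (grid row) with the smallest distance, one per point
    nearest[idx] <- apply(d2, 1L, which.min)
  }
  nearest
}

# First and last of the 50 sample points shown above
nearest <- nearest_grid_point(c(13.57165, 15.39272),
                              c(25.18298, 26.60898),
                              coordinates)
coordinates[nearest, c("x", "y")]

# Counts per grid point in a single pass
count <- tabulate(nearest, nbins = nrow(coordinates))
```

Squared distances are enough here because only the ordering matters, so no sqrt() call is needed.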

Any help would be greatly appreciated! Thanks,

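EDIT: I am currently converting my data frames/vectors into 2 matrices and vectorizing the whole approach. I will let you know whether it works.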

Answers

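If you want something faster than your current solution for this kind of tabular data, consider using the data.table library. See the example below: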

library(data.table) 

df <- data.table(x_loc, y_loc) # Your two vectors are turned into a data.table 
df[, row.idx := .I] # This column is used as an ID for each sample point. 

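Now we can find the matching grid coordinate for each point. Later we can count how many points fall on each coordinate. We start by keeping the coordinates data frame: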

y <- seq(0, 50, by = 2) 
x1 <- seq(1, 93, by = 2) 
x2 <- seq(2, 94, by = 2) 
x <- c(x1, x2) 

coordinates <- data.frame(
    x = rep(x, 13), 
    y = rep(y, each = length(x1)), 
    count = 0 
) 
coordinates$row <- 1:nrow(coordinates) # Similar to yours. However, this time we are interested in seeing which points belong to this coordinate. 

Now we define a function that checks the coordinates and returns the one that lies within one unit of the point in question.

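f <- function(this_x, this_y, coordinates) { 
    # grid rows inside the open 2 x 2 box centred on (this_x, this_y) 
    res <- coordinates[coordinates$x > this_x-1 & 
          coordinates$x < this_x+1 & 
          coordinates$y > this_y-1 & 
          coordinates$y < this_y+1, ]$row 
    # res[1] is NA when nothing matches, so a point with no match still gets a value 
    res[1] 
} 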

For each point, we find its matching coordinate:

# f() needs the grid as its third argument; each group holds exactly one point 
df[, coordinate.idx := f(x_loc, y_loc, coordinates), by = row.idx] 
df[, row.idx := NULL] 
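Because the grid is regular, the same lookup can probably be done arithmetically, without calling a function once per point. The sketch below is an assumption-laden illustration rather than part of the answer above: the helper name snap_to_grid is made up, and it relies on the exact layout of the coordinates data frame (26 rows of 47 points at y = 0, 2, ..., 50, with odd x on even-numbered rows and even x on odd-numbered rows). It also differs on exact box boundaries, where the box test finds no match and this version picks the upper neighbour.

```r
# Hypothetical helper: grid-row index for each point, computed arithmetically.
# Returns NA for points that fall outside the grid.
snap_to_grid <- function(x_loc, y_loc) {
  j <- floor(y_loc / 2 + 0.5)             # 0-based number of the nearest grid row
  offset <- ifelse(j %% 2 == 0, 1, 2)     # first x value on that row (1 or 2)
  k <- floor((x_loc - offset) / 2 + 0.5)  # 0-based position along the row
  inside <- j >= 0 & j <= 25 & k >= 0 & k <= 46
  as.integer(ifelse(inside, j * 47 + k + 1, NA))
}

# First and last of the 50 sample points from the question
snap_to_grid(c(13.57165, 15.39272), c(25.18298, 26.60898))
```

With the data.table from the answer, the result could be stored with df[, coordinate.idx := snap_to_grid(x_loc, y_loc)], which needs no row.idx column and no grouping.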

df now contains the following variables: (x_loc, y_loc, coordinate.idx). You can use it to fill coordinates$count. Even with 60,000 points, this takes no more than 1 second.

for (i in seq_len(nrow(coordinates))) { 
    # number of points assigned to grid row i; which() drops NA comparisons 
    coordinates$count[i] <- length(which(df$coordinate.idx == i)) 
}
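
The loop above scans df$coordinate.idx once for every grid row. As a possible alternative, base R's tabulate() counts all indices in a single pass. In the sketch below, idx and n_grid are made-up stand-ins for df$coordinate.idx and nrow(coordinates):

```r
# Stand-in for df$coordinate.idx: one grid-row index per point, NA where nothing matched
idx <- c(618L, 618L, 48L, NA, 1L, 618L)
n_grid <- 1222L  # stand-in for nrow(coordinates)

# tabulate() counts how often each of 1..nbins occurs and ignores NA
count <- tabulate(idx, nbins = n_grid)

# Counts for grid rows 1, 48 and 618
count[c(1, 48, 618)]
```

With the real objects this would be coordinates$count <- tabulate(df$coordinate.idx, nbins = nrow(coordinates)).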