2017-07-16 388 views
1

我有一個數據集如下,查找距離最近的點在數據集中的所有點 - Python的

Id  Latitude  longitude 
1  25.42   55.47 
2  25.39   55.47 
3  24.48   54.38 
4  24.51   54.54 

我想找到每一個點的數據集的最近距離。我發現在互聯網上追隨距離函數,

from math import radians, cos, sin, asin, sqrt 
def distance(lon1, lat1, lon2, lat2): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c 
    return km 

我使用下面的函數,

shortest_distance = [] 
for i in range(1,len(data)): 
    distance1 = [] 
    for j in range(1,len(data)): 
     distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j])) 
    shortest_distance.append(min(distance1)) 

但這段代碼是爲每個條目循環兩次,並返回N^2次迭代,反過來它非常緩慢。我的數據集包含近100萬條記錄,每次遍歷所有元素兩次變得非常昂貴。

我想找到更好的方法找出每一行的最近點。任何人都可以幫助我找到解決這個問題的方法嗎?

謝謝

回答

2

查找最近的N指向一個給定的點的蠻力方法是O(N) - 你」 d必須檢查每個點。 相比之下,如果N點存儲在KD-tree中,則找到最近點的平均值爲O(log(N))。 建造KD樹還需要額外的一次性費用,這需要O(N)時間。

如果您需要重複此過程N次,那麼蠻力方法是O(N**2)而kd-tree方法是O(N*log(N))。 因此,對於足夠大的N,KD樹將擊敗暴力方法。

有關最近鄰算法(包括KD樹)的更多信息,請參閱here


下方(功能using_kdtree)是使用scipy.spatial.kdtree計算最近的鄰居的大圓arclengths的方式。

scipy.spatial.kdtree使用點之間的歐幾里德距離,但有一個formula用於將球體上的點之間的歐氏距離轉換爲大圓arclength(給定球體的半徑)。 所以我們的想法是將緯度/經度數據轉換爲笛卡爾座標,使用KDTree找到最近的鄰居,然後應用great circle distance formula來獲得所需的結果。


這裏是一些基準。使用N = 100,using_kdtreeorig(強力)方法快39倍。

In [180]: %timeit using_kdtree(data) 
100 loops, best of 3: 18.6 ms per loop 

In [181]: %timeit using_sklearn(data) 
1 loop, best of 3: 214 ms per loop 

In [179]: %timeit orig(data) 
1 loop, best of 3: 728 ms per loop 

對於N = 10000

In [5]: %timeit using_kdtree(data) 
1 loop, best of 3: 2.78 s per loop 

In [6]: %timeit using_sklearn(data) 
1 loop, best of 3: 1min 15s per loop 

In [7]: %timeit orig(data) 
# untested; too slow 

由於using_kdtreeO(N log(N))origO(N**2),通過 的因子,其using_kdtree快於orig將成長爲N,的 data長度增長。


import numpy as np 
import scipy.spatial as spatial 
import pandas as pd 
import sklearn.neighbors as neighbors 
from math import radians, cos, sin, asin, sqrt 

R = 6367 

def using_kdtree(data): 
    "Based on https://stackoverflow.com/q/43020919/190597" 
    def dist_to_arclength(chord_length): 
     """ 
     https://en.wikipedia.org/wiki/Great-circle_distance 
     Convert Euclidean chord length to great circle arc length 
     """ 
     central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
     arclength = R*central_angle 
     return arclength 

    phi = np.deg2rad(data['Latitude']) 
    theta = np.deg2rad(data['Longitude']) 
    data['x'] = R * np.cos(phi) * np.cos(theta) 
    data['y'] = R * np.cos(phi) * np.sin(theta) 
    data['z'] = R * np.sin(phi) 
    tree = spatial.KDTree(data[['x', 'y','z']]) 
    distance, index = tree.query(data[['x', 'y','z']], k=2) 
    return dist_to_arclength(distance[:, 1]) 

def orig(data): 
    def distance(lon1, lat1, lon2, lat2): 
     """ 
     Calculate the great circle distance between two points 
     on the earth (specified in decimal degrees) 
     """ 
     # convert decimal degrees to radians 
     lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 
     # haversine formula 
     dlon = lon2 - lon1 
     dlat = lat2 - lat1 
     a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2 
     c = 2 * asin(sqrt(a)) 
     km = R * c 
     return km 

    shortest_distance = [] 
    for i in range(len(data)): 
     distance1 = [] 
     for j in range(len(data)): 
      if i == j: continue 
      distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
             data['Longitude'][j], data['Latitude'][j])) 
     shortest_distance.append(min(distance1)) 
    return shortest_distance 


def using_sklearn(data): 
    """ 
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler) 
    """ 
    def distance(p1, p2): 
     """ 
     Calculate the great circle distance between two points 
     on the earth (specified in decimal degrees) 
     """ 
     lon1, lat1 = p1 
     lon2, lat2 = p2 
     # convert decimal degrees to radians 
     lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) 
     # haversine formula 
     dlon = lon2 - lon1 
     dlat = lat2 - lat1 
     a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2 
     c = 2 * np.arcsin(np.sqrt(a)) 
     km = R * c 
     return km 
    points = data[['Longitude', 'Latitude']] 
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points) 
    distances, indices = nbrs.kneighbors(points) 
    result = distances[:, 1] 
    return result 

np.random.seed(2017) 
N = 1000 
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
        'Longitude':np.random.uniform(0,360,size=N)}) 

expected = orig(data) 
for func in [using_kdtree, using_sklearn]: 
    result = func(data) 
    assert np.allclose(expected, result) 
0

您可以使用字典散列一些計算。您的代碼多次計算A到B的距離(A和B是數據集中的2個任意點)。

要麼實現自己的緩存:

from math import radians, cos, sin, asin, sqrt 

dist_cache = {} 
def distance(lon1, lat1, lon2, lat2): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 

    try: 
     return dist_cache[(lon1, lat1, lon2, lat2)] 
    except KeyError: 
     # convert decimal degrees to radians 
     lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 
     # haversine formula 
     dlon = lon2 - lon1 
     dlat = lat2 - lat1 
     a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 
     c = 2 * asin(sqrt(a)) 
     km = 6367 * c 
     dist_cache[(lon1, lat1, lon2, lat2)] = km 
     return km 

或者使用lru_cache

from math import radians, cos, sin, asin, sqrt 
from functools import lru_cache 

@lru_cache 
def distance(lon1, lat1, lon2, lat2): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c 
    return km 
+0

這是我第一次聽到這樣的事情。我如何在數據框中使用它? – haimen

+0

@haimen你是什麼意思?我已經向你展示瞭如何修改'distance'函數來使用緩存。 – DeepSpace

+0

對不起。我的壞。讓我試試看看它是如何工作的。 – haimen

3

你可以做到這一點very efficiently通過調用這個實現智能算法庫,一個例子是sklearn這有一個NearestNeighbors方法,正是這個。修改,以執行此操作代碼的

實施例:

from sklearn.neighbors import NearestNeighbors 
import numpy as np 

def distance(p1, p2): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 
    lon1, lat1 = p1 
    lon2, lat2 = p2 
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) 
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2 
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c 
    return km 

points = [[25.42, 55.47], 
      [25.39, 55.47], 
      [24.48, 54.38], 
      [24.51, 54.54]] 

nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points) 

distances, indices = nbrs.kneighbors(points) 

result = distances[:, 1] 

其給出

>>> result 
array([ 1.889697 , 1.889697 , 17.88530556, 17.88530556]) 
+0

非常感謝解決方案。它似乎比我以前使用的循環代碼稍快。如果你不介意,你能解釋我如何比循環更快嗎? – haimen