2011-04-16 129 views
11

鑑於我在我的知識數據庫中的以下內容:尋找給定矢量的k-最近鄰居?

1 0 6 20 0 0 6 20 
1 0 3 6 0 0 3 6 
1 0 15 45 0 0 15 45 
1 0 17 44 0 0 17 44 
1 0 2 5 0 0 2 5 

我希望能夠找到下面的向量的最近鄰居:

1 0 5 16 0 0 5 16 

根據距離度量。所以在這種情況下,給定一個特定的閾值,我會發現列出的第一個向量是給定向量的近鄰。目前,我的知識數據庫的大小是數百萬的數量級,因此計算每個點和每個點的距離度量值,然後比較是很昂貴的。除了顯着的加速,如何實現這個目標還有其他選擇嗎?

我幾乎可以接受任何方法,包括在MySQL中使用空間索引(除了我不完全確定如何解決這個問題)或某種哈希(這將是偉大的,但我又不是完全確定)。

+0

我們在我的機器上談到了這一點學習課程,但我真的不知道說什麼不僅僅是評論;然而,我想你可能想看看[kd-trees](http://en.wikipedia.org/wiki/Kd-tree)(它推廣到[metric tress](http://en.wikipedia.org) /維基/ Metric_tree))。對不起,我不能更有幫助。 – 2011-04-16 06:35:44

+0

相關:[數百萬的3D點:如何找到最接近給定點的10個點?](http://stackoverflow.com/q/2486093),[如何找到100維空間中最接近的2個點有500,000點?](http://stackoverflow.com/q/3899097) – jfs 2011-04-16 08:07:38

回答

7

在Python(來自www.comp.mq.edu.au/):

def count_different_values(k_v1s, k_v2s): 
    """kv1s and kv2s should be dictionaries mapping keys to 
    values. count_different_values() returns the number of keys in 
    k_v1s and k_v2s that don't have the same value""" 
    ks = set(k_v1s.iterkeys()) | set(k_v2s.iterkeys()) 
    return sum(1 for k in ks if k_v1s.get(k) != k_v2s.get(k)) 


def sum_square_diffs(x0s, x1s): 
    """x1s and x2s should be equal-lengthed sequences of numbers. 
    sum_square_differences() returns the sum of the squared differences 
    of x1s and x2s.""" 
    sum((pow(x1-x2,2) for x1,x2 in zip(x1s,x2s))) 

def incr(x_c, x, inc=1): 
    """increments the value associated with key x in dictionary x_c 
    by inc, or sets it to inc if key x is not in dictionary x_c.""" 
    x_c[x] = x_c.get(x, 0) + inc 

def count_items(xs, x_c=None): 
    """returns a dictionary x_c whose keys are the items in xs, and 
    whose values are the number of times each item occurs in xs.""" 
    if x_c == None: 
     x_c = {} 
    for x in xs: 
     incr(x_c, x) 
    return x_c 

def second(xy): 
    """returns the second element in a sequence""" 
    return xy[1] 

def most_frequent(xs): 
    """returns the most frequent item in xs""" 
    x_c = count_items(xs) 
    return sorted(x_c.iteritems(), key=second, reverse=True)[0][0] 


class kNN_classifier: 
    """This is a k-nearest-neighbour classifer.""" 
    def __init__(self, train_data, k, distf): 
     self.train_data = train_data 
     self.k = min(k, len(train_data)) 
     self.distf = distf 

    def classify(self, x): 
     Ns = sorted(self.train_data, 
        key=lambda xy: self.distf(xy[0], x)) 
     return most_frequent((y for x,y in Ns[:self.k])) 

    def batch_classify(self, xs): 
     return [self.classify(x) for x in xs] 

def train(train_data, k=1, distf=count_different_values): 
    """Returns a kNN_classifer that contains the data, the number of 
    nearest neighbours k and the distance function""" 
    return kNN_classifier(train_data, k, distf) 

也www.umanitoba.ca/的另一種實現

#!/usr/bin/env python 
# This code is part of the Biopython distribution and governed by its 
# license. Please see the LICENSE file that should have been included 
# as part of this package. 
""" 
This module provides code for doing k-nearest-neighbors classification. 

k Nearest Neighbors is a supervised learning algorithm that classifies 
a new observation based the classes in its surrounding neighborhood. 

Glossary: 
distance The distance between two points in the feature space. 
weight  The importance given to each point for classification. 


Classes: 
kNN   Holds information for a nearest neighbors classifier. 


Functions: 
train  Train a new kNN classifier. 
calculate Calculate the probabilities of each class, given an observation. 
classify  Classify an observation into a class. 

    Weighting Functions: 
equal_weight Every example is given a weight of 1. 

""" 

import numpy 

class kNN: 
    """Holds information necessary to do nearest neighbors classification. 

    Members: 
    classes Set of the possible classes. 
    xs  List of the neighbors. 
    ys  List of the classes that the neighbors belong to. 
    k  Number of neighbors to look at. 

    """ 
    def __init__(self): 
     """kNN()""" 
     self.classes = set() 
     self.xs = [] 
     self.ys = [] 
     self.k = None 

def equal_weight(x, y): 
    """equal_weight(x, y) -> 1""" 
    # everything gets 1 vote 
    return 1 

def train(xs, ys, k, typecode=None): 
    """train(xs, ys, k) -> kNN 

    Train a k nearest neighbors classifier on a training set. xs is a 
    list of observations and ys is a list of the class assignments. 
    Thus, xs and ys should contain the same number of elements. k is 
    the number of neighbors that should be examined when doing the 
    classification. 

    """ 
    knn = kNN() 
    knn.classes = set(ys) 
    knn.xs = numpy.asarray(xs, typecode) 
    knn.ys = ys 
    knn.k = k 
    return knn 

def calculate(knn, x, weight_fn=equal_weight, distance_fn=None): 
    """calculate(knn, x[, weight_fn][, distance_fn]) -> weight dict 

    Calculate the probability for each class. knn is a kNN object. x 
    is the observed data. weight_fn is an optional function that 
    takes x and a training example, and returns a weight. distance_fn 
    is an optional function that takes two points and returns the 
    distance between them. If distance_fn is None (the default), the 
    Euclidean distance is used. Returns a dictionary of the class to 
    the weight given to the class. 

    """ 
    x = numpy.asarray(x) 

    order = [] # list of (distance, index) 
    if distance_fn: 
     for i in range(len(knn.xs)): 
      dist = distance_fn(x, knn.xs[i]) 
      order.append((dist, i)) 
    else: 
     # Default: Use a fast implementation of the Euclidean distance 
     temp = numpy.zeros(len(x)) 
     # Predefining temp allows reuse of this array, making this 
     # function about twice as fast. 
     for i in range(len(knn.xs)): 
      temp[:] = x - knn.xs[i] 
      dist = numpy.sqrt(numpy.dot(temp,temp)) 
      order.append((dist, i)) 
    order.sort() 

    # first 'k' are the ones I want. 
    weights = {} # class -> number of votes 
    for k in knn.classes: 
     weights[k] = 0.0 
    for dist, i in order[:knn.k]: 
     klass = knn.ys[i] 
     weights[klass] = weights[klass] + weight_fn(x, knn.xs[i]) 

    return weights 

def classify(knn, x, weight_fn=equal_weight, distance_fn=None): 
    """classify(knn, x[, weight_fn][, distance_fn]) -> class 

    Classify an observation into a class. If not specified, weight_fn will 
    give all neighbors equal weight. distance_fn is an optional function 
    that takes two points and returns the distance between them. If 
    distance_fn is None (the default), the Euclidean distance is used. 
    """ 
    weights = calculate(
     knn, x, weight_fn=weight_fn, distance_fn=distance_fn) 

    most_class = None 
    most_weight = None 
    for klass, weight in weights.items(): 
     if most_class is None or weight > most_weight: 
      most_class = klass 
      most_weight = weight 
    return most_class 
+0

第一個代碼捕捉的預期格式是什麼?你能提供一個簡單的例子嗎?我不斷收到類型錯誤。謝謝。 – amit 2013-04-06 15:32:28