2012-01-08 126 views
11

我想在許多數據點上做一個高斯擬合。例如。我有一個256 x 262144數據陣列。 256點需要擬合成高斯分佈,我需要262144個點。如何快速執行多個數據集的最小二乘擬合?

有時高斯分佈的峯值超出了數據範圍,因此要得到準確的平均結果曲線擬合是最好的方法。即使峯值在該範圍內,曲線擬合也會提供更好的西格馬,因爲其他數據不在該範圍內。

我有這個工作的一個數據點,使用代碼從http://www.scipy.org/Cookbook/FittingData

我試圖重複這個算法,但它看起來像要花費43分鐘的時間來解決這個問題。是否有一個已經寫好的並行或更高效的快速方法?

from scipy import optimize                                   
from numpy import *                                     
import numpy                                       
# Fitting code taken from: http://www.scipy.org/Cookbook/FittingData                         

class Parameter:                                      
    def __init__(self, value):                                 
      self.value = value                                 

    def set(self, value):                                  
      self.value = value                                 

    def __call__(self):                                   
      return self.value                                 


def fit(function, parameters, y, x = None):                               
    def f(params):                                    
      i = 0                                    
      for p in parameters:                                 
        p.set(params[i])                                
        i += 1                                  
      return y - function(x)                                

    if x is None: x = arange(y.shape[0])                               
    p = [param() for param in parameters]                              
    optimize.leastsq(f, p)                                  


def nd_fit(function, parameters, y, x = None, axis=0):                            
    """                                       
    Tries to an n-dimensional array to the data as though each point is a new dataset valid across the appropriate axis.           
    """                                       
    y = y.swapaxes(0, axis)                                  
    shape = y.shape                                    
    axis_of_interest_len = shape[0]                                
    prod = numpy.array(shape[1:]).prod()                               
    y = y.reshape(axis_of_interest_len, prod)                             

    params = numpy.zeros([len(parameters), prod])                            

    for i in range(prod):                                  
      print "at %d of %d"%(i, prod)                              
      fit(function, parameters, y[:,i], x)                             
      for p in range(len(parameters)):                              
        params[p, i] = parameters[p]()                            

    shape[0] = len(parameters)                                 
    params = params.reshape(shape)                                
    return params                                    

請注意,數據不一定是256x262144,我已經做了一些在nd_fit furtging使這項工作。

我用得到這個工作的代碼是

from curve_fitting import * 
import numpy 
frames = numpy.load("data.npy") 
y = frames[:,0,0,20,40] 
x = range(0, 512, 2) 
mu = Parameter(x[argmax(y)]) 
height = Parameter(max(y)) 
sigma = Parameter(50) 
def f(x): return height() * exp (-((x - mu())/sigma()) ** 2) 

ls_data = nd_fit(f, [mu, sigma, height], frames, x, 0) 

注:由@JoeKington貼在下面的解決方案是偉大的,解決了真快。但是,除非高斯的顯着區域位於適當區域內,否則似乎不起作用。我將不得不測試平均值是否仍然準確,因爲這是我使用它的主要內容。 Analysis of gaussian distribution estimations

+0

你能發表你使用的代碼嗎? – 2012-01-08 20:37:30

回答

17

最簡單的事情就是線性化問題。您正在使用非線性迭代方法,它將比線性最小二乘解決方案慢。

基本上,您有:

y = height * exp(-(x - mu)^2/(2 * sigma^2))

爲了使這是一個線性方程,取(自然)日誌雙方:

ln(y) = ln(height) - (x - mu)^2/(2 * sigma^2) 

這則簡化爲多項式:

ln(y) = -x^2/(2 * sigma^2) + x * mu/sigma^2 - mu^2/sigma^2 + ln(height) 

我們可以用更簡單的形式:

ln(y) = A * x^2 + B * x + C 

其中:

A = 1/(2 * sigma^2) 
B = mu/(2 * sigma^2) 
C = mu^2/sigma^2 + ln(height) 

然而,有一個陷阱。當分配的「尾巴」中存在噪音時,這將變得不穩定。

因此,我們只需要使用分佈附近的「峯值」附近的數據。僅包含在擬閤中高於某個閾值的數據是很容易的。在這個例子中,我只包含了大於我們擬合的給定高斯曲線最大觀測值20%的數據。

但是,一旦我們完成了這一步,速度會非常快。求解262144個不同的高斯曲線只需要1分鐘左右(如果你在一些很大的東西上運行它,一定要刪除代碼的繪圖部分)。如果你想要並行化,也很容易......

import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib as mpl 
import itertools 

def main(): 
    x, data = generate_data(256, 6) 
    model = [invert(x, y) for y in data.T] 
    sigma, mu, height = [np.array(item) for item in zip(*model)] 
    prediction = gaussian(x, sigma, mu, height) 

    plot(x, data, linestyle='none', marker='o') 
    plot(x, prediction, linestyle='-') 
    plt.show() 

def invert(x, y): 
    # Use only data within the "peak" (20% of the max value...) 
    key_points = y > (0.2 * y.max()) 
    x = x[key_points] 
    y = y[key_points] 

    # Fit a 2nd order polynomial to the log of the observed values 
    A, B, C = np.polyfit(x, np.log(y), 2) 

    # Solve for the desired parameters... 
    sigma = np.sqrt(-1/(2.0 * A)) 
    mu = B * sigma**2 
    height = np.exp(C + 0.5 * mu**2/sigma**2) 
    return sigma, mu, height 

def generate_data(numpoints, numcurves): 
    np.random.seed(3) 
    x = np.linspace(0, 500, numpoints) 

    height = 100 * np.random.random(numcurves) 
    mu = 200 * np.random.random(numcurves) + 200 
    sigma = 100 * np.random.random(numcurves) + 0.1 
    data = gaussian(x, sigma, mu, height) 

    noise = 5 * (np.random.random(data.shape) - 0.5) 
    return x, data + noise 

def gaussian(x, sigma, mu, height): 
    data = -np.subtract.outer(x, mu)**2/(2 * sigma**2) 
    return height * np.exp(data) 

def plot(x, ydata, ax=None, **kwargs): 
    if ax is None: 
     ax = plt.gca() 
    colorcycle = itertools.cycle(mpl.rcParams['axes.color_cycle']) 
    for y, color in zip(ydata.T, colorcycle): 
     ax.plot(x, y, color=color, **kwargs) 

main() 

enter image description here

我們需要改變的水貨版本的唯一的事情是主要的功能。 (我們還需要一個虛擬函數,因爲multiprocessing.Pool.imap不能提供額外的參數到它的功能......),這將是這個樣子:

def parallel_main(): 
    import multiprocessing 
    p = multiprocessing.Pool() 
    x, data = generate_data(256, 262144) 
    args = itertools.izip(itertools.repeat(x), data.T) 
    model = p.imap(parallel_func, args, chunksize=500) 
    sigma, mu, height = [np.array(item) for item in zip(*model)] 
    prediction = gaussian(x, sigma, mu, height) 

def parallel_func(args): 
    return invert(*args) 

編輯:在情況下,簡單的多項式擬合不嘗試用@tslisten共享的y值,as mentioned in the link/paper來加重問題(Stefan van der Walt實現,儘管我的實現有點不同)。

import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib as mpl 
import itertools 

def main(): 
    def run(x, data, func, threshold=0): 
     model = [func(x, y, threshold=threshold) for y in data.T] 
     sigma, mu, height = [np.array(item) for item in zip(*model)] 
     prediction = gaussian(x, sigma, mu, height) 

     plt.figure() 
     plot(x, data, linestyle='none', marker='o', markersize=4) 
     plot(x, prediction, linestyle='-', lw=2) 

    x, data = generate_data(256, 6, noise=100) 
    threshold = 50 

    run(x, data, weighted_invert, threshold=threshold) 
    plt.title('Weighted by Y-Value') 

    run(x, data, invert, threshold=threshold) 
    plt.title('Un-weighted Linear Inverse' 

    plt.show() 

def invert(x, y, threshold=0): 
    mask = y > threshold 
    x, y = x[mask], y[mask] 

    # Fit a 2nd order polynomial to the log of the observed values 
    A, B, C = np.polyfit(x, np.log(y), 2) 

    # Solve for the desired parameters... 
    sigma, mu, height = poly_to_gauss(A,B,C) 
    return sigma, mu, height 

def poly_to_gauss(A,B,C): 
    sigma = np.sqrt(-1/(2.0 * A)) 
    mu = B * sigma**2 
    height = np.exp(C + 0.5 * mu**2/sigma**2) 
    return sigma, mu, height 

def weighted_invert(x, y, weights=None, threshold=0): 
    mask = y > threshold 
    x,y = x[mask], y[mask] 
    if weights is None: 
     weights = y 
    else: 
     weights = weights[mask] 

    d = np.log(y) 
    G = np.ones((x.size, 3), dtype=np.float) 
    G[:,0] = x**2 
    G[:,1] = x 

    model,_,_,_ = np.linalg.lstsq((G.T*weights**2).T, d*weights**2) 
    return poly_to_gauss(*model) 

def generate_data(numpoints, numcurves, noise=None): 
    np.random.seed(3) 
    x = np.linspace(0, 500, numpoints) 

    height = 7000 * np.random.random(numcurves) 
    mu = 1100 * np.random.random(numcurves) 
    sigma = 100 * np.random.random(numcurves) + 0.1 
    data = gaussian(x, sigma, mu, height) 

    if noise is None: 
     noise = 0.1 * height.max() 
    noise = noise * (np.random.random(data.shape) - 0.5) 
    return x, data + noise 

def gaussian(x, sigma, mu, height): 
    data = -np.subtract.outer(x, mu)**2/(2 * sigma**2) 
    return height * np.exp(data) 

def plot(x, ydata, ax=None, **kwargs): 
    if ax is None: 
     ax = plt.gca() 
    colorcycle = itertools.cycle(mpl.rcParams['axes.color_cycle']) 
    for y, color in zip(ydata.T, colorcycle): 
     #kwargs['color'] = kwargs.get('color', color) 
     ax.plot(x, y, color=color, **kwargs) 

main() 

enter image description here enter image description here

如果這仍然給你的麻煩,然後嘗試反覆,重新加權最小二乘問題(最終的「最佳」 reccomended在@tslisten鏈接方法提及)。請記住,這將會相當慢,但是。

def iterative_weighted_invert(x, y, threshold=None, numiter=5): 
    last_y = y 
    for _ in range(numiter): 
     model = weighted_invert(x, y, weights=last_y, threshold=threshold) 
     last_y = gaussian(x, *model) 
    return model 
+2

http://scipy-central.org/item/28/2/fitting-a-gaussian-to-noisy-data-points瞭解更多信息。 – tillsten 2012-01-09 21:41:04

+1

C = mu^2 /(2 * sigma^2)+ ln(高度)?我不認爲這個2號在mu^2期內被取消了。這是如何在0.5因子的代碼中完成的。 – Michael 2012-01-09 21:51:45

+1

@tillsten - 這是一個非常好的解釋!我以前沒有見過。 – 2012-01-09 22:12:56