How do I check the correlation between matching columns of two data sets?

Suppose we have the following two data sets:

import pandas as pd 
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}) 
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

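How can one create a correlation matrix in which the y-axis represents "a" and the x-axis represents "b"?

The goal is to see the correlation between the matching columns of the two data sets (illustrated by an image in the original post).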

2016-12-06 ishido

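Is your goal to get a single coefficient or 5 different coefficients? – ayhan

I realize now that the picture I drew was misleading. I am looking for a single coefficient between each pair of matching columns of the two data sets – ishido

This does what you want: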

from scipy.stats import pearsonr

# create a new DataFrame whose index and columns are both a's column names,
# so that matching columns line up on the diagonal
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (a more robust way would be to iterate through
# the intersection of the two sets of columns, in case your actual
# dataframes' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # (Pearson r, p-value) for those two Series
    correl = correl_signif[0]  # grab the actual Pearson r value from the result above
    c.loc[col, col] = correl  # assign the coefficient to the diagonal cell for that column
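
If only the per-column coefficients are needed (the diagonal filled in by the loop above), pandas' `DataFrame.corrwith` may be a shorter route. This is a minimal sketch using the question's data; the variable name `matching` is just for illustration.

```python
import pandas as pd

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

# Pearson correlation between each column of `a` and the same-named column
# of `b`; the result is a Series indexed by column name.
matching = a.corrwith(b)
print(matching)
```

Unlike the loop, this returns a Series rather than a square DataFrame, and it does not give the p-value that `pearsonr` also reports.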

Edit: Well, it did exactly what you wanted until the question was modified. It can easily be changed, though:

c = pd.DataFrame(columns=a.columns, index=a.columns)

for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl

c now looks like this:

Out[16]:
           A          B         C        D          E
A   0.713185  -0.592371 -0.970444 0.487752 -0.0740101
B  0.0306753 -0.0705457  0.488012  0.34686  -0.339427
C  -0.266264 -0.0198347  0.661107 -0.50872   0.683504
D   0.580956  -0.552312 -0.320539 0.384165  -0.624039
E  0.0165272   0.140005 -0.582389  0.12936   0.286023
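
A pandas-only way to get the same kind of cross-correlation matrix, without the double loop, is to join the two frames side by side and slice the relevant block out of `DataFrame.corr()`. This is a sketch under the assumption that both frames share the same row index; the `_a`/`_b` suffixes and the names `joined`, `full` and `cross` are made up for the example.

```python
import pandas as pd

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

# Suffix the column names so the two frames do not collide when joined.
joined = pd.concat([a.add_suffix("_a"), b.add_suffix("_b")], axis=1)

# Full 10x10 Pearson matrix over all columns of both frames.
full = joined.corr()

# Keep only the block with b's columns as rows and a's columns as columns.
rows = [f"{col}_b" for col in b.columns]
cols = [f"{col}_a" for col in a.columns]
cross = full.loc[rows, cols]

# Restore the plain column names as labels.
cross.index = b.columns
cross.columns = a.columns
print(cross)
```

This computes more than is needed (the a-a and b-b blocks are thrown away), so it is mainly convenient for small frames like these.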

2016-12-06 21:19:25 blacksite

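Yes! Sorry, I edited the picture I posted. Is it possible to do this so that it includes all the correlation coefficients? The matrix you got is exactly what I was looking for. – ishido

See my edit above – blacksite

Do you have to use pandas? This looks like it can be done with numpy quite easily. Did I misunderstand the task?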

import numpy

X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    # numpy.corrcoef returns a 2x2 matrix; the off-diagonal entry [0, 1]
    # is the correlation between the two inputs
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])[0, 1]))

2016-12-06 21:21:32 SeedofWInd

If you don't mind a NumPy-based vectorized solution, here is one based on this solution post to "Computing the correlation coefficient between two multi-dimensional arrays":

corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.

Sample run:

In [621]: a
Out[621]:
    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  23  12
2  78  35   0  72  31
3  84  25  14  56   0
4  26  82  13  14  34

In [622]: b
Out[622]:
    A   B   C    D   E
0  45  45  98    0  24
1  24  87  52   23  12
2  65  65  32    1  65
3  65  52  32  365   3
4  65  12  12   53  65

In [623]: corr2_coeff(a.values.T,b.values.T).T 
Out[623]: 
array([[ 0.71318502, -0.5923714 , -0.9704441 ,  0.48775228, -0.07401011],
       [ 0.0306753 , -0.0705457 ,  0.48801177,  0.34685977, -0.33942737],
       [-0.26626431, -0.01983468,  0.66110713, -0.50872017,  0.68350413],
       [ 0.58095645, -0.55231196, -0.32053858,  0.38416478, -0.62403866],
       [ 0.01652716,  0.14000468, -0.58238879,  0.12936016,  0.28602349]])
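
The body of `corr2_coeff` lives in the linked post and is not shown here. For readers who want something runnable, the following is a sketch of what a row-wise correlation function of that shape could look like. It is written for this page and is only assumed to behave like the linked one, so treat the implementation details as illustrative.

```python
import numpy as np
import pandas as pd

def corr2_coeff(A, B):
    """Row-wise Pearson correlation: out[i, j] = corr(A[i], B[j]).

    Sketch only; assumed to mirror the linked post's function in spirit.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Centre every row on its own mean.
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)

    # Sum of squared deviations per row.
    ss_A = (A_c ** 2).sum(axis=1)
    ss_B = (B_c ** 2).sum(axis=1)

    # Cross products of deviations, normalised by the row norms.
    return A_c @ B_c.T / np.sqrt(np.outer(ss_A, ss_B))

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

# Transpose so that each DataFrame column becomes a row, as in the call above.
result = corr2_coeff(a.values.T, b.values.T).T
print(result)
```

A row (here, a DataFrame column) with zero variance would lead to a division by zero, which this sketch does not guard against.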

2016-12-06 21:47:04 Divakar

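I was actually thinking about changing it all to numpy. Next I want to compute the correlation between 3 data sets, where each column name has three values on each axis. I think numpy will make that easier. Something like this: http://seaborn.pydata.org/examples/network_correlations.html – ishido

@ishido Sure, it will perform well! :) – Divakar

Hi, I have been using this solution a lot, thanks. Have you done anything like this with Spearman's rank correlation instead of Pearson's? – ishido

I use this function, which breaks the computation down with numpy: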
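
On the Spearman question: Spearman's rank correlation is Pearson's correlation computed on the ranks of the data, so one possible approach is to rank each column first and then apply the same vectorized formula. The sketch below assumes that approach; `cross_spearman` is a hypothetical helper name, not something from the answers in this thread.

```python
import numpy as np
import pandas as pd

def cross_spearman(a, b):
    """Spearman correlation of every column of `a` with every column of `b`.

    Hypothetical helper: ranks each column (ties get the average rank, the
    pandas default) and then computes Pearson's r on the ranks.
    Rows of the result follow a.columns, columns follow b.columns.
    """
    ra = a.rank().to_numpy()
    rb = b.rank().to_numpy()

    # Centre the ranks column by column.
    ra_c = ra - ra.mean(axis=0)
    rb_c = rb - rb.mean(axis=0)

    # Cross products of centred ranks, normalised by the column norms.
    cov = ra_c.T @ rb_c
    norm = np.sqrt(np.outer((ra_c ** 2).sum(axis=0), (rb_c ** 2).sum(axis=0)))
    return pd.DataFrame(cov / norm, index=a.columns, columns=b.columns)

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

spearman = cross_spearman(a, b)
print(spearman)
```

Note the orientation: rows correspond to columns of `a` and columns to columns of `b`, so transpose the result if you want `b` on the rows.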

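import numpy as np
import pandas as pd

def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    n = len(a)

    # sum over rows of a_i * b_j for every pair of columns (i, j)
    ab = a_.T.dot(b_)

    # outer products of the column sums and of the column standard
    # deviations (numpy's default, population std with ddof=0)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))

    # covariance / (std_a * std_b); rows follow a.columns, columns follow b.columns
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)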

Demo

corr_ab(a, b)

2016-12-07 00:23:42 piRSquared
