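Pandas: correlation between a list of columns and the whole DataFrame

I'm looking for help with the Pandas .corr() method.

As is, I can use the .corr() method to calculate a heatmap of every possible combination of columns: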

corr = data.corr() 
sns.heatmap(corr)

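On my DataFrame with 23,000 columns, this may finish around the heat death of the universe.

I can also compute a more reasonable correlation between a subset of columns: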

data2 = data[list_of_column_names] 
corr = data2.corr(method="pearson") 
sns.heatmap(corr)

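This gives me something I can use.

What I would like to do is compare a list of 20 columns against the whole dataset. The normal .corr() function can give me a 20x20 or a 23,000x23,000 heatmap, but what I really want is a 20x23,000 heatmap.

How can I add more specificity to my correlations?

Thanks for the help!

Source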

2017-08-03 CalendarJ

After working through this last night, I came to the following answer:

import pandas as pd
import scipy.stats
import seaborn as sns

# 'data' is the DataFrame imported earlier;
# list_1 and list_2 hold the column names you want to compare
# Collect the results in a dictionary keyed by (gene1, gene2)
plotDict = {}
for gene1 in list_1:
    for gene2 in list_2:
        # pearsonr returns (r, p-value) for the two columns
        plotDict[(gene1, gene2)] = scipy.stats.pearsonr(data[gene1], data[gene2])
# Unstack the dictionary into a DataFrame (rows: list_1, columns: list_2)
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the Pearson r value out of each result
dfOutputPearson = dfOutput.apply(lambda col: col.apply(lambda res: res[0]))
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)

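Like the other answer, this produces a heatmap, but it scales to something like a 20,000x30 matrix without computing the correlations for all 20,000x20,000 combinations (so it finishes much faster).

Source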
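The nested loop above calls pearsonr once per pair. As a rough sketch of a vectorized alternative (not the method from this answer), the same subset-by-all block can be computed with a single matrix product on standardized columns. The helper name `cross_corr` and the demo data are hypothetical, and the sketch assumes every column is numeric, has no NaN values, and is not constant.

```python
import numpy as np
import pandas as pd


def cross_corr(data, subset_cols):
    """Pearson r of each column in subset_cols against every column of data.

    Assumes numeric columns with no NaN values and non-zero variance.
    """
    values = data.to_numpy(dtype=float)
    # Standardize each column: zero mean, unit population standard deviation
    z = (values - values.mean(axis=0)) / values.std(axis=0)
    # Positions of the subset columns inside the full column list
    pos = [data.columns.get_loc(c) for c in subset_cols]
    # (k x n_rows) @ (n_rows x n_cols) / n_rows is the k x n_cols block of r values
    block = z[:, pos].T @ z / len(data)
    return pd.DataFrame(block, index=subset_cols, columns=data.columns)


# Hypothetical demo data: 50 rows, 7 columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 7)), columns=list("ABCDEFG"))
demo_block = cross_corr(demo, ["A", "B", "C"])
print(demo_block.shape)
```

Only the requested rows of the correlation matrix are ever computed, so the cost grows with 20x23,000 rather than 23,000x23,000. The result can be passed to sns.heatmap in the same way as dfOutputPearson.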

2017-08-04 14:37:45 CalendarJ

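Make a list of the subset you want (A, B, and C in this example), create an empty DataFrame, then fill in the desired values with a nested loop.

import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))
# compute the full correlation matrix once, not on every loop iteration
full_corr = df.corr()

# initiate empty dataframe
corr = pd.DataFrame()
for a in list('ABC'):
    for b in list(df.columns.values):
        corr.loc[a, b] = full_corr.loc[a, b]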

corr 
Out[137]: 
          A         B         C         D         E         F         G
A  1.000000  0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B  0.183584  1.000000  0.119418  0.254775 -0.131564 -0.226491 -0.202978
C -0.175979  0.119418  1.000000  0.146807 -0.045952 -0.037082 -0.204993

sns.heatmap(corr)

Source

2017-08-03 15:34:26 Andrew

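Thanks for the helpful comment! This looks like it works well in theory. In practice, something like 'corr = data.corr().iloc[3:5,1:2]', which should be a relatively simple correlation, takes a very long time to finish (it still has not after about 5 minutes). I'm guessing this is because .corr() first computes the correlations between all 23,000 of my columns and only then slices. – CalendarJ

OK. I'll edit to show how to do this. – Andrew

If the new change solves your problem, please accept this answer. – Andrew

Usually it makes the most sense to compute correlation coefficients pairwise over all variables. DataFrame.corr() is a convenience function that computes the pairwise correlation coefficients (for all pairs). You can also do this with scipy, and restrict it to only the specified pairs in a loop.

Example:

In pandas:

import pandas as pd
d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])

The correlation for a single pair is: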

d.corr().loc['A','B']

-0.98782916114726194

The equivalent in SciPy:

import scipy.stats 
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]

-0.98782916114726194
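For the "20 columns against everything" case from the question, pandas also provides DataFrame.corrwith, which correlates every column of a frame with a single Series. Below is a minimal sketch assuming all columns are numeric. The demo frame, the `subset` list, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data standing in for the real 23,000-column frame
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 7)), columns=list("ABCDEFG"))
subset = ["A", "B", "C"]  # the columns of interest

# corrwith(Series) returns the Pearson r of that Series with each column of df
rows = {col: df.corrwith(df[col]) for col in subset}

# One row per subset column, one column per column of df
cross = pd.DataFrame(rows).T
print(cross.round(3))
```

Each corrwith call does work proportional to the number of columns, so the full square correlation matrix is never built.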

Source

2017-08-03 15:42:41
