2017-02-27 88 views
2

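Multiply two pandas MultiIndex frames, each row with every row

I need to multiply two MultiIndexed frames (say df1 and df2) that share the same top-level index, so that for each top-level index value, every row of df1 is multiplied element-wise by every row of df2. I have implemented the following example, which does what I want, but it looks quite ugly: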

import numpy as np
import pandas as pd

a = ['alpha', 'beta']
b = ['A', 'B', 'C']
c = ['foo', 'bar']
df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=pd.MultiIndex.from_product(
                       [a, b],
                       names=['greek', 'latin']),
                   columns=['C1', 'C2', 'C3', 'C4'])
df2 = pd.DataFrame(
    np.array([[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 2, 0, 4]]),
    index=pd.MultiIndex.from_product([a, c], names=['greek', 'foobar']),
    columns=['C1', 'C2', 'C3', 'C4'])

df3 = pd.DataFrame(
    columns=['greek', 'latin', 'foobar', 'C1', 'C2', 'C3', 'C4'])

# Note: DataFrame.append and Series.append were removed in pandas 2.0,
# so this loop needs pandas < 2.0.
for i in df1.index.get_level_values('greek').unique():
    for j in df1.loc[i].index.get_level_values('latin').unique():
        for k in df2.loc[i].index.get_level_values('foobar').unique():
            # one output row: the three labels followed by the row product
            labels = pd.Series([i, j, k], index=['greek', 'latin', 'foobar'])
            df3 = df3.append(
                labels.append(df1.loc[i, j] * df2.loc[i, k]),
                ignore_index=True)

df3.set_index(['greek', 'latin', 'foobar'], inplace=True)

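As you can see, the code is very manual: the columns are spelled out by hand several times, and the index is only set at the very end. Here are the inputs and the output. They are correct and exactly what I want: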

df1:

                   C1        C2        C3        C4
greek latin
alpha A      0.208380  0.856373 -1.041598  1.219707
      B      1.547903 -0.001023  0.918973  1.153554
      C      0.195868  2.772840  0.060960  0.311247
beta  A      0.690405 -1.258012  0.118000 -0.346677
      B      0.488327 -1.206428  0.967658  1.198287
      C      0.420098 -0.165721  0.626893 -0.377909

df2:

              C1  C2  C3  C4
greek foobar
alpha foo      1   0   1   0
      bar      1   1   1   1
beta  foo      0   0   0   0
      bar      0   2   0   4

Result:

                          C1        C2        C3        C4
greek latin foobar
alpha A     foo     0.208380  0.000000 -1.041598  0.000000
            bar     0.208380  0.856373 -1.041598  1.219707
      B     foo     1.547903 -0.000000  0.918973  0.000000
            bar     1.547903 -0.001023  0.918973  1.153554
      C     foo     0.195868  0.000000  0.060960  0.000000
            bar     0.195868  2.772840  0.060960  0.311247
beta  A     foo     0.000000 -0.000000  0.000000 -0.000000
            bar     0.000000 -2.516025  0.000000 -1.386708
      B     foo     0.000000 -0.000000  0.000000  0.000000
            bar     0.000000 -2.412855  0.000000  4.793149
      C     foo     0.000000 -0.000000  0.000000 -0.000000
            bar     0.000000 -0.331443  0.000000 -1.511638

Thanks in advance!

Answers

2

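I came up with the following solution, which seems to work and gives the correct result. Stephen's answer is still the fastest, but this one is close enough and has one big advantage: it works for arbitrary MultiIndexed frames, not only for frames whose index is a product of lists. That is the case I actually needed to solve, even though the example I posted did not reflect it. Thanks to Stephen for the excellent and fast solution for that case; I certainly learned a few things from that code!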

Code:

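# move 'foobar' to the outer level so that dft.loc[label] is indexed by 'greek'
dft = df2.swaplevel()
dft.sort_index(level=0, inplace=True)  # sortlevel() was removed in pandas 1.0
foobar_labels = dft.index.get_level_values('foobar').unique()
# multiply df1 by each 'foobar' slice; the frames align on the shared 'greek' level
df5 = pd.concat([df1 * dft.loc[i, :] for i in foobar_labels],
                keys=foobar_labels.tolist(), names=['foobar'])
df5 = df5.reorder_levels(['greek', 'latin', 'foobar'], axis=0)
df5.sort_index(level=0, inplace=True)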
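
To illustrate the claim that this approach does not need a product-shaped index, here is a minimal sketch on a small made-up frame where 'alpha' has two latin labels and 'beta' has three. The data values and the names `cols`, `labels`, `pieces` and `result` are illustrative assumptions, not part of the original answer. The sketch also assumes that every foobar label occurs under every greek key in df2; if one were missing, the alignment would be expected to leave NaN rows that need dropping.

```python
import numpy as np
import pandas as pd

cols = ["C1", "C2", "C3", "C4"]

# df1's index is NOT a full product: 'alpha' has two latin labels, 'beta' has three.
idx1 = pd.MultiIndex.from_tuples(
    [("alpha", "A"), ("alpha", "B"), ("beta", "A"), ("beta", "B"), ("beta", "C")],
    names=["greek", "latin"],
)
df1 = pd.DataFrame(np.arange(20, dtype=float).reshape(5, 4), index=idx1, columns=cols)

idx2 = pd.MultiIndex.from_product(
    [["alpha", "beta"], ["foo", "bar"]], names=["greek", "foobar"]
)
df2 = pd.DataFrame(
    [[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 2, 0, 4]],
    index=idx2,
    columns=cols,
)

# Put 'foobar' first so that .loc[label] leaves a frame indexed by 'greek' only.
dft = df2.swaplevel().sort_index(level=0)
labels = dft.index.get_level_values("foobar").unique()

# df1 * (frame indexed by 'greek') aligns on the shared 'greek' level.
pieces = [df1 * dft.loc[label] for label in labels]
result = (
    pd.concat(pieces, keys=labels.tolist(), names=["foobar"])
    .reorder_levels(["greek", "latin", "foobar"])
    .sort_index()
)
print(result)
```

The result has one row per (greek, latin, foobar) combination that actually exists in df1, ten rows here, rather than one per element of a full product.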

Test data:

import pandas as pd 
import numpy as np 

a = ['alpha', 'beta'] 
b = ['A', 'B', 'C'] 
c = ['foo', 'bar'] 
data_columns = ['C1', 'C2', 'C3', 'C4'] 
columns = ['greek', 'latin', 'foobar'] + data_columns 

df1 = pd.DataFrame(np.random.randn(len(a) * len(b), len(data_columns)),
                   index=pd.MultiIndex.from_product(
                       [a, b], names=columns[0:2]),
                   columns=data_columns
                   )
df2 = pd.DataFrame(np.array([[1, 0, 1, 0],
                             [1, 1, 1, 1],
                             [0, 0, 0, 0],
                             [0, 2, 0, 4],
                             ]),
                   index=pd.MultiIndex.from_product(
                       [a, c],
                       names=[columns[0], columns[2]]),
                   columns=data_columns
                   )

Timing code:

def method1():
    # relies on DataFrame.append / Series.append (removed in pandas 2.0)
    df3 = pd.DataFrame(columns=columns)

    for i in df1.index.get_level_values('greek').unique():
        for j in df1.loc[i].index.get_level_values('latin').unique():
            for k in df2.loc[i].index.get_level_values('foobar').unique():
                df3 = df3.append(pd.Series(
                    [i, j, k],
                    index=columns[:3]).append(
                    df1.loc[i, j] * df2.loc[i, k]), ignore_index=True)
    df3.set_index(columns[:3], inplace=True)
    return df3

def method2(): 
    # build an index from the three index columns 
    idx = [df1.index.get_level_values(col).unique() for col in columns[:2] 
      ] + [df2.index.get_level_values(columns[2]).unique()] 
    size = [len(x) for x in idx] 
    index = pd.MultiIndex.from_product(idx, names=columns[:3]) 

    # get the indices needed for df1 and df2 
    idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1) 
    idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1) 
    idx_1 = idx_a[0] 
    idx_2 = idx_a[1] + idx_b[0] * size[2] 

    # map the two frames into a multiply-able form 
    y1 = df1.values[idx_1, :] 
    y2 = df2.values[idx_2, :] 

    # multiply the two frames
    df4 = pd.DataFrame(y1 * y2, index=index, columns=columns[3:]) 
    return df4 


def method3():
    dft = df2.swaplevel()
    dft.sort_index(level=0, inplace=True)  # sortlevel() was removed in pandas 1.0
    foobar_labels = dft.index.get_level_values('foobar').unique()
    df5 = pd.concat([df1 * dft.loc[i, :] for i in foobar_labels],
                    keys=foobar_labels.tolist(), names=['foobar'])
    df5 = df5.reorder_levels(['greek', 'latin', 'foobar'], axis=0)
    df5.sort_index(level=0, inplace=True)
    return df5


from timeit import timeit 
print(timeit(method1, number=50)) 
print(timeit(method2, number=50)) 
print(timeit(method3, number=50)) 

Results:

4.089807642158121 
0.12291539693251252 
0.33667341712862253 
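
The timings only compare speed, so it may be worth checking that the index-arithmetic approach and the concat approach return the same numbers. The following is a minimal sketch of such a check on data laid out like the test data above; the seeded generator and the names `by_index`, `by_concat`, `same_index` and `same_values` are illustrative assumptions. Both results are sorted first, because the concat approach returns a sorted index while the index-arithmetic approach keeps the product order.

```python
import numpy as np
import pandas as pd

a, b, c = ["alpha", "beta"], ["A", "B", "C"], ["foo", "bar"]
cols = ["C1", "C2", "C3", "C4"]
rng = np.random.default_rng(0)  # seeded only so the check is repeatable

df1 = pd.DataFrame(
    rng.standard_normal((len(a) * len(b), len(cols))),
    index=pd.MultiIndex.from_product([a, b], names=["greek", "latin"]),
    columns=cols,
)
df2 = pd.DataFrame(
    [[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 2, 0, 4]],
    index=pd.MultiIndex.from_product([a, c], names=["greek", "foobar"]),
    columns=cols,
)

# Approach 1: index arithmetic on the underlying arrays (product index only).
size = [len(a), len(b), len(c)]
idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1)
idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1)
idx_1 = idx_a[0]
idx_2 = idx_a[1] + idx_b[0] * size[2]
by_index = pd.DataFrame(
    df1.to_numpy()[idx_1] * df2.to_numpy()[idx_2],
    index=pd.MultiIndex.from_product([a, b, c], names=["greek", "latin", "foobar"]),
    columns=cols,
)

# Approach 2: multiply df1 by each 'foobar' slice and concatenate.
dft = df2.swaplevel().sort_index(level=0)
labels = dft.index.get_level_values("foobar").unique()
by_concat = pd.concat(
    [df1 * dft.loc[label] for label in labels],
    keys=labels.tolist(),
    names=["foobar"],
).reorder_levels(["greek", "latin", "foobar"])

# Compare after sorting both results the same way.
left, right = by_index.sort_index(), by_concat.sort_index()
same_index = left.index.equals(right.index)
same_values = np.allclose(left.to_numpy(), right.to_numpy())
print(same_index, same_values)
```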
1

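Here is your code without the for loops. The basic idea is to expand the two matrices so that they have the same size and can be multiplied. Then multiply...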

Code:

# build an index from the three index columns 
idx = [df1.index.get_level_values(col).unique() for col in columns[:2] 
     ] + [df2.index.get_level_values(columns[2]).unique()] 
size = [len(x) for x in idx] 
index = pd.MultiIndex.from_product(idx, names=columns[:3]) 

# get the indices needed for df1 and df2 
idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1) 
idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1) 
idx_1 = idx_a[0] 
idx_2 = idx_a[1] + idx_b[0] * size[2] 

# map the two frames into a multiply-able form 
y1 = df1.values[idx_1, :] 
y2 = df2.values[idx_2, :] 

# multiply the two frames 
df = pd.DataFrame(y1 * y2, index=index, columns=columns[3:]) 
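
To see what the index arithmetic does, here is a small worked sketch for the sizes in the example (2 greek, 3 latin and 2 foobar labels). The names `n_greek`, `n_latin` and `n_foobar` are illustrative stand-ins for `size[0]`, `size[1]` and `size[2]`. It assumes, as the code above does, that both frames have full product indexes with the greek level outermost. The `np.repeat`/`np.tile` lines are just an alternative way of writing the same row selection, not part of the original answer.

```python
import numpy as np

# Sizes mirroring the example: 2 greek, 3 latin, 2 foobar labels.
n_greek, n_latin, n_foobar = 2, 3, 2

idx_a = np.indices((n_greek * n_latin, n_foobar)).reshape(2, -1)
idx_b = np.indices((n_greek, n_latin * n_foobar)).reshape(2, -1)

# Row of df1 to use for each output row: every df1 row repeated n_foobar times.
idx_1 = idx_a[0]
# Row of df2 to use for each output row: the foobar position within the group,
# shifted by the start of the greek group in df2.
idx_2 = idx_a[1] + idx_b[0] * n_foobar

print(idx_1)  # [0 0 1 1 2 2 3 3 4 4 5 5]
print(idx_2)  # [0 1 0 1 0 1 2 3 2 3 2 3]

# The same selection written with repeat/tile, as a cross-check.
alt_1 = np.repeat(np.arange(n_greek * n_latin), n_foobar)
greek_of_row = np.repeat(np.arange(n_greek), n_latin * n_foobar)
alt_2 = np.tile(np.arange(n_foobar), n_greek * n_latin) + greek_of_row * n_foobar
print(np.array_equal(idx_1, alt_1), np.array_equal(idx_2, alt_2))  # True True
```

Output row 6, for example, pairs df1 row 3 (beta, A) with df2 row 2 (beta, foo), so the greek keys always match.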

Test data:

import pandas as pd 
import numpy as np 

a = ['alpha', 'beta'] 
b = ['A', 'B', 'C'] 
c = ['foo', 'bar'] 
data_columns = ['C1', 'C2', 'C3', 'C4'] 
columns = ['greek', 'latin', 'foobar'] + data_columns 

df1 = pd.DataFrame(np.random.randn(len(a) * len(b), len(data_columns)),
                   index=pd.MultiIndex.from_product(
                       [a, b], names=columns[0:2]),
                   columns=data_columns
                   )
df2 = pd.DataFrame(np.array([[1, 0, 1, 0],
                             [1, 1, 1, 1],
                             [0, 0, 0, 0],
                             [0, 2, 0, 4],
                             ]),
                   index=pd.MultiIndex.from_product(
                       [a, c],
                       names=[columns[0], columns[2]]),
                   columns=data_columns
                   )

Timing code:

def method1():
    # relies on DataFrame.append / Series.append (removed in pandas 2.0)
    df3 = pd.DataFrame(columns=columns)

    for i in df1.index.get_level_values('greek').unique():
        for j in df1.loc[i].index.get_level_values('latin').unique():
            for k in df2.loc[i].index.get_level_values('foobar').unique():
                df3 = df3.append(pd.Series(
                    [i, j, k],
                    index=columns[:3]).append(
                    df1.loc[i, j] * df2.loc[i, k]), ignore_index=True)
    df3.set_index(columns[:3], inplace=True)
    return df3

def method2(): 
    # build an index from the three index columns 
    idx = [df1.index.get_level_values(col).unique() for col in columns[:2] 
      ] + [df2.index.get_level_values(columns[2]).unique()] 
    size = [len(x) for x in idx] 
    index = pd.MultiIndex.from_product(idx, names=columns[:3]) 

    # get the indices needed for df1 and df2 
    idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1) 
    idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1) 
    idx_1 = idx_a[0] 
    idx_2 = idx_a[1] + idx_b[0] * size[2] 

    # map the two frames into a multiply-able form 
    y1 = df1.values[idx_1, :] 
    y2 = df2.values[idx_2, :] 

    # multiply the two frames
    df4 = pd.DataFrame(y1 * y2, index=index, columns=columns[3:]) 
    return df4 

from timeit import timeit 
print(timeit(method1, number=50)) 
print(timeit(method2, number=50)) 

Results:

7.96668368373 
0.149504332128