Pandas groupby with a dictionary

Pandas newbie here, so apologies if the solution is obvious.

I have a DataFrame (see below) with different movie scenes and the environment of each scene:

import pandas as pd 
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'}, 
     {'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'}, 
     {'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'}, 
     {'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'}, 
     {'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home' }] 
myDF = pd.DataFrame(data)

In this case, each movie belongs to several genres. I have a dictionary (below) that states which genres each movie belongs to:

genreDict = {'movie_X' : ['romance', 'action'], 
      'movie_Y' : ['comedy', 'romance', 'action'], 
      'movie_Z' : ['horror', 'thriller', 'romance']}

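I want to group myDF by this dictionary for each movie, specifically so that I can tell how many times a particular environment appears in a particular genre (for example, in the genre horror, 'boat' is counted once, 'beach' is counted once and 'home' is counted once). What is the best and most efficient way to do this? I tried mapping the dictionary onto the DataFrame and then grouping by the resulting list column: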

myDF['genres'] = myDF['movie'].map(genreDict)

which returns:

     movie scene environment                       genres
0  movie_X     1        home            [romance, action]
1  movie_X     2         car            [romance, action]
2  movie_X     3        home            [romance, action]
3  movie_Y     1        home    [comedy, romance, action]
4  movie_Y     2      office    [comedy, romance, action]
5  movie_Z     1        boat  [horror, thriller, romance]
6  movie_Z     2       beach  [horror, thriller, romance]
7  movie_Z     3        home  [horror, thriller, romance]

However, when I group by that column I get an error saying that lists are unhashable. Hope you can all help :)

Source

2017-07-17 XyledMonkey

Can you post your desired dataset? – MaxU

For a larger DataFrame it is faster to use numpy to repeat the rows according to the list lengths, with numpy.repeat, numpy.concatenate and .values:

import numpy as np

# length of each list in the genres column
l = myDF['genres'].str.len()
# genres column as a numpy array of lists
vals = myDF['genres'].values
# repeat each index label once per genre in its list
idx = np.repeat(myDF.index, l)
# expand rows by selecting the duplicated index labels
myDF = myDF.loc[idx]
# flatten the lists into one genre per row
myDF['genres'] = np.concatenate(vals)
# restore the default monotonic index (0, 1, 2, ...)
myDF = myDF.reset_index(drop=True)
print(myDF)
   environment    movie scene    genres
0         home  movie_X     1   romance
1         home  movie_X     1    action
2          car  movie_X     2   romance
3          car  movie_X     2    action
4         home  movie_X     3   romance
5         home  movie_X     3    action
6         home  movie_Y     1    comedy
7         home  movie_Y     1   romance
8         home  movie_Y     1    action
9       office  movie_Y     2    comedy
10      office  movie_Y     2   romance
11      office  movie_Y     2    action
12        boat  movie_Z     1    horror
13        boat  movie_Z     1  thriller
14        boat  movie_Z     1   romance
15       beach  movie_Z     2    horror
16       beach  movie_Z     2  thriller
17       beach  movie_Z     2   romance
18        home  movie_Z     3    horror
19        home  movie_Z     3  thriller
20        home  movie_Z     3   romance
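To make the mechanics of that expansion easier to follow, here is a minimal sketch on a toy frame. The frame `toy` and its column names are made up for illustration and are not part of the question's data; the numpy calls are the same ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one list-valued column, as in the question
toy = pd.DataFrame({'key': ['a', 'b'], 'tags': [['x', 'y'], ['z']]})

# Number of elements in each list: 2 for row 0, 1 for row 1
lengths = toy['tags'].str.len()

# Each index label repeated once per list element
repeated_index = np.repeat(toy.index.values, lengths.values)

# All lists joined end to end into one flat array
flat_tags = np.concatenate(toy['tags'].values)

# Select the repeated rows, then overwrite the list column with the flat values
expanded = toy.loc[repeated_index].reset_index(drop=True)
expanded['tags'] = flat_tags
print(expanded)
```

The repeated index and the flattened values have the same length by construction, which is why the flat array can be assigned straight back to the column.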

Then use groupby on the expanded myDF and aggregate with size:

df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
print(df1)
      genres environment  count
0     action         car      1
1     action        home      3
2     action      office      1
3     comedy        home      1
4     comedy      office      1
5     horror       beach      1
6     horror        boat      1
7     horror        home      1
8    romance       beach      1
9    romance        boat      1
10   romance         car      1
11   romance        home      4
12   romance      office      1
13  thriller       beach      1
14  thriller        boat      1
15  thriller        home      1
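As a side note that is not part of the answer above: on pandas 0.25 or later, DataFrame.explode should do the row expansion in one call. A minimal sketch, assuming such a pandas version and using the question's data without the scene column (the names `scenes`, `genre_dict`, `exploded` and `counts` are chosen here for illustration):

```python
import pandas as pd

# The question's data, without the scene column
scenes = pd.DataFrame({
    'movie': ['movie_X', 'movie_X', 'movie_X', 'movie_Y',
              'movie_Y', 'movie_Z', 'movie_Z', 'movie_Z'],
    'environment': ['home', 'car', 'home', 'home',
                    'office', 'boat', 'beach', 'home'],
})
genre_dict = {'movie_X': ['romance', 'action'],
              'movie_Y': ['comedy', 'romance', 'action'],
              'movie_Z': ['horror', 'thriller', 'romance']}

# Map each movie to its list of genres, then emit one row per list element
scenes['genres'] = scenes['movie'].map(genre_dict)
exploded = scenes.explode('genres').reset_index(drop=True)

# Count how often each environment occurs within each genre
counts = (exploded.groupby(['genres', 'environment'])
                  .size()
                  .reset_index(name='count'))
print(counts)
```

Whether this is faster than the numpy version depends on the data and the pandas version, so it is worth timing both on a realistic frame.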

Source

2017-07-17 18:35:11 jezrael

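Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy your data so that your subsequent steps are easier (the main operations on tabular structures are usually defined on tidy datasets). You need a dataset in which, instead of listing all genres in a single row, each genre has its own row.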
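To see why the list column trips up groupby, here is a minimal sketch on a made-up two-row frame (the names `frame`, `genres_tuple` and `by_combo` are illustrative only). Grouping needs hashable keys, and Python lists are not hashable:

```python
import pandas as pd

# Hypothetical two-row frame with a list-valued column
frame = pd.DataFrame({
    'environment': ['home', 'car'],
    'genres': [['romance', 'action'], ['romance', 'action']],
})

# Lists are mutable, so Python refuses to hash them
try:
    hash(frame.loc[0, 'genres'])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
print(list_is_hashable)

# Tuples are hashable, so grouping by them runs, but it groups by the
# whole combination of genres rather than by each individual genre
frame['genres_tuple'] = frame['genres'].apply(tuple)
by_combo = frame.groupby('genres_tuple').size()
print(by_combo)
```

Converting to tuples therefore silences the error without answering the question, which is why reshaping to one row per genre is the more useful fix.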

Here is one possible way to get one row per genre:

genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist()) 

df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True)) 
df 
Out: 
  environment    movie scene     genre
0        home  movie_X     1   romance
0        home  movie_X     1    action
1         car  movie_X     2   romance
1         car  movie_X     2    action
2        home  movie_X     3   romance
2        home  movie_X     3    action
3        home  movie_Y     1    comedy
3        home  movie_Y     1   romance
3        home  movie_Y     1    action
4      office  movie_Y     2    comedy
4      office  movie_Y     2   romance
4      office  movie_Y     2    action
5        boat  movie_Z     1    horror
5        boat  movie_Z     1  thriller
5        boat  movie_Z     1   romance
6       beach  movie_Z     2    horror
6       beach  movie_Z     2  thriller
6       beach  movie_Z     2   romance
7        home  movie_Z     3    horror
7        home  movie_Z     3  thriller
7        home  movie_Z     3   romance

Once you have this structure, it is much easier to group or cross-tabulate your data:

df.groupby('genre').size() 
Out: 
genre
action      5
comedy      2
horror      3
romance     8
thriller    3
dtype: int64

pd.crosstab(df['genre'], df['environment']) 
Out: 
environment  beach  boat  car  home  office
genre
action           0     0    1     3       1
comedy           0     0    0     1       1
horror           1     1    0     1       0
romance          1     1    1     4       1
thriller         1     1    0     1       0
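A cross-tabulation like this can be queried directly for the counts the question asks about. A minimal sketch on a small, made-up subset of the tidy data (the names `tidy`, `table`, `horror_boat` and `horror_counts` are illustrative):

```python
import pandas as pd

# Hypothetical tidy subset: one row per (scene, genre) pair
tidy = pd.DataFrame({
    'genre': ['horror', 'horror', 'horror', 'romance', 'romance'],
    'environment': ['boat', 'beach', 'home', 'home', 'home'],
})

# Rows are genres, columns are environments, cells are occurrence counts
table = pd.crosstab(tidy['genre'], tidy['environment'])

# Single cell: how often 'boat' occurs in the horror genre
horror_boat = table.loc['horror', 'boat']

# Whole row: all environment counts for the horror genre
horror_counts = table.loc['horror'].to_dict()
print(horror_boat, horror_counts)
```

Combinations that never occur show up as 0 in the table, whereas groupby with size simply leaves them out.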

Here is a great read by Hadley Wickham: Tidy Data.

Source

2017-07-17 18:27:03 ayhan
