2017-05-29 53 views
0

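Applying a function on a pandas column using information from another column

I have a DataFrame containing text descriptions of a group of people. Besides that, I have 4 descriptions: a, b, c, d. For each person's text description, I want to compare it with each of the 4 descriptions using cosine similarity, and store the scores in the same DataFrame in 4 new columns: a, b, c, d.

How can I do this the pandas way, without a for loop? I was thinking of using the apply function, but I don't know how to reference the 'text' column together with the 4 descriptions a, b, c, d inside apply.

Thank you very much for your help!

I have tried: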

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 

person_one = [' '.join(['table','car','mouse'])] 
person_two = [' '.join(['computer','card','can','mouse'])] 
person_three = [' '.join(['chair','table','whiteboard','window','button'])] 
person_four = [' '.join(['queen','king','joker','phone'])] 

description_a = [' '.join(['table','yellow','car','king'])] 
description_b = [' '.join(['bottle','whiteboard','queen'])] 
description_c = [' '.join(['chair','car','car','phone'])] 
description_d = [' '.join(['joker','blue','earphone','king'])] 

mystuff = [('person 1',person_one), 
      ('person 2',person_two), 
      ('person 3',person_three), 
      ('person 4',person_four) 
      ] 

labels = ['person','text'] 

df = pd.DataFrame.from_records(mystuff,columns = labels) 
df = df.reindex(columns = ['person','text','a','b','c','d']) 

def trying(cell,jd): 
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd) 
    jd_vector = vectorizer.transform(jd) 
    person_vector = vectorizer.transform(cell['text']) 
    score = cosine_similarity(jd_vector,person_vector) 

    return score 


df['a'] = df['a'].apply(trying(description_a)) 
df['b'] = df['b'].apply(trying(description_b)) 
df['c'] = df['c'].apply(trying(description_c)) 
df['d'] = df['d'].apply(trying(description_d)) 

This gives me an error:

df['a'] = df['a'].apply(trying(description_a)) 
TypeError: trying() missing 1 required positional argument: 'jd' 

The output should look something like this:

     person                                        text    a    b    c    d
0  person 1                         [table, car, mouse]  0.3  0.2  0.5  0.7
1  person 2                [computer, card, can, mouse]  0.2  0.1  0.9  0.7
2  person 3  [chair, table, whiteboard, window, button]  0.3  0.5  0.1  0.4
3  person 4                 [queen, king, joker, phone]  0.2  0.4  0.3  0.5

Answers

0

How about:

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 


person_one = ['table','car','mouse'] 
person_two = ['computer','card','can','mouse'] 
person_three = ['chair','table','whiteboard','window','button'] 
person_four = ['queen','king','joker','phone'] 

description_a = ['table','yellow','car','king'] 
description_b = ['bottle','whiteboard','queen'] 
description_c = ['chair','car','car','phone'] 
description_d = ['joker','blue','earphone','king'] 

descriptors = { 
    'a' : description_a, 
    'b' : description_b,
    'c' : description_c, 
    'd' : description_d 
} 

mystuff = [('person 1',person_one), 
      ('person 2',person_two), 
      ('person 3',person_three), 
      ('person 4',person_four) 
      ] 

labels = ['person','text'] 
df = pd.DataFrame.from_records(mystuff,columns = labels) 

vocabulary_data =[ 
    person_one, 
    person_two, 
    person_three, 
    person_four, 
    description_a, 
    description_b, 
    description_c, 
    description_d, 
] 

data = [set(sentence) for sentence in vocabulary_data] 
vocabulary = set.union(*data) 
cv = CountVectorizer(vocabulary=vocabulary) 


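def similarity(row, desc):
    # Each word is treated as a one-word document, so transform() returns one
    # row per word; summing the rows gives a count vector for the whole list.
    # toarray() yields a plain ndarray, which cosine_similarity accepts
    # (recent scikit-learn versions reject the np.matrix returned by a sparse sum).
    person_vec = cv.transform(row['text']).toarray().sum(axis=0, keepdims=True)
    desc_vec = cv.transform(desc).toarray().sum(axis=0, keepdims=True)
    return cosine_similarity(person_vec, desc_vec).item()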

for key, description in descriptors.items(): 
    df[key] = df.apply(lambda x: similarity(x, description), axis=1) 

I do use a for loop, but only to iterate over the different descriptions. The main computation is done through apply.
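If the goal is to avoid row-wise apply altogether, one possible alternative is to vectorize every person and every description once and let cosine_similarity compute the whole score matrix in a single call. The sketch below is not part of the answer above; it assumes each word list can be joined into one space-separated document, and the names `people`, `person_docs`, `desc_docs`, `scores_df` and `result` are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same toy data as in the question, kept as word lists.
people = {
    'person 1': ['table', 'car', 'mouse'],
    'person 2': ['computer', 'card', 'can', 'mouse'],
    'person 3': ['chair', 'table', 'whiteboard', 'window', 'button'],
    'person 4': ['queen', 'king', 'joker', 'phone'],
}
descriptors = {
    'a': ['table', 'yellow', 'car', 'king'],
    'b': ['bottle', 'whiteboard', 'queen'],
    'c': ['chair', 'car', 'car', 'phone'],
    'd': ['joker', 'blue', 'earphone', 'king'],
}

df = pd.DataFrame({'person': list(people), 'text': list(people.values())})

# Assumption: joining each word list into one string gives one document
# (and therefore one count vector) per person and per description.
person_docs = [' '.join(words) for words in df['text']]
desc_docs = [' '.join(words) for words in descriptors.values()]

# Fit one shared vocabulary so both sets of vectors have the same columns.
cv = CountVectorizer().fit(person_docs + desc_docs)

# Rows are people, columns are descriptions a, b, c, d.
scores = cosine_similarity(cv.transform(person_docs), cv.transform(desc_docs))

scores_df = pd.DataFrame(scores, columns=list(descriptors), index=df.index)
result = pd.concat([df, scores_df], axis=1)
print(result)
```

This computes all 16 scores at once, which tends to matter more as the number of people grows; for four rows either approach is fine.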

+0

Thank you very much for your help! May I ask what .sum(axis=0) does? @user1870376 – Amoroso

+0

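The 'fit_transform' function takes the list of words and returns a matrix in which each word is represented by one row. 'sum(axis=0)' sums the rows of that matrix, giving us a vector representation of the sentence. – mensik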
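To make the comment above concrete, here is a small sketch of what the per-word matrix and its row sum look like. The four-word vocabulary is an assumption chosen to keep the output short; it is not the vocabulary used in the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = ['chair', 'car', 'car', 'phone']

# Hypothetical fixed vocabulary; the column order follows this list.
cv = CountVectorizer(vocabulary=['car', 'chair', 'phone', 'table'])

# Each word is its own one-word document, so there is one row per word.
per_word = cv.transform(words).toarray()
print(per_word)
# [[0 1 0 0]
#  [1 0 0 0]
#  [1 0 0 0]
#  [0 0 1 0]]

# Summing over axis 0 collapses the rows into one count vector.
sentence_vector = per_word.sum(axis=0)
print(sentence_vector)  # [2 1 1 0]
```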

3

I can't comment yet, but to fix this error:

df['a'] = df['a'].apply(trying(description_a)) 
TypeError: trying() missing 1 required positional argument: 'jd' 

you need to pass the argument like this:

df['a'] = df['a'].apply(trying, args=(description_a,))

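The first argument will be each value of the column in your case, and the remaining arguments are taken in order from the args tuple. Note the trailing comma: args must be a tuple, and (description_a) without the comma is just the list itself.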
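As a rough illustration of how args works with Series.apply, the sketch below calls apply on the 'text' column rather than on the empty 'a' column, since the function needs the text itself. It assumes the texts and the description are plain strings; `score` is a hypothetical helper, not the `trying` function from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def score(text, description):
    # text comes from the column; description comes from args.
    # Assumption: both are plain space-separated strings.
    vectors = CountVectorizer().fit_transform([text, description])
    # Entry [0, 1] of the pairwise matrix is text vs. description.
    return float(cosine_similarity(vectors)[0, 1])


df = pd.DataFrame({
    'person': ['person 1', 'person 4'],
    'text': ['table car mouse', 'queen king joker phone'],
})
description_a = 'table yellow car king'

# The trailing comma makes args a one-element tuple.
df['a'] = df['text'].apply(score, args=(description_a,))
print(df)
```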

Hope this helps.
