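Apply a function to a pandas column using information from another column

I have a DataFrame containing text descriptions of a group of people. In addition, I have 4 descriptions: a, b, c and d. I want to compare each person's text description with each of the 4 descriptions using cosine similarity, and store the scores in 4 new columns of the same DataFrame: a, b, c, d.
How can I do this the pandas way, without a for loop? I am thinking of using the apply function, but I don't know how to refer to the 'text' column and to the 4 descriptions a, b, c, d inside apply.
Thank you very much for your help!
Here is what I tried: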
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]
description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])
def trying(cell,jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector,person_vector)
    return score
df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))
This gives me an error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
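The error arises because `trying(description_a)` is called immediately, before `apply` ever runs, so `description_a` is bound to `cell` and nothing is passed for `jd`. `apply` needs the function itself, with any extra arguments given separately. A minimal sketch of one way to do that follows. It makes several assumptions that are not in the original post: the texts are stored as plain strings rather than one-element lists, the helper is renamed `similarity` (a hypothetical name), the function is applied to `df['text']` rather than to the empty column `df['a']`, and the vocabulary is fitted on both strings rather than on the description alone.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: each person's text is a plain string, not a one-element list.
df = pd.DataFrame({
    'person': ['person 1', 'person 2', 'person 3', 'person 4'],
    'text': ['table car mouse',
             'computer card can mouse',
             'chair table whiteboard window button',
             'queen king joker phone'],
})

descriptions = {
    'a': 'table yellow car king',
    'b': 'bottle whiteboard queen',
    'c': 'chair car car phone',
    'd': 'joker blue earphone king',
}

def similarity(text, jd):
    # Hypothetical helper: fit the vocabulary on both strings so that
    # words appearing only in the person's text still count.
    vectorizer = CountVectorizer(analyzer='word').fit([jd, text])
    jd_vector = vectorizer.transform([jd])
    text_vector = vectorizer.transform([text])
    # cosine_similarity returns a 1x1 array here; take the scalar.
    return cosine_similarity(jd_vector, text_vector)[0, 0]

# Pass the function itself; extra keyword arguments are forwarded by apply.
for name, jd in descriptions.items():
    df[name] = df['text'].apply(similarity, jd=jd)
```

The remaining loop runs over the four descriptions only, not over the rows. Each call to `apply` receives one cell of the 'text' column as the first argument and `jd` as a keyword argument.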
The output should look like this:
person text a b c d
0 person 1 [table, car, mouse] 0.3 0.2 0.5 0.7
1 person 2 [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2 person 3 [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3 person 4 [queen, king, joker, phone] 0.2 0.4 0.3 0.5
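A table of this shape can also be produced without `apply` at all, by vectorizing every text once and computing all pairwise similarities in a single call. The following is only a sketch under stated assumptions: the texts are plain strings rather than one-element lists, one shared vocabulary is fitted over all people and all descriptions, and the names `person_matrix`, `description_matrix` and `scores` are hypothetical. The actual scores will differ from the values shown in the table above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: each person's text is a plain string, not a one-element list.
df = pd.DataFrame({
    'person': ['person 1', 'person 2', 'person 3', 'person 4'],
    'text': ['table car mouse',
             'computer card can mouse',
             'chair table whiteboard window button',
             'queen king joker phone'],
})

descriptions = {
    'a': 'table yellow car king',
    'b': 'bottle whiteboard queen',
    'c': 'chair car car phone',
    'd': 'joker blue earphone king',
}

# Fit one shared vocabulary over every text so all vectors are comparable.
vectorizer = CountVectorizer(analyzer='word')
vectorizer.fit(list(df['text']) + list(descriptions.values()))

person_matrix = vectorizer.transform(list(df['text']))
description_matrix = vectorizer.transform(list(descriptions.values()))

# scores[i, j] is the similarity between person i and description j.
scores = cosine_similarity(person_matrix, description_matrix)

# Attach the scores as columns a, b, c, d, aligned on the row index.
df = df.join(pd.DataFrame(scores, index=df.index, columns=list(descriptions)))
```

Because cosine similarity ignores dimensions where both vectors are zero, a shared vocabulary gives the same score for a pair as fitting a vocabulary on just that pair would.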
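Thank you very much for your help! May I ask what .sum(axis=0) does? @user1870376 – Amoroso
The 'fit_transform' function takes a list of words and returns a matrix in which each word is represented by a row. 'sum(axis=0)' sums over the rows of the matrix, giving us a vector representation of the whole sentence. – mensik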
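The answer these comments refer to is not reproduced here, so the following is only an illustration of the behaviour mensik describes, using the words of description c as sample input. The variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One "document" per word, as in the comment above.
words = ['chair', 'car', 'car', 'phone']

vectorizer = CountVectorizer(analyzer='word')
word_matrix = vectorizer.fit_transform(words)  # sparse, one row per word

# Columns follow the sorted vocabulary: car, chair, phone.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(word_matrix.toarray())

# Summing down the rows collapses the per-word rows into one count vector.
sentence_vector = word_matrix.sum(axis=0)
print(sentence_vector)  # [[2 1 1]]
```

Each row contains a single 1 in the column of its word, so adding the rows together yields the word counts for the sentence as a whole: 'car' twice, 'chair' and 'phone' once each.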