2017-10-04 87 views

Finding the best-fitting sentence for each array of tokens. I have the following dataframe for text mining:

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover", 
        "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series", 
        "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications", 
        "Without EETS editions, study of medieval English texts would hardly be possible."]}) 



text 
0 Anyone who reads Old and Middle English litera... 
1 Most of the works attributed to King Alfred or... 
2 all of the surviving medieval drama, most of t... 
3 Without EETS editions, study of medieval Engli... 

And I have the following list of token arrays:

tokens = [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']] 

For each token array in the list, I am trying to find the most appropriate sentence from the dataframe.

Update: I was asked to explain my problem in more detail.

The problem is that I am actually doing this on non-English text, so illustrating my problem more concretely is difficult.

I am looking for some function x that takes each element of my tokens list as input and, for each element, searches for the most suitable sentence (in some metric sense, perhaps) in df.text. The exact output does not matter; that is the main idea. I just want it to work :)


Also, could you explain your problem a bit more and add the expected output? –


Compute the similarity between each sentence and each token list, and pick the most similar sentence as that token list's output. Or, more simply, count the occurrences of each token list's tokens in every sentence and pick the sentence in which the most tokens occur as that token list's output. – mutux
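The simpler occurrence-count idea from the comment above can be sketched roughly as follows (a minimal illustration using the example data from the question, not the real non-English text; lowercase substring matching is assumed because the tokens are stems):

```python
import pandas as pd

df = pd.DataFrame({'text': [
    "Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
    "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
    "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
    "Without EETS editions, study of medieval English texts would hardly be possible."]})

tokens = [['middl engl', 'mid-brown', 'symbol'], ['king', 'anonym', 'series'],
          ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]

def best_sentence(token_list, sentences):
    # Count how many tokens from the list occur in each sentence;
    # stems are matched as lowercase substrings ("relig" matches "religious").
    counts = [sum(tok in sent.lower() for tok in token_list) for sent in sentences]
    return counts.index(max(counts))  # index of the sentence with the most hits

for tl in tokens:
    print(tl, '->', best_sentence(tl, df['text']))
```

Note that substring matching can be fooled by multi-word stems like `'middl engl'`, which never matches "Middle English" exactly; stemming the sentences themselves before comparison would be more robust.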

Answer


As I said earlier, this post is only an illustration of my problem. I am actually solving a clustering problem, using LDA and the K-means algorithm. To find the most suitable sentence for each of my token lists, I used the K-means distance parameter.

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
import lda  # the standalone "lda" package (pip install lda), not sklearn's LDA 
from sklearn.feature_extraction.text import CountVectorizer 
import logging 
from sklearn.cluster import MiniBatchKMeans 

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover", 
         "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series", 
         "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications", 
         "Without EETS editions, study of medieval English texts would hardly be possible."], 
        'tokens':[['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]}) 
df['tokens'] = df.tokens.str.join(',')  # flatten each token list into one string for vectorizing 


# TF-IDF features (used for K-means on the joined token strings) 
vectorizer = TfidfVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2)) 
vz = vectorizer.fit_transform(df['tokens']) 

logging.getLogger("lda").setLevel(logging.WARNING) 
# Raw term counts (the lda package expects counts, not TF-IDF weights) 
cvectorizer = CountVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2)) 
cvz = cvectorizer.fit_transform(df['tokens']) 

n_topics = 4  # one topic per token list 

n_iter = 2000 
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter) 
X_topics = lda_model.fit_transform(cvz)  # document-topic distributions 

num_clusters = 4 
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1, 
          init_size=1000, batch_size=1000, verbose=False, max_iter=1000) 

# K-means on the TF-IDF vectors 
kmeans = kmeans_model.fit(vz) 
kmeans_clusters = kmeans.predict(vz) 
kmeans_distances = kmeans.transform(vz)  # distance of each document to each centroid 

# K-means again, this time on the LDA topic distributions 
X_all = X_topics 
kmeans1 = kmeans_model.fit(X_all) 
kmeans_clusters1 = kmeans1.predict(X_all) 
kmeans_distances1 = kmeans1.transform(X_all) 
d = dict() 
l = 1  # running minimum distance seen so far 

for i, desc in enumerate(df.text): 
    if i < 3: 
        num = 3  # cluster of interest 
        if kmeans_clusters1[i] == num: 
            # remember the sentence closest to this cluster's centroid 
            if l > kmeans_distances1[i][kmeans_clusters1[i]]: 
                l = kmeans_distances1[i][kmeans_clusters1[i]] 
                d['Cluster' + str(kmeans_clusters1[i])] = "distance: " + str(l) + " " + df.iloc[i]['text'] 
            print("Cluster " + str(kmeans_clusters1[i]) + ": " + desc + 
                  "(distance: " + str(kmeans_distances1[i][kmeans_clusters1[i]]) + ")") 
            print('---') 
print("Cluster " + str(num) + " " + str(d.get('Cluster' + str(num)))) 

So the sentence with the lowest distance to the centroid within a particular cluster of tokens is the best fit.
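A lighter-weight alternative worth noting (my own sketch, not the approach above): vectorize the sentences with TF-IDF, transform the joined token lists into the same vector space, and pick the closest sentence by cosine similarity. Because the tokens are stems, only stems that happen to coincide with whole words will match under the default tokenizer, so in practice the sentences should be stemmed first; a token list with no vocabulary overlap falls back to index 0 under argmax.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
    "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
    "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
    "Without EETS editions, study of medieval English texts would hardly be possible."]

token_lists = [['middl engl', 'mid-brown', 'symbol'], ['king', 'anonym', 'series'],
               ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]

vec = TfidfVectorizer()
S = vec.fit_transform(sentences)                       # sentence vectors
T = vec.transform([' '.join(t) for t in token_lists])  # token-list vectors, same vocabulary
best = cosine_similarity(T, S).argmax(axis=1)          # closest sentence index per token list
print(best)
```

This avoids fitting LDA and K-means entirely when all that is needed is a per-token-list nearest sentence.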