使用Python打印屬於文檔中最常用單詞的句子

我有一個文本文檔，我使用regex和nltk來查找本文檔中最常見的單詞5。我必須打印這些單詞所屬的句子，我該怎麼做？此外，我想擴展到在多個文檔中查找常用單詞並返回它們各自的句子。使用Python打印屬於文檔中最常用單詞的句子

import nltk 
import collections 
from collections import Counter 

import re 
import string 

frequency = {} 
document_text = open('test.txt', 'r') 
text_string = document_text.read().lower() 
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string) #return all the words with the number of characters in the range [3-15] 

fdist = nltk.FreqDist(match_pattern) # creates a frequency distribution from a list 
most_common = fdist.max() # returns a single element 
top_five = fdist.most_common(5)# returns a list 

list_5=[word for (word, freq) in fdist.most_common(5)] 


print(top_five) 
print(list_5)

輸出：

[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)] 
['you', 'tuples', 'the', 'are', 'pard']

輸出最常出現的話，我必須打印在那裏這些話屬於，我怎麼做到這一點的句子？

來源

2017-08-20 Ajinkya Bobade

雖然不佔在喜歡你的碼字邊界特殊字符呢，以下將是一個起點：

for sentence in text_string.split('.'): 
    if list(set(list_5) & set(sentence.split(' '))): 
     print sentence

我們首先遍歷句子，假設每個句子以.結束並且.字符在文本中沒有其他地方。之後，如果您的list_5中的詞組集合中的單詞集合中的intersection不爲空，我們將打印該句子。

來源

2017-08-20 17:07:44

如何刪除其他部分，你的代碼的輸出爲：;} {\ levelnumbers \」 01;} \ FI-360 \ li720 \ lin720} {\ LISTNAME;} \ listid1}} {\ * \ listoverridetable {\ listoverride \ listid1 \ listoverridecount0 \ LS1}} \ margl1440 \ margr1440 \ vieww14360 \ viewh11020 \ viewkind0 \ deftab720 \ f0 \ fs32 \ cf2 \ cb3 \ expnd0 \ expndtw0 \ kerning0 \ outl0 \ strokewidth0 \ strokec2在我以前的複習中，您可以從本文頂部的系列導航鏈接進入，我介紹了兩個重要的你需要掌握的python概念爲了在您的Python學習之旅中前進 \'a0 \ –

快速提示：我的文本文件開始如下：「在我以前的複習中，您可以從本文頂部的系列導航鏈接訪問，我談到了關於您需要掌握的兩個重要Python概念，以便在您的Python學習之旅中前進。「 –

如果您還沒有安裝NLTK Data，您將需要安裝NLTK Data。

從http://www.nltk.org/data.html：

運行Python解釋並鍵入命令：

> >>> import nltk 
> >>> nltk.download()

一個新的窗口應打開，示出了NLTK下載程序。點擊文件菜單，然後選擇更改下載目錄。

然後從模型選項卡安裝punkt模型。一旦你有，你可以令牌化所有的句子，因此提取與您的前5話的人在其中：

sent_tokenize_list = nltk.sent_tokenize(text_string)  
for sentence in sent_tokenize_list: 
    for word in list_5: 
     if word in sentence: 
      print(sentence)

來源

2017-08-20 17:11:05

我試過了，如何從輸出中刪除這個附加的不必要的部分：輸出是：;} {\ levelnumbers \ '01;} \ fi-360 \ li720 \ lin720} {\ listname;} \ listid1}} { \ * \ listoverridetable {\ listoverride \ listid1 \ listoverridecount0 \ LS1}} \ margl1440 \ margr1440 \ vieww14360 \ viewh11020 \ viewkind0 \ deftab720 \ PARD \ pardeftab720 \ sl512 \ sa520 \ partightenfactor0 \ F0 \ FS32 \ CF2 \ CB3 \ expnd0 \ expndtw0 \ kerning0 \ outl0 \ strokewidth0 \ strokec2在我以前的刷新器中，您可以從本文頂部的系列導航鏈接中訪問，我說過 –

是輸出文本文件的一部分嗎？ –

不，我的文本文件開始如下：「在我以前的刷新器中，可以從系列導航欄訪問在本文頂部的離子鏈接中，我談到了您需要掌握的兩個重要的Python概念，以便在您的Python學習之旅中前進。「 –

使用Python打印屬於文檔中最常用單詞的句子

回答

相關問題