Python: Find unknown repeating words in a list of strings

I have a list of strings, which are subjects from different email conversations. I would like to see if there are words or word combinations that are used frequently.

An example list would be:

subjects = [ 
       'Proposal to cooperate - Company Name', 
       'Company Name Introduction', 
       'Into Other Firm / Company Name',
       'Request for Proposal' 
      ]

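The function would have to detect that the combination "Company Name" is used more than once, and that "Proposal" is used more than once. These words are not known in advance though, so I guess it would have to start by trying all possible combinations.

The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?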

UPDATE

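I've used Tim Pietzcker's answer to start developing a function for this, but I'm getting stuck on applying the Counter correctly. It keeps returning the length of the list as the count for every phrase.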

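The phrases function, including a punctuation filter, a check whether the phrase has already been seen, and a maximum length of 3 words per phrase: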

def phrases(string, phrase_list):
    words = string.split()
    result = []
    punctuation = '\'\"-_,.:;!? '
    for number in range(len(words)):
        for start in range(len(words)-number):
            if number+1 <= 3:
                phrase = " ".join(words[start:start+number+1])
                if phrase in phrase_list:
                    pass
                else:
                    phrase_list.append(phrase)
                    phrase = phrase.strip(punctuation).lower()
                    if phrase:
                        result.append(phrase)
    return result, phrase_list

And then the loop through the list of subjects:

phrase_list = [] 
ranking = {} 
for s in subjects: 
    result, phrase_list = phrases(s, phrase_list) 
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list of tuples in which every count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...
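A note on the symptom described above: the generator expression in the last line iterates over `subjects` a second time but only ever reads `result` (the phrases of the current subject), so each phrase in `result` is counted `len(subjects)` times, and the Counter is rebuilt from scratch on every pass of the outer loop. Skipping phrases that are already in `phrase_list` would additionally keep any phrase from being counted twice. Below is a minimal sketch of one possible fix, assuming a single Counter that is updated once per subject. `MAX_WORDS`, `PUNCTUATION` and `count_phrases` are illustrative names, and punctuation is stripped per word before joining, which differs slightly from the code in the question.

```python
import collections

MAX_WORDS = 3                  # hypothetical cap on the number of words per phrase
PUNCTUATION = '\'"-_,.:;!? '   # same characters the question strips

def phrases(string, max_words=MAX_WORDS):
    """Return every run of 1..max_words consecutive cleaned-up words."""
    # Clean each word first so that stand-alone punctuation such as "-" disappears.
    words = [w.strip(PUNCTUATION).lower() for w in string.split()]
    words = [w for w in words if w]
    result = []
    for size in range(1, min(max_words, len(words)) + 1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result

def count_phrases(subjects):
    """Build one Counter and feed it the phrases of each subject exactly once."""
    counts = collections.Counter()
    for subject in subjects:
        counts.update(phrases(subject))
    return counts

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

all_phrases = count_phrases(subjects)
# Keep only the phrases that occur more than once.
repeated = {phrase: count for phrase, count in all_phrases.items() if count > 1}
```

With the four example subjects, `repeated` should contain 'proposal' with a count of 2 and 'company', 'name' and 'company name' with a count of 3 each.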

Asked 2016-03-03 by Vincent

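This is not a duplicate. At least not of that particular question. This is not about items in a list, but about common phrases in a list of strings. Please read the title before closing. –

The suggested duplicate question in no way answers my question... – Vincent

Just reopened it. –

So you also want to find phrases that consist of more than a single word. No problem. This should even scale quite well.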

import collections 

subjects = [ 
       'Proposal to cooperate - Company Name', 
       'Company Name Introduction', 
       'Into Other Firm / Company Name',
       'Request for Proposal',
       'Some more Firm / Company Names'
      ] 

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result

phrases() splits the input string on whitespace and returns all possible contiguous word sequences of any length:

In [2]: phrases("A Day in the Life") 
Out[2]: 
['A', 
'Day', 
'in', 
'the', 
'Life', 
'A Day', 
'Day in', 
'in the', 
'the Life', 
'A Day in', 
'Day in the', 
'in the Life', 
'A Day in the', 
'Day in the Life', 
'A Day in the Life']

Now you can count how many times each of these phrases is found across all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1]) 
Out[3]: 
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3), 
('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use other criteria than simply splitting on whitespace, for example ignoring punctuation and case.
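To illustrate that note, here is a rough sketch that tokenizes with a regular expression instead of `split()`, so that punctuation such as `-` and `/` is ignored and case is folded. `tokenize` and `ngrams` are made-up helper names, and the pattern `[a-z0-9']+` is only one assumption about what counts as a word.

```python
import collections
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words):
    """Yield every contiguous word sequence, joined by single spaces."""
    for size in range(1, len(words) + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names',
]

normalized = collections.Counter(
    phrase for subject in subjects for phrase in ngrams(tokenize(subject))
)
# (phrase, count) pairs that occur more than once, most frequent first
common = [(phrase, count) for phrase, count in normalized.most_common() if count > 1]
```

With this tokenizer the stray '/' entries disappear and 'firm company' shows up as a repeated phrase. Plural forms are still treated as different words, so 'names' is not merged with 'name'.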

Answered 2016-03-04 07:20:15

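Thanks, this is a great start. I've implemented this in a loop, but ran into some trouble with the Counter. I've updated the question with the latest status. – Vincent

I would suggest using whitespace as the separator; otherwise there are too many possibilities if you don't specify what an allowed "phrase" should look like.

To count word occurrences you can use Counter from the collections module: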

import operator 
from collections import Counter 

d = Counter(' '.join(subjects).split()) 

# create a list of tuples, ordered by occurrence frequency 
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True) 

# print all entries that occur more than once 
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name 
3 Company 
2 Proposal
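As a possible shortcut, `Counter` also has a `most_common()` method that returns the (word, count) pairs already sorted by descending count, so the explicit `sorted()` call with `operator.itemgetter` is not strictly needed. A small sketch, assuming the four subjects from the question; `word_counts` and `repeated_words` are illustrative names:

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

# Count every whitespace-separated word across all subjects.
word_counts = Counter(' '.join(subjects).split())

# most_common() yields (word, count) pairs, highest count first.
repeated_words = [(word, count) for word, count in word_counts.most_common() if count > 1]

for word, count in repeated_words:
    print(count, word)
```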

Answered 2016-03-03 15:20:40

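Thanks, this is helpful. Possibly by first getting the repeated words, I can then start looking for word combinations using the words found by this function. I'll play around with it a bit and post my results here. – Vincent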
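One way to read the idea in this comment: a phrase can only occur more than once if every word in it occurs more than once, so the repeated single words can be used to prune the candidate phrases. A rough sketch under that assumption follows. `repeated_phrases` is a hypothetical helper, and words are compared case-sensitively after a plain `split()`.

```python
from collections import Counter

def repeated_phrases(subjects):
    """Return multi-word phrases that occur more than once, built only from repeated words."""
    tokenized = [subject.split() for subject in subjects]
    word_counts = Counter(word for words in tokenized for word in words)
    frequent = {word for word, count in word_counts.items() if count > 1}

    counts = Counter()
    for words in tokenized:
        run = []  # current stretch of consecutive repeated words
        for word in words + [None]:  # None is a sentinel that flushes the last run
            if word in frequent:
                run.append(word)
                continue
            # Count every phrase of two or more words inside the finished run.
            for size in range(2, len(run) + 1):
                for start in range(len(run) - size + 1):
                    counts[" ".join(run[start:start + size])] += 1
            run = []
    return {phrase: n for phrase, n in counts.items() if n > 1}

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

found = repeated_phrases(subjects)
```

For the example subjects this leaves only 'Company Name' as a repeated multi-word phrase, since 'Proposal' never sits next to another repeated word.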

As a possible alternative to tokenizing the sentences with 'split()', you could also use the 'word_tokenize()' function from 'nltk'. http://www.nltk.org/book/ch03.html –

Similar to PP_'s answer. Using split.

import operator 

subjects = [ 
      'Proposal to cooperate - Company Name', 
      'Company Name Introduction', 
      'Into Other Firm / Company Name',
      'Request for Proposal' 
     ] 
flat_list = [item for i in subjects for item in i.split() ] 
count_dict = {i:flat_list.count(i) for i in flat_list} 
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))

Output:

[('Name', 3), 
('Company', 3), 
('Proposal', 2), 
('Other', 1), 
('/', 1), 
('for', 1), 
('cooperate', 1), 
('Request', 1), 
('Introduction', 1), 
('Into', 1), 
('-', 1), 
('to', 1), 
('Firm', 1)]

Answered 2016-03-03 15:42:18 by Faller
