Python - Collapsing groups of strings

-2

I have a list of strings, for example ['Apple', 'Appl','Elephnt', 'Elephant']. I need to collapse this list into distinct groups, i.e. ['Apple', 'Elephnt'].

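The criterion for strings belonging to the same group is a percentage match above 80%. That is, Apple and Appl share an 88% match, while Elephnt and Elephant share a 93% match.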

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

The function similar computes the percentage match between two strings. How can I compute this collapsed grouping using the function above?
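As a quick check of the figures quoted above: the value returned by ratio() is 2*M/T, where M is the number of matched characters and T is the combined length of both strings. A minimal sketch (the pairs variable is only an illustrative name):

```python
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns 2*M/T: M = matched characters, T = len(a) + len(b)
    return SequenceMatcher(None, a, b).ratio()

pairs = [('Apple', 'Appl'), ('Elephnt', 'Elephant')]
for a, b in pairs:
    # Apple/Appl: 2*4/9, Elephnt/Elephant: 2*7/15
    print(a, b, round(similar(a, b), 3))
```

This prints roughly 0.889 and 0.933, consistent with the 88% and 93% mentioned in the question.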

Source

2017-07-07 Bryce Ramgovind

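What does this have to do with 'pandas'? –

How do you choose the group representative? Is it always the first word of the group in the original list? – randomir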

In case you want your initial list of strings (names) split into groups, where each group is a list of similar strings:

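from difflib import SequenceMatcher
from functools import partial

def is_similar(a, b):
    # True when the two strings match by more than 80%
    return SequenceMatcher(None, a, b).ratio() > 0.8

def similar_groups(names):
    remaining = set(names)
    groups = []
    while remaining:
        # Take an arbitrary name as the reference for a new group
        ref = remaining.pop()
        # list() is needed on Python 3, where filter() returns an iterator
        group = [ref] + list(filter(partial(is_similar, ref), remaining))
        groups.append(group)
        remaining -= set(group)
    return groups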

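For example (the order of the groups, and of the names within each group, may vary because remaining is a set):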

>>> similar_groups(['Apple', 'Appl','Elephnt', 'Elephant']) 
[['Elephant', 'Elephnt'], ['Appl', 'Apple']]
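If a predictable order matters, one option is to work on a list instead of a set, so that the first ungrouped name always becomes the group's representative. The following is only a sketch under that assumption; similar_groups_ordered is a hypothetical name, not part of the answer above, and unlike the set-based version it keeps exact duplicates inside a group.

```python
from difflib import SequenceMatcher

def is_similar(a, b):
    return SequenceMatcher(None, a, b).ratio() > 0.8

def similar_groups_ordered(names):
    # Hypothetical variant: a list (not a set) preserves the input order
    remaining = list(names)
    groups = []
    while remaining:
        ref = remaining[0]  # first ungrouped name is the representative
        group, rest = [ref], []
        for name in remaining[1:]:
            # Names similar to the representative join its group,
            # everything else waits for a later pass
            (group if is_similar(ref, name) else rest).append(name)
        groups.append(group)
        remaining = rest
    return groups

print(similar_groups_ordered(['Apple', 'Appl', 'Elephnt', 'Elephant']))
```

For the sample list this gives [['Apple', 'Appl'], ['Elephnt', 'Elephant']] on every run.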

Source

2017-07-07 12:12:12 randomir

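This seems to be what you want. Its main problem is that it runs in quadratic time for large lists of non-similar strings. Given that "similar" is close to equality but not quite the same (for example, it is not transitive), I can't see any way to reduce the order of the algorithm. For instance, I can't see how to sort the items so that the groupby function from itertools could be used.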
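To make the non-transitivity point concrete, here is a small sketch with made-up strings (they are not from the question, and were chosen only so the arithmetic is easy to follow):

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Illustrative strings: a and c each differ from b at opposite ends
a, b, c = '0123456789AB', '0123456789CD', 'EF23456789CD'

print(similar(a, b) > 0.8)  # a and b share a 10-character block: 20/24
print(similar(b, c) > 0.8)  # b and c share a 10-character block: 20/24
print(similar(a, c) > 0.8)  # a and c share only 8 characters: 16/24
```

Here a is similar to b and b is similar to c, yet a is not similar to c, so the strings cannot be partitioned into clean equivalence classes.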

The main idea is to add a string to the result list only if it is not similar to any string already in that list.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def collapse_similar(strlist):
    """Eliminate "duplicate" strings in a list, where "duplicate" means
    similar() is more than 80%.
    """
    result = []
    for s in strlist:
        if all(similar(s, v) <= 0.8 for v in result):
            result.append(s)
    return result

The result of collapse_similar(['Apple', 'Appl','Elephnt', 'Elephant']) is ['Apple', 'Elephnt'], as desired.
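One consequence worth noting is that the result can depend on the order of the input, because each string is only compared against what has already been kept. A sketch with made-up strings (not from the question) that illustrates this:

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def collapse_similar(strlist):
    result = []
    for s in strlist:
        # Keep s only if it is not similar to anything kept so far
        if all(similar(s, v) <= 0.8 for v in result):
            result.append(s)
    return result

# Illustrative strings: b is similar to both a and c, but a and c are not
a, b, c = '0123456789AB', '0123456789CD', 'EF23456789CD'

print(collapse_similar([a, b, c]))  # keeps a and c
print(collapse_similar([b, a, c]))  # keeps only b
```

With the list from the question this does not matter, since the two groups are clearly separated.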

Source

2017-07-07 11:52:02

-1

Well, there are many ways to do this. Here is one example:

from difflib import SequenceMatcher
from itertools import combinations
a_list = ['Apple', 'Appl','Elephnt', 'Elephant']

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio() > 0.8

# Keep the first element of every pair that matches by more than 80%
print([a for (a, b) in combinations(a_list, 2) if similar(a, b)])

Source

2017-07-07 13:00:08 Warmley

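I believe your code includes a string if it is similar to a later string, which is not what was asked. Your code leaves out any singletons (strings with no other similar string) and leaves out only the last exact copy of a string (ADDED: it seems to be more complicated than that). So for ['Apple', 'Elephant', 'Elephant', 'Elephant'] you would print ['Elephant', 'Elephant', 'Elephant'], but the OP wants ['Apple', 'Elephant']. –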
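The behaviour described in this comment can be reproduced with a short sketch. It reuses the pairwise approach from the answer on the commenter's input; the names words and pairs_result are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio() > 0.8

words = ['Apple', 'Elephant', 'Elephant', 'Elephant']

# One entry is emitted per similar pair, so three identical strings
# produce three pairs, while the singleton 'Apple' never appears
pairs_result = [a for (a, b) in combinations(words, 2) if similar(a, b)]
print(pairs_result)
```

This prints ['Elephant', 'Elephant', 'Elephant'], matching the comment: the singleton is dropped and the repeated string appears once per matching pair.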
