省略一組名稱

好吧，我確定有人在某個地方肯定已經爲此提出了一個算法，所以我想我會在我離開前自己去重新創建它。省略一組名稱

我有一個任意（用戶輸入的）非空文本字符串的列表。每個字符串可以是任意長度（0除外），它們都是唯一的。我想將它們展示給用戶，但我想將它們修剪成我確定的固定長度，並用省略號（...）替換它們的一部分。問題是我希望所有的輸出字符串都是唯一的。

例如，如果我有字符串：

的Microsoft Internet Explorer 6
微軟的Internet Explorer 7
微軟的Internet Explorer 8
Mozilla Firefox瀏覽器3
Mozilla Firefox瀏覽器4
Google Chrome 14

然後我不想修剪字符串的末尾，因爲這是獨特的部分（不想顯示「Microsoft Internet ...」3次），但可以剪掉中間部分：

微軟... RER 6
微軟... RER 7
微軟... RER 8
是Mozilla Firefox 3
的Mozilla Firefox 4
谷歌瀏覽器14

其他時候，中間部分可能是獨一無二的，我要修剪的結尾：

公司會議，2010年5月25日的

分鐘 - 僅供內部使用
公司會議，2010年6月24日的紀要 - 內部使用公司會議，23/7/2010只有
分鐘 - 僅供內部使用

將變成：

公司會議，2010年5月25日的分...
公司會議，2010年6月24日的分...
公司會議，23/7/2010紀要...

我猜應該可能永遠ellipsize的非常開頭的字符串，即使否則將被允許的，因爲這將是怪異。我猜想它可以在字符串中的多個位置上省略橢圓，但在合理範圍內 - 也許2次會好，但3次或更多似乎過多。或者可能的次數並不像剩餘塊的大小那麼重要：橢圓之間少於大約5個字符將是毫無意義的。

輸入（數量和大小）不會很大，所以性能不是一個主要問題（好吧，只要算法不會像列舉所有可能的字符串一樣愚蠢，直到找到一個集合這樣可行！）。

我想這些要求看起來很具體，但我其實很寬鬆 - 我只是想描述一下我的想法。

之前有過類似的事情嗎？是否有一些現有的算法或庫這樣做？我搜索了一些，但到目前爲止還沒有發現任何東西（但也許我只是在使用google搜索而已）。我必須相信有人想要解決這個問題！

來源

2011-02-14 Ken

這聽起來像的longest common substring problem.

申請更換最長共同子串用省略號所有字符串。如果字符串仍然過長，並且允許您使用其他省略號，請重複。

您必須認識到，您可能無法「橢圓化」一組足夠符合長度要求的字符串。

來源

2011-02-14 20:28:17 erickson

嗯，這不是一個不好的起點，但我不認爲這是我想要的。也許我的例子沒有被選擇清楚，但我不要求橢圓只替換相等的子字符串：只是輸出字符串是唯一的。例如，如果給出兩個輸入「Herzkreislaufwiederbelebung」和「Geschwindigkeitsbegrenzung」，並且我想修剪到長度= 12（包括圓點），則可以返回「Herzkreis ...」和「Geschwind ...」。 – Ken 2011-02-14 20:37:23

@Ken聽起來像你可以讓他們呆滯。 – Orbling 2011-02-14 20:56:27

排序字符串。保留每個字符串的前X個字符。如果此前綴對字符串前後不唯一，則前進至唯一字符（與前後字符串相比）。（如果找不到唯一字符，則該字符串沒有唯一部分，請參閱帖子底部）在這些唯一字符前後添加省略號。

注意，這可能仍然看起來很有趣：

Microsoft Office -> Micro...ffice 
Microsoft Outlook -> Micro...utlook

我不知道你要做到這一點用什麼語言，但這裏有一個Python實現。

def unique_index(before, current, after, size): 
    '''Returns the index of the first part of _current_ of length _size_ that is 
     unique to it, _before_, and _after_. If _current_ has no part unique to it, 
     _before_, and _after_, it returns the _size_ letters at the end of _current_''' 
    before_unique = False 
    after_unique = False 
    for i in range(len(current)-size): 
     #this will be incorrect in the case mentioned below 
     if i > len(before)-1 or before[i] != current[i]: 
      before_unique = True 
     if i > len(after)-1 or after[i] != current[i]: 
      after_unique = True 
     if before_unique and after_unique: 
      return i 

    return len(current)-size 

def ellipsize(entries, prefix_size, max_string_length): 
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this. 

    #If you want to preserve order then make a copy and make a mapping from the copy to the original 
    entries.sort() 

    ellipsized = [] 

    # you could probably remove all this indexing with something out of itertools 
    for i in range(len(entries)): 
     current = entries[i] 

     #entry is already short enough, don't need to truncate 
     if len(current) <= max_string_length: 
      ellipsized.append(current) 
      continue 

     #grab empty strings if there's no string before/after 
     if i == 0: 
      before = '' 
     else: 
      before = entries[i-1] 
     if i == len(entries)-1: 
      after = '' 
     else: 
      after = entries[i+1] 

     #Is the prefix unique? If so, we're done. 
     current_prefix = entries[i][:prefix_size]  
     if not before.startswith(current_prefix) and not after.startswith(current_prefix): 
      ellipsized.append(current[:max_string_length] + '...') #again, possibly -3 

     #Otherwise find the unique part after the prefix if it exists. 
     else: 
      index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size) 
      if index == prefix_size: 
       header = '' 
      else: 
       header = '...' 
      if index + non_prefix_size == len(current): 
       trailer = '' 
      else: 
       trailer = '...' 
      ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer) 
    return ellipsized

另外，你提到的字符串本身是唯一的，但它們是否都有獨特的部分？例如，「Microsoft」和「Microsoft Internet Explorer 7」是兩個不同的字符串，但第一個沒有與第二個不同的部分。如果是這種情況，那麼您必須在規範中添加一些內容，以便使該案例明確無誤。（如果添加「Xicrosoft」，「MXcrosoft」，「MiXrosoft」等與這兩個字符串混合在一起，則存在否唯一字符串比原始字符串短以代表「Microsoft」）（另一種思考方式它：如果你有所有可能的X字符串，你不能將它們全部壓縮到X-1或更少的字符串。就像沒有壓縮方法可以壓縮所有的輸入一樣，因爲這本質上是一種壓縮方法）。結果來自原帖：

>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20): 
    print entry 

Google Chrome 14 
Microso...et Explorer 6 
Microso...et Explorer 7 
Microso...et Explorer 8 
Mozilla Firefox 3 
Mozilla Firefox 4 
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40): 
    print entry 

Minutes of Comp...5/25/2010 -- Internal use... 
Minutes of Comp...6/24/2010 -- Internal use... 
Minutes of Comp...7/23/2010 -- Internal use...

來源

2011-02-14 22:01:35 user470379

省略一組名稱

回答

相關問題