排序字符串。保留每個字符串的前X個字符。如果此前綴對字符串前後不唯一,則前進至唯一字符(與前後字符串相比)。 (如果找不到唯一字符,則該字符串沒有唯一部分,請參閱帖子底部)在這些唯一字符前後添加省略號。
注意,這可能仍然看起來很有趣:
Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook
我不知道你要做到這一點用什麼語言,但這裏有一個Python實現。
def unique_index(before, current, after, size):
'''Returns the index of the first part of _current_ of length _size_ that is
unique to it, _before_, and _after_. If _current_ has no part unique to it,
_before_, and _after_, it returns the _size_ letters at the end of _current_'''
before_unique = False
after_unique = False
for i in range(len(current)-size):
#this will be incorrect in the case mentioned below
if i > len(before)-1 or before[i] != current[i]:
before_unique = True
if i > len(after)-1 or after[i] != current[i]:
after_unique = True
if before_unique and after_unique:
return i
return len(current)-size
def ellipsize(entries, prefix_size, max_string_length):
non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.
#If you want to preserve order then make a copy and make a mapping from the copy to the original
entries.sort()
ellipsized = []
# you could probably remove all this indexing with something out of itertools
for i in range(len(entries)):
current = entries[i]
#entry is already short enough, don't need to truncate
if len(current) <= max_string_length:
ellipsized.append(current)
continue
#grab empty strings if there's no string before/after
if i == 0:
before = ''
else:
before = entries[i-1]
if i == len(entries)-1:
after = ''
else:
after = entries[i+1]
#Is the prefix unique? If so, we're done.
current_prefix = entries[i][:prefix_size]
if not before.startswith(current_prefix) and not after.startswith(current_prefix):
ellipsized.append(current[:max_string_length] + '...') #again, possibly -3
#Otherwise find the unique part after the prefix if it exists.
else:
index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
if index == prefix_size:
header = ''
else:
header = '...'
if index + non_prefix_size == len(current):
trailer = ''
else:
trailer = '...'
ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
return ellipsized
另外,你提到的字符串本身是唯一的,但它們是否都有獨特的部分?例如,「Microsoft」和「Microsoft Internet Explorer 7」是兩個不同的字符串,但第一個沒有與第二個不同的部分。如果是這種情況,那麼您必須在規範中添加一些內容,以便使該案例明確無誤。 (如果添加「Xicrosoft」,「MXcrosoft」,「MiXrosoft」等與這兩個字符串混合在一起,則存在否唯一字符串比原始字符串短以代表「Microsoft」)(另一種思考方式它:如果你有所有可能的X字符串,你不能將它們全部壓縮到X-1或更少的字符串。就像沒有壓縮方法可以壓縮所有的輸入一樣,因爲這本質上是一種壓縮方法)。結果來自原帖:
>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
print entry
Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
print entry
Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...
嗯,這不是一個不好的起點,但我不認爲這是我想要的。也許我的例子沒有被選擇清楚,但我不要求橢圓只替換相等的子字符串:只是輸出字符串是唯一的。例如,如果給出兩個輸入「Herzkreislaufwiederbelebung」和「Geschwindigkeitsbegrenzung」,並且我想修剪到長度= 12(包括圓點),則可以返回「Herzkreis ...」和「Geschwind ...」。 – Ken 2011-02-14 20:37:23
@Ken聽起來像你可以讓他們呆滯。 – Orbling 2011-02-14 20:56:27