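Walking a string to guess a name from an email address, using a dictionary of names?

Say I have a dictionary of names (a huge CSV file). I want to guess a name from an email address that has no obvious parsable delimiter (., -, _). I want to do something like this: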

    dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
    word = "johnsmith"
    x = 0
    y = word.length - 1
    name_array = []
    for i in x..y
      match_me = word[x..i]
      dict.each do |name|
        if match_me == name
          name_array << name
        end
      end
    end

    name_array
    # => ["john"]

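Not bad, but what I want is "john smith" or ["john", "smith"].

In other words, I walk through the word (i.e., the unparsed email string, "[email protected]") until I find a match in the dictionary. I know: this is horribly inefficient. If there's a simpler way to do this, I'm all ears!

If there's no better way to do it, then please show me how to fix the example above, because it has two major flaws: (1) how do I set the length of the loop (see the problem of finding "i" below), and (2) how do I increment "x" in the example above so that I can walk through all possible character combinations of an arbitrary string?

The problem of finding the length of the loop, "i":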

for an arbitrary word, how can we derive "i" given the pattern below? 

    for a (i = 1) 
    a 

    for ab (i = 3) 
    a 
    ab 
    b 

    for abc (i = 6) 
    a 
    ab 
    abc 
    b 
    bc 
    c 

    for abcd (i = 10) 
    a 
    ab 
    abc 
    abcd 
    b 
    bc 
    bcd 
    c 
    cd 
    d 

    for abcde (i = 15) 
    a 
    ab 
    abc 
    abcd 
    abcde 
    b 
    bc 
    bcd 
    bcde 
    c 
    cd 
    cde 
    d 
    de 
    e

Source

2011-11-25 MorningHacker

Further research shows that "i" can be derived from the triangular number sequence: a(n) = C(n+1, 2) = n(n+1)/2 = 0 + 1 + 2 + ... + n. http://oeis.org/search?q=1%2C+3%2C+6%2C+10%2C+15&language=english&go=Search – MorningHacker
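For what it's worth, here is a minimal sketch of how both loop questions could be handled together: an outer loop moves the start index (the "x" in the question) and an inner loop moves the end index, which yields exactly n(n+1)/2 substrings. The helper name `substrings` is made up for illustration and is not from the original post.

```ruby
# Hypothetical helper: every contiguous substring of `word`,
# in the same order as the listing in the question.
def substrings(word)
  (0...word.length).flat_map do |start|
    (start...word.length).map { |stop| word[start..stop] }
  end
end

subs = substrings("abcd")
# => ["a", "ab", "abc", "abcd", "b", "bc", "bcd", "c", "cd", "d"]

n = "abcd".length
count_matches_formula = subs.length == n * (n + 1) / 2   # 10 == 10

# Filtering the substrings against the dictionary finds both names.
dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
hits = substrings("johnsmith").select { |s| dict.include?(s) }
# => ["john", "smith"]
```

This is still brute force, and for a huge dictionary the `dict.include?` lookup would probably be better served by a `Set` or a `Hash`.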

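I dare suggest a brute-force solution that isn't very elegant, but is still useful when:

you have a large number of items (building a regexp could be a pain)
the string to analyze is not limited to two components
you want to get all splittings of the string
you only want complete analyses of the string, i.e. from ^ to $.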

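Because my English is poor, I couldn't come up with a long personal name that can be split in more than one way, so let's analyze a phrase instead: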

word = "godisnowhere"

The dictionary:

@dict = [ "god", "is", "now", "here", "nowhere", "no", "where" ] 

@lengths = @dict.collect {|w| w.length }.uniq.sort

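The @lengths array adds a slight optimization to the algorithm: we use it to skip subwords whose lengths don't occur in the dictionary at all, without actually performing a dictionary lookup. The array is sorted, which is another optimization, because the loop can stop at the first length that exceeds the remaining word.

The main part of the solution is a recursive function that finds an initial subword (the head) in the given word and then starts over on the remaining tail.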

def find_head_substring(word)

  # boundary condition:
  # remaining subword is shorter than the shortest word in @dict
  return [] if word.length < @lengths[0]

  splittings = []

  @lengths.each do |len|
    break if len > word.length

    head = word[0, len]

    if @dict.include?(head)
      tail = word[len..-1]

      if tail.length == 0
        splittings << head
      else
        tails = find_head_substring(tail)
        unless tails.empty?
          tails.collect! { |tail| "#{head} #{tail}" }
          splittings.concat tails
        end
      end
    end
  end

  return splittings
end

Now let's see how it works:

find_head_substring(word) 
=>["god is no where", "god is now here", "god is nowhere"]

I haven't tested it extensively, so I apologize in advance :)

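Source

2011-11-25 16:20:31

I like where this is heading, but this approach has trouble with "johnjsmith" when "j" isn't in the dictionary. @theTinMan's approach seems to ignore the "j" and find the other matches inside the string. – MorningHacker

Although... it looks like I can add every single letter of the alphabet to @dict. In that case, your approach returns "john j smith". Very nice! – MorningHacker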
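A rough sketch of that single-letter idea, written as a self-contained variant rather than the answer's exact code. The helper `all_splits`, the use of `Set`, and the "fewest parts wins" ranking are assumptions added for illustration. One thing it makes visible is that once single letters are allowed, several splittings come back, so some ranking is needed to pick "john j smith".

```ruby
require "set"

# Hypothetical helper: every way to split `word` into dictionary entries,
# each splitting returned as an array of parts.
def all_splits(word, dict)
  return [[]] if word.empty?

  (1..word.length).flat_map do |len|
    head = word[0, len]
    next [] unless dict.include?(head)

    all_splits(word[len..-1], dict).map { |rest| [head] + rest }
  end
end

names = %w[sam joe john parker jane smith doe]
dict  = Set.new(names + ("a".."z").to_a)   # names plus single letters

splits = all_splits("johnjsmith", dict)
# Four splittings, because "john" and "smith" can also be spelled out
# letter by letter.

# Assumed heuristic: prefer the splitting with the fewest parts.
best = splits.min_by(&:length)
# => ["john", "j", "smith"]
```

The recursion is exponential in the worst case, so for long strings or large dictionaries some memoization on the remaining tail would likely be worth adding.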

I'm not sure I follow what you're after, but isn't it as simple as:

dict.each do |first|
  dict.each do |last|
    puts first, last if first + last == word
  end
end

Source

2011-11-25 08:41:24 pguardiario

r = /^(#{Regexp.union(dict)})(#{Regexp.union(dict)})$/ 
word.match(r) 
=> #<MatchData "johnsmith" 1:"john" 2:"smith">

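The regexp might take a while to build, but matching with it is very fast.

Source

2011-11-25 08:56:27 Reactormonk

I like it, but I think you want ^ $ boundaries – pguardiario

What are the ^ $ boundaries for? – MorningHacker

Start/end of the string – Reactormonk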
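A small aside on those anchors, as a sketch: in Ruby, `^` and `$` match at the start and end of each line, while `\A` and `\z` match only at the start and end of the whole string. For a single-line email local part the two behave the same, but the difference shows up as soon as the input contains a newline. The variable names below are made up for illustration.

```ruby
dict  = %w[sam joe john parker jane smith doe]
names = Regexp.union(dict)

line_anchored   = /^(#{names})(#{names})$/    # anchors at line boundaries
string_anchored = /\A(#{names})(#{names})\z/  # anchors at string boundaries

multiline = "spam\njohnsmith"

line_hit   = line_anchored.match?(multiline)    # => true, second line matches
string_hit = string_anchored.match?(multiline)  # => false

captures = string_anchored.match("johnsmith").captures
# => ["john", "smith"]
```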

This one collects all occurrences, not necessarily exactly two:

pattern = Regexp.union(dict) 
matches = [] 
while match = word.match(pattern) 
    matches << match.to_s # Or just leave off to_s to keep the match itself 
    word = match.post_match 
end 
matches
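As far as I can tell, the loop above does the same thing as `String#scan` with the union pattern, since both collect successive non-overlapping matches from left to right. A short sketch, which also shows that characters not in the dictionary are silently skipped:

```ruby
dict    = %w[sam joe john parker jane smith doe]
pattern = Regexp.union(dict)

# scan returns every non-overlapping match as an array of strings
# (the pattern has no capture groups).
matches = "johnsmith".scan(pattern)
# => ["john", "smith"]

# The stray "j" is not in the dictionary, so it is simply skipped.
skipped = "johnjsmith".scan(pattern)
# => ["john", "smith"]
```

Unlike the loop, this leaves `word` untouched, which may or may not matter for your use.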

Source

2011-11-25 09:05:55 phs

If you just want the dictionary entries that hit somewhere in the word:

dict.select{ |r| word[/#{r}/] } 
=> ["john", "smith"]

You run the risk of too many confusing sub-hits, so you might want to sort your dictionary so that longer names come first:

dict.sort_by{ |w| -w.size }.select{ |r| word[/#{r}/] } 
=> ["smith", "john"]

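You will still run into situations where a longer name contains a shorter one as a substring and you get multiple hits, so you'll need to figure out a way to weed those out. You could keep one array of first names and another of last names, and take the first result returned from scanning with each, but given the diversity of first and last names, that doesn't guarantee 100% accuracy and will still collect some bad results.

There's no really good solution to this kind of problem without giving the code further hints about the person's name. Perhaps scanning the body of the message for its salutation or valediction sections would help.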
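The two-array idea could be sketched roughly like this. The split of the sample dictionary into `first_names` and `last_names` and the helper `first_hit` are assumptions made up for illustration; real name lists overlap heavily, so treat the result as a guess rather than an answer.

```ruby
# Assumed split of the sample dictionary into first and last names.
first_names = %w[sam joe john jane]
last_names  = %w[parker smith doe]

# Hypothetical helper: the first hit in `word`, trying longer names
# before shorter ones when several could match at the same position.
def first_hit(word, names)
  pattern = Regexp.union(names.sort_by { |n| -n.size })
  word[pattern]   # matched substring, or nil when nothing matches
end

word  = "johnsmith"
guess = [first_hit(word, first_names), first_hit(word, last_names)]
# => ["john", "smith"]
```

A `nil` in either slot means that array had no hit at all, which is probably the cue to fall back to something else, such as the message body.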

Source

2011-11-25 16:58:43
