2015-06-16 20 views
0

所以我試圖完成的任務是從字符串s中獲取長度爲n的所有可能模式的頻率列表。哈斯克爾頻率計數

Input (Text,Length) 
Output (Frequency) 
String -> Int -> [Int] 
freqCount s n = frequency list [int] in alphabetical order 

[ 「A」, 「B」, 「C」, 「d」]

其中s中的字符被限定於以上所提到的四個,所以我第一步驟是獲得長度爲n的所有可能的排列都允許重複。

permutationsR k = sort(replicateM k ['A','B','C','D']) 

例如,

permutationsR 2 

會給輸出

["AA","AB","AC","AD","BA","BB","BC","BD","CA","CB","CC","CD","DA","DB","DC","DD"] 

,然後每個模式得到了多少次,每次發生計數 所以像

patternCount:: String -> String -> Int 
patternCount text pattern = length (filter (\x -> x == pattern) [take (length(pattern)) (drop x text)| x <- [0..length(text)- length(pattern)]]) 
frequencyCount s n = map (\x -> patternCount s x) (permutationsR n) 

但是我想,這是怎麼回事是非常低效的,因爲我基本上是通過整個列表來檢查每個模式長度(permutationsRn)次,而不是我理由應該是poss只需要一次迭代即可完成。

有沒有一種方法可以生成頻率圖,就像我在命令式語言中那樣。

即僞

where s = string 
and n = length of pattern 
//pattern is a map where key = pattern and value = frequencyCount 

patterns = {"AA":0,"AB":0,"AC:0...} 
for (i = 0; i < (length s - n); i++){ 
    patterns[s[i:(i+n)]] += 1 
} 

基本上只是通過迭代一次,從分割的字符串(I:I + N) 和更新模式映射與每次出現。

例如輸入,輸出將是這樣的

s= "AABBCA" 
n = 2 
frequencyList s n = [1,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0] 

回答

3

這裏的一個可能的解決方案:

import Data.List 

groupsOf n str = 
    unfoldr (\s -> 
      if length s < n 
      then Nothing 
      else Just (take n s, tail s)) str 

frequency :: (Ord t, Eq t) => [t] -> [(t, Int)] 
frequency = 
    map (\s -> (head s, length s)) . group . sort 

groupsOf將輸入字符串成重疊的長度爲n的序列。例如,

groupsOf 3 "AABCABC" 

會給

["AAB", "ABC", "BCA", "CAB", "ABC"]. 

frequency然後將統計每個子的出現,所以

frequency $ groupsOf 3 "AABCABC" 

應該給

[("AAB", 1), ("ABC", 2), ("BCA", 1), ("CAB", 1)]. 

任何subsequenc e不會在結果中出現零次。