2017-02-21
4

Generating n-grams with Julia

To generate word bigrams in Julia, I can simply zip the original list with a copy of it that drops the first element:

julia> s = split("the lazy fox jumps over the brown dog") 
8-element Array{SubString{String},1}: 
"the" 
"lazy" 
"fox" 
"jumps" 
"over" 
"the" 
"brown" 
"dog" 

julia> collect(zip(s, drop(s,1))) 
7-element Array{Tuple{SubString{String},SubString{String}},1}: 
("the","lazy") 
("lazy","fox") 
("fox","jumps") 
("jumps","over") 
("over","the") 
("the","brown") 
("brown","dog") 

To generate trigrams, I can use the same collect(zip(...)) idiom:

julia> collect(zip(s, drop(s,1), drop(s,2))) 
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox") 
("lazy","fox","jumps") 
("fox","jumps","over") 
("jumps","over","the") 
("over","the","brown") 
("the","brown","dog") 

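But I have to add the third list to the zip by hand. Is there an idiomatic way to do this for n-grams of any order?

For example, I would like to avoid doing this to extract 5-grams: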

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4))) 
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox","jumps","over") 
("lazy","fox","jumps","over","the") 
("fox","jumps","over","the","brown") 
("jumps","over","the","brown","dog") 

Answers

4

Here is a clean one-liner for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...)) 

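It uses a generator comprehension to iterate over k, the number of elements to drop. The splat (...) operator then unpacks the resulting Drop iterators into the arguments of zip, and finally collect turns the zipped result into an Array.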

julia> ngram(s, 2) 
7-element Array{Tuple{SubString{String},SubString{String}},1}: 
("the","lazy") 
("lazy","fox") 
("fox","jumps") 
("jumps","over") 
("over","the") 
("the","brown") 
("brown","dog") 

julia> ngram(s, 5) 
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox","jumps","over") 
("lazy","fox","jumps","over","the") 
("fox","jumps","over","the","brown") 
("jumps","over","the","brown","dog") 

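As you can see, this is very similar to your solution. It only adds a simple comprehension that iterates over the number of elements to drop, so that the n-gram length can be dynamic.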
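A note for readers on Julia 1.x: there, drop is not exported from Base and is reached through the Iterators module instead. Below is a minimal sketch of the same one-liner under that assumption. The name ngram follows the answer above; the sample sentence is the one from the question.

```julia
# Same idea as the one-liner above, written for Julia 1.x, where `drop`
# is accessed as `Iterators.drop` (Base.Iterators) rather than exported from Base.
ngram(s, n) = collect(zip((Iterators.drop(s, k) for k in 0:n-1)...))

s = split("the lazy fox jumps over the brown dog")

bigrams = ngram(s, 2)    # 7 tuples of 2 words each
fivegrams = ngram(s, 5)  # 4 tuples of 5 words each

first(bigrams)           # ("the", "lazy")
```

zip stops as soon as its shortest argument is exhausted, which is why the result has length(s) - n + 1 entries without any explicit bounds handling.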

+0

Cool! Thanks @HarrisonGrodin, I didn't know 'drop(s,0)' was possible =) – alvas

+1

@alvas No problem! Also, in cases where 'drop(s,0)' isn't feasible, the following will work. :) 'zip(s, (drop(s, k) for k = 1:n-1)...)' –

5

Another approach is to use partition() from Iterators.jl:

ngram(s,n) = collect(partition(s, n, 1)) 
4

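If you change the output slightly and use SubArrays instead of Tuples, little is lost, and it becomes possible to avoid allocation and memory copying. If the underlying word list is static, this is fine, and it is faster (in my benchmarks). The code: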

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1] 

And the output:

julia> ngram(s,5) 
SubString{String}["the","lazy","fox","jumps","over"] 
SubString{String}["lazy","fox","jumps","over","the"] 
SubString{String}["fox","jumps","over","the","brown"] 
SubString{String}["jumps","over","the","brown","dog"] 

julia> ngram(s,5)[1][3] 
"fox" 

For larger word lists, the memory requirements are considerably smaller as well.
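To illustrate why the "static word list" condition matters, here is a small sketch. The function name ngram_views and the sample words are made up for the example. Each n-gram is a view into the original vector, so a later change to that vector shows through every view that covers the changed position.

```julia
# Each n-gram is a SubArray that references `s` instead of copying its elements.
ngram_views(s, n) = [view(s, i:i+n-1) for i in 1:length(s)-n+1]

words = ["the", "lazy", "fox", "jumps"]
grams = ngram_views(words, 2)

# Mutating the parent vector is visible through the views,
# because no word was copied when the n-grams were built.
words[2] = "quick"

grams[1]  # now holds "the", "quick"
grams[2]  # now holds "quick", "fox"
```

If the n-grams need to survive later edits to the word list, copying each view (for example with collect) gives up the memory saving but decouples them from the parent.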

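Also note that using a generator lets you process the n-grams one at a time, faster and with less memory, and that may be all the downstream processing code needs (counting something, or passing the n-grams through some hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).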
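As a rough sketch of that one-at-a-time processing, the example below counts n-grams without ever building the full list. To stay free of package dependencies it swaps in zip over Iterators.drop from Base (Julia 1.x) in place of partition; the names lazy_ngrams and count_ngrams and the sample sentence are invented for illustration.

```julia
# Lazily yield each n-gram as a tuple; nothing is collected up front.
lazy_ngrams(s, n) = zip((Iterators.drop(s, k) for k in 0:n-1)...)

# Count occurrences by consuming the lazy iterator one n-gram at a time.
function count_ngrams(s, n)
    counts = Dict{Any,Int}()
    for gram in lazy_ngrams(s, n)
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

words = split("the lazy fox jumps over the lazy fox")
counts = count_ngrams(words, 2)

counts[(words[1], words[2])]  # 2, since "the lazy" appears twice
```

Only the dictionary of distinct n-grams is kept in memory here; the intermediate array of all n-grams is never allocated.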