2016-02-12 124 views
2

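Counting the occurrences of a word

I'm looking for a better way in SAS to count the number of times a particular word appears in a string. For example, searching for 'wood' in the string: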

how much wood could a woodchuck chuck if a woodchuck could chuck wood 

...would return a result of 2.

This is what I would normally do, but it's a lot of code:

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    found_count = 0; 

    cnt=1; 
    word = scan(sentence,cnt); 
    do while (word ne ''); 
        num_times_found = sum(num_times_found, word eq search_term); 
        cnt = cnt + 1; 
        word = scan(sentence,cnt); 
    end; 

    put num_times_found=; 

run; 

I could turn this into an fcmp function to make it more elegant, but I still feel there must be friendlier, more concise code for this.

+0

I posted this here rather than on codereview because I don't think codereview has any SAS audience. –

+0

Isn't this just countW? –

+0

@data_null_ No - that was my first thought too, but `countw()` only counts the total number of words, not the number of times a specific word appears. –

Answers
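To illustrate the distinction drawn in the comment above, here is a minimal sketch (not from the original thread) that runs `countw()` on the example sentence. The variable name `total_words` is made up for illustration.

```sas
data _null_;
    length sentence $200;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

    /* countw() counts every word in the string, whatever the word is */
    total_words = countw(sentence, ' ');

    put total_words=;   /* the sentence has 13 space-separated words */
run;
```

The result is the size of the whole sentence, so on its own it says nothing about how often 'wood' occurs.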

3

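From a Code Review point of view, the above can be improved a little. The do loop can handle the cnt increment itself, and if you switch it to until you don't need the initial assignment. You also have an extraneous variable, found_count; I'm not sure what that is for. Otherwise, I think this is reasonable, at least for a non-convoluted solution.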

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 

    do cnt=1 by 1 until (word eq ''); 
        word = scan(sentence,cnt); 
        num_times_found = sum(num_times_found, word eq search_term); 
    end; 

    put num_times_found=; 

run; 

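It's also fairly fast: 1e6 iterations take less than 9 seconds on my box. The PRX solution takes less time (6 seconds) once o is added to the regex options, so it might be preferable with very large datasets or large numbers of variables, though I suspect the difference will be minor compared to I/O time. The FCMP solution is in the same order of time as this one (both about 8-9 seconds). Finally, the FINDW solution is the fastest, at around 2 seconds.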
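The timings above come from the answerer's own machine, and the test harness is not shown. A minimal sketch of how such a comparison could be set up, assuming a simple repeat loop timed with `datetime()` (the names `iter`, `start` and `elapsed` are illustrative, not from the original):

```sas
data _null_;
    length sentence word $200;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
    search_term = 'wood';

    start = datetime();                 /* wall-clock time before the loop */

    do iter = 1 to 1e6;                 /* assumed repeat count, matching the 1e6 quoted above */
        num_times_found = 0;            /* reset the counter for each repetition */
        do cnt = 1 by 1 until (word eq '');
            word = scan(sentence, cnt);
            num_times_found = sum(num_times_found, word eq search_term);
        end;
    end;

    elapsed = datetime() - start;       /* elapsed seconds */
    put elapsed= num_times_found=;
run;
```

Swapping the inner loop for one of the other approaches in this thread would give comparable numbers on the same machine. Absolute times will differ from those quoted.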

2

Try using prxchange to drop 'wood' from the string, then compare the countw results.

data _null_; 
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' '); 
put _all_; 
run; 
+0

Technically this will of course turn 'woodchuck' into 'chuck', but that doesn't affect the result. – Joe

+0

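And this is exactly what I meant by a 'convoluted solution': not that it's wrong, but that it's less straightforward, and on that principle could be avoided (since it's harder for others to see what you're doing). – Joe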

+0

You might want to add the 'o' option to your prx; otherwise it takes quite a long time to run many iterations. – Joe
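A minimal sketch of what the suggestion in the comments could look like, with the `o` (compile once) option added to the regular expression. The `\b` word boundaries are an addition not present in the original answer; they keep 'woodchuck' intact so that only whole-word matches are removed. The variable name `stripped` is illustrative.

```sas
data _null_;
    length sentence stripped $200;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

    /* Remove every whole-word 'wood'; i = ignore case, o = compile the regex once */
    stripped = prxchange('s/\bwood\b//io', -1, sentence);

    /* 13 words before the removal, 11 after, so the difference is the match count */
    count = countw(sentence, ' ') - countw(stripped, ' ');

    put count=;
run;
```

This still relies on the indirect "count words before and after" trick, so the readability concern raised in the comments applies to it as well.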

2

And for the sake of completeness, here it is as an FCMP function:

FCMP definition:

options cmplib=work.temp.temp; 

proc fcmp outlib=work.temp.temp; 

    function word_freq(sentence $, search_term $);
        length sentence word $200; 

        do cnt=1 by 1 until (word eq ''); 
            word = scan(sentence,cnt); 
            num_times_found = sum(num_times_found, word eq search_term); 
        end; 

        return (num_times_found); 
    endsub; 

run; 

Usage:

data _null_; 
    num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood'); 
    put num_times_found=; 
run; 

Result:

num_times_found=2 
3

There's no reason to scan through every word when FINDW will do the scanning for you efficiently.

data _null_; 
    length sentence search_term $200; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    cnt=0; 
    do s=findw(sentence,strip(search_term),1) by 0 while(s); 
        cnt+1; 
        s=findw(sentence,strip(search_term),s+1); 
    end; 
    put cnt= search_term=; 
    stop; 
run; 

cnt=2 search_term=wood 
+0

Definitely much faster than the SCAN approach. – Joe
