Counting the number of occurrences of a word

I'm looking for a better way in SAS to count the number of times a word occurs in a string. For example, searching for 'wood' in the string:

how much wood could a woodchuck chuck if a woodchuck could chuck wood

...would return a result of 2.

This is what I'd usually do, but it's a lot of code:

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    found_count = 0; 

    cnt=1; 
    word = scan(sentence,cnt); 
    do while (word ne '');
        num_times_found = sum(num_times_found, word eq search_term);
        cnt = cnt + 1;
        word = scan(sentence,cnt);
    end;

    put num_times_found=; 

run;

I could turn this into an FCMP function to make it more elegant, but I still feel there must be friendlier, more concise code.

2016-02-12 Robert Penridge

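I posted this here rather than on Code Review because I don't think Code Review has much of a SAS audience.

Isn't this just COUNTW?

@data_null_ No. That was my first thought too, but countw() only counts the total number of words, not the number of times a specific word occurs.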
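A small sketch of the distinction made in the comment above, using the sentence from the question. COUNTW reports every word, while the substring function COUNT also matches inside 'woodchuck', so neither gives the whole-word count by itself. The variable names total_words and substring_hits are illustrative only.

```sas
data _null_;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

    /* COUNTW counts all blank-delimited words, whatever they are: 13 here */
    total_words = countw(sentence, ' ');

    /* COUNT counts substring matches, so 'woodchuck' is included: 4 here */
    substring_hits = count(sentence, 'wood');

    put total_words= substring_hits=;
run;
```

The desired answer (2) lies between the two, which is why the word-by-word comparison is needed.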

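From a Code Review standpoint, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't need the initial assignment. You also have an extraneous variable, found_count; I'm not sure what that is for. Otherwise, I think this is reasonable, at least for a non-convoluted solution.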

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 

    do cnt=1 by 1 until (word eq '');
        word = scan(sentence,cnt);
        num_times_found = sum(num_times_found, word eq search_term);
    end;

    put num_times_found=; 

run;

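It's also fairly fast: 1e6 iterations take under 9 seconds on my box. The PRX solution takes less time (6 seconds) once the o option is added to the regex options, so it may be preferable when working with very large datasets or a large number of variables, though I believe the difference is unlikely to matter much compared with I/O time. The FCMP solution is in the same order of time as this solution (around 8-9 seconds). Finally, the FINDW solution is the fastest, at around 2 seconds.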
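The answer does not show how these timings were taken. One way to get a comparable number, as a rough sketch only, is to wrap the loop in an outer loop of 1e6 iterations and measure elapsed time with datetime(). The variables i, start and elapsed are assumptions for illustration, and the result will vary by machine.

```sas
data _null_;
    length sentence word $200;

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
    search_term = 'wood';

    start = datetime();               /* wall-clock time before the test */

    do i = 1 to 1e6;
        num_times_found = 0;          /* reset the count for each iteration */
        do cnt=1 by 1 until (word eq '');
            word = scan(sentence,cnt);
            num_times_found = sum(num_times_found, word eq search_term);
        end;
    end;

    elapsed = datetime() - start;     /* elapsed seconds */
    put num_times_found= elapsed= 8.2;
run;
```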

2016-02-12 16:35:35 Joe

Try using prxchange to drop 'wood', then countw.

data _null_; 
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' '); 
put _all_; 
run;

2016-02-12 16:35:01

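Technically, of course, this turns 'woodchuck' into 'chuck', but that doesn't affect the result. – Joe

And this is exactly what I mean by a 'convoluted solution': not that it's wrong, but that it's less straightforward and could be avoided on that principle (since it's harder for others to see what you're doing). – Joe

You could add the 'o' option to your prx; otherwise it takes quite a while to run many iterations. – Joe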
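As a sketch of that last suggestion, here is the same expression with o appended to the pattern options, which asks SAS to compile the regular expression only once. The empty replacement (instead of $1) is an assumption made here for clarity, since the pattern has no capture group.

```sas
data _null_;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

    /* i = ignore case, o = compile the pattern once */
    /* Remove every 'wood', then compare word counts before and after */
    count = countw(sentence,' ') - countw(prxchange('s/wood//io',-1,sentence),' ');

    put count=;
run;
```

As in the original answer, this relies on 'woodchuck' still counting as a word after 'wood' is removed from it.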

And for the sake of completeness, here it is as an FCMP function:

FCMP definition:

options cmplib=work.temp.temp; 

proc fcmp outlib=work.temp.temp; 

    function word_freq(sentence $, search_term $) ;  
    length sentence word $200; 

    do cnt=1 by 1 until (word eq ''); 
     word = scan(sentence,cnt); 
     num_times_found = sum(num_times_found, word eq search_term); 
    end; 

    return (num_times_found); 
    endsub; 

run;

Usage:

data _null_; 
    num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood'); 
    put num_times_found=; 
run;

Result:

num_times_found=2

2016-02-12 17:11:43

There's no reason to scan all the words when FINDW will efficiently do the scanning for you.

data _null_;
    length sentence search_term $200;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
    search_term = 'wood';
    cnt=0;
    do s=findw(sentence,strip(search_term),1) by 0 while(s);
        cnt+1;
        s=findw(sentence,strip(search_term),s+1);
    end;
    put cnt= search_term=;
    stop;
run;

cnt=2 search_term=wood

2016-02-12 17:59:12

Definitely much faster than the SCAN approach. – Joe
