C語言頻率統計（非C++）

給定一個由單個空格分隔的單詞組成的字符串，按照它們出現在字符串中的次數排序，以降序打印單詞。C語言頻率統計（非C++）

例如「AB BC BC」的輸入串將產生以下輸出：

bc : 2 
ab : 1

這個問題將如果C++的數據結構，如地圖，使用可以容易地得到解決。但是，如果問題只能在普通的老C中解決，那看起來就更難了。

我應該在這裏使用什麼樣的數據結構和算法？請儘可能詳細。我在DS和Algo方面很弱。 :-(

來源

2012-01-04 Qiang Xu

您可以使用鏈接的結構列表。 – CanSpice 2012-01-04 17:12:09

沒有時間進入細節，看看'trie's。 – 2012-01-04 17:12:51

下面是如何我做一個樣本。 findWord（）中的搜索可以被優化。分配數量也可以通過分配文字塊而不是一次一個來減少。也可以爲這種情況實現一個鏈表。它缺少內存釋放。這應該有希望讓你去。

#include <stdio.h> 
    #include <assert.h> 
    #include <stdlib.h> 

    #define MAXWORDLEN 128 

    const char* findWhitespace(const char* text) 
    { 
     while (*text && !isspace(*text)) 
      text++; 
     return text; 
    } 

    const char* findNonWhitespace(const char* text) 
    { 
     while (*text && isspace(*text)) 
      text++; 
     return text; 
    } 

    typedef struct tagWord 
    { 
     char word[MAXWORDLEN + 1]; 
     int count; 
    } Word; 

    typedef struct tagWordList 
    { 
     Word* words; 
     int count; 
    } WordList; 

    WordList* createWordList(unsigned int count); 

    void extendWordList(WordList* wordList, const int count) 
    { 
     Word* newWords = (Word*)malloc(sizeof(Word) * (wordList->count + count)); 
     if (wordList->words != NULL) { 
      memcpy(newWords, wordList->words, sizeof(Word)* wordList->count); 
      free(wordList->words); 
     } 
     for (int i = wordList->count; i < wordList->count + count; i++) { 
      newWords[i].word[0] = '\0'; 
      newWords[i].count = 0; 
     } 
     wordList->words = newWords; 
     wordList->count += count; 
    } 

    void addWord(WordList* wordList, const char* word) 
    { 
     assert(strlen(word) <= MAXWORDLEN); 
     extendWordList(wordList, 1); 
     Word* wordNode = &wordList->words[wordList->count - 1]; 
     strcpy(wordNode->word, word); 
     wordNode->count++; 
    } 

    Word* findWord(WordList* wordList, const char* word) 
    { 
     for(int i = 0; i < wordList->count; i++) { 
      if (stricmp(word, wordList->words[i].word) == 0) { 
       return &wordList->words[i]; 
      } 
     } 
     return NULL; 
    } 

    void updateWordList(WordList* wordList, const char* word) 
    { 
     Word* foundWord = findWord(wordList, word); 
     if (foundWord == NULL) { 
      addWord(wordList, word); 
     } else { 
      foundWord->count++; 
     } 
    } 

    WordList* createWordList(unsigned int count) 
    { 
     WordList* wordList = (WordList*)malloc(sizeof(WordList)); 
     if (count > 0) { 
      wordList->words = (Word*)malloc(sizeof(Word) * count); 
      for(unsigned int i = 0; i < count; i++) { 
       wordList->words[i].count = 0; 
       wordList->words[i].word[0] = '\0'; 
      } 
     } 
     else { 
      wordList->words = NULL; 
     } 
     wordList->count = count;  
     return wordList; 
    } 

    void printWords(WordList* wordList) 
    { 
     for (int i = 0; i < wordList->count; i++) { 
      printf("%s: %d\n", wordList->words[i].word, wordList->words[i].count); 
     } 
    } 

    int compareWord(const void* vword1, const void* vword2) 
    { 
     Word* word1 = (Word*)vword1; 
     Word* word2 = (Word*)vword2; 
     return strcmp(word1->word, word2->word); 
    } 

    void sortWordList(WordList* wordList) 
    { 
     qsort(wordList->words, wordList->count, sizeof(Word), compareWord); 
    } 

    void countWords(const char* text) 
    { 
     WordList *wordList = createWordList(0); 
     Word  *foundWord = NULL; 
     const char *beg = findNonWhitespace(text); 
     const char *end; 
     char  word[MAXWORDLEN]; 

     while (beg && *beg) { 
      end = findWhitespace(beg); 
      if (*end) { 
       assert(end - beg <= MAXWORDLEN); 
       strncpy(word, beg, end - beg); 
       word[end - beg] = '\0'; 
       updateWordList(wordList, word); 
       beg = findNonWhitespace(end); 
      } 
      else { 
       beg = NULL; 
      } 
     } 

     sortWordList(wordList); 
     printWords(wordList); 
    } 

int main(int argc, char* argv[]) 
{ 
    char* text = "abc 123 abc 456 def 789 \tyup this \r\ncan work yup 456 it can"; 
    countWords(text); 
}

來源

2012-01-04 19:28:18 Jack

謝謝傑克！我會在免費時仔細查看示例代碼。 – 2012-01-04 20:38:00

要在文本中查找單詞，可以使用現有的strtok。由於您可能不想修改原始文本（並且strtok會將分隔符更改爲空終止符），因此可以先將整個緩衝區複製到一個專用緩衝區中。雖然在POSIX上有一個strtok_r，但代碼當然不會重入（線程安全）。或者，只需複製整個緩衝區，將副本中的空白轉爲空終止符，同時將該詞的起始位置保留在不同的指針中。 – CashCow 2012-01-05 09:33:33

@CashCow：非常感謝您對細膩部分的詳細解釋。 – 2012-01-05 16:04:31

，你可以使用一個數據結構，其中包含您可以比較使用的strcmp（我會忽略大小寫問題現在）。也就是說一個簡單的二進制樹

您需要確保樹保持平衡你可以看到AVL樹或1-2棵樹或維基百科或其他地方的紅黑樹

我不會給出太多的細節，除了要創建一個二叉樹結構，每個結點都會有一個可能爲空的左右子節點，對於一個葉子節點，兩個子節點都是空的。爲了使它更簡單，使用具有該值和兩個子節點的「侵入式」節點，如：

struct Node 
{ 
    char * value; 
    size_t frequency; 
    struct Node * left; 
    struct Node * right; 
};

和顯然是C你需要做所有的內存管理。

您將有一個函數向下遞歸樹，根據需要進行比較和向左或向右移動。如果發現它會提高頻率。如果不是，你的函數應該能夠確定插入節點的位置，然後插入和重新平衡邏輯。當然，新節點將包含頻率爲1的字。

最後，您將需要一種方法來通過樹來打印結果。在你的情況下，這可以是一個遞歸函數。

請注意，替代數據結構會是某種散列表。

如果您正在尋找最高效的解決方案並且手頭有大量內存，那麼您將使用一種數據結構，從而在您遇到它時分行瀏覽每個字母。所以「a」給你所有以a開頭的單詞，然後轉到第二個字母「b」等等。對於不知道數據結構的人來說實現起來相當複雜，所以我建議你去用簡單的二叉樹。

請注意，在打印時，它不會與頻率相反，因此您必須先對結果進行排序。（在使用C++的地圖中，你也不會按順序得到它們）。

來源

2012-01-04 17:13:11 CashCow

謝謝CashCow。這個想法很清楚。我會盡力實現它，這對我來說有點困難（從來沒有認真接觸過樹型東西）。順便說一句，任何建議，以便在完全建立之後整理出三個（以便按照freq的順序打印文字）？ – 2012-01-04 17:46:15

@Qiang Xu：在我的回覆中查看我的示例，在其中您將看到如何使用qsort（）。 – Jack 2012-01-04 19:38:06

要對結果進行排序，只需使用qsort，這是C標準的一部分。 – CashCow 2012-01-05 14:53:44

我會爲此使用三元樹。其中數據結構由喬恩·本特利和羅伯特·塞奇威克介紹下面的文章有一個例子C.

http://www.cs.princeton.edu/~rs/strings/

來源

2012-01-04 17:18:03

想知道Jon Bentley的「Programming Peals」第二版是否涵蓋了這個主題？聽說這是一個經典，但沒有機會把它放在手上。 – 2012-01-04 20:39:42

我有它在家 - 我會看看我是否可以記得今晚查找它。 – 2012-01-04 21:40:33

C語言頻率統計（非C++）

回答

相關問題