Ignoring a few different words.. C++?

I'm reading in several documents and indexing the words I read. However, I want to ignore common words (a, an, and, is, or, and so on).

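Is there a shortcut for doing this? Something other than just...

if (word == "and" || word == "is" || etc etc ....) ignore word;

For example, could I somehow put them into a const string and just check against that string? Not sure... thanks!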

2012-04-15 Heather Wilson

Search for 'stop words'... http://databases.aspfaq.com/database/how-do-i-ignore-common-words-in-a-search.html – 2012-04-15 00:44:25

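Create a set<string> containing the words you want to exclude, and use mySet.count(word) to determine whether a word is in the set. If it is, the count will be 1; otherwise it will be 0.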

#include <iostream> 
#include <set> 
#include <string> 
using namespace std; 

int main() { 
    const char *words[] = {"a", "an", "the"}; 
    set<string> wordSet(words, words+3); 
    cerr << wordSet.count("the") << endl; 
    cerr << wordSet.count("quick") << endl; 
    return 0; 
}


2012-04-15 00:47:24 dasblinkenlight

And with C++11, you can even write `set<string> words {"and", "is", ...}`. – Philipp 2012-04-15 00:48:39
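A minimal sketch of the C++11 form mentioned in this comment, assuming a compiler with C++11 enabled. The helper name `isCommonWord` and the word list are illustrative placeholders, not part of the original answer.

```cpp
#include <set>
#include <string>

// Hypothetical helper: returns true when `word` is in the exclusion set.
bool isCommonWord(const std::string& word) {
    // C++11 brace initialization; the words listed here are placeholders.
    static const std::set<std::string> commonWords{"a", "an", "and", "is", "the"};
    return commonWords.count(word) != 0;
}
```

Compared with the array-plus-iterator-range constructor in the answer, this avoids having to keep an element count in sync with the word list.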

This is basically the right answer, but consider the arguments in [Why you shouldn't use set (and what you should use instead)](http://lafstern.org/matt/col1.pdf). – bames53 2012-04-15 00:50:48

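@bames53 That's an interesting argument, but it doesn't claim that set is bad, only that there are more economical alternatives. I think using set is OK in this case: the improvement from replacing it with a sorted vector would be negligible, but explaining that change would take a lot of keystrokes. – dasblinkenlight 2012-04-15 00:59:46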
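For reference, the sorted-vector alternative discussed in these comments might look roughly like the sketch below. It assumes C++11, and `isCommonWordSorted` and the word list are made-up names for illustration only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: looks `word` up in a sorted vector instead of a set.
bool isCommonWordSorted(const std::string& word) {
    // The vector must stay in sorted order for std::binary_search to be valid.
    static const std::vector<std::string> commonWords{"a", "an", "and", "is", "the"};
    return std::binary_search(commonWords.begin(), commonWords.end(), word);
}
```

Lookups stay logarithmic, as with set, but the words sit in one contiguous allocation. For a handful of stop words the difference is unlikely to be noticeable.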

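You can use an array of strings, loop over it and compare against each string, or use a more optimized data structure such as a set or a trie.

Here's an example of how to do it with a plain array: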

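#include <cstring> // for strcmp

// `word` is assumed to be a const char* holding the word being checked
const char *commonWords[] = {"and", "is" /* , ... */}; 
int commonWordsLength = 2; // number of words in the array 

for (int i = 0; i < commonWordsLength; ++i) 
{ 
    // strcmp returns 0 when the two C strings are equal
    if (!strcmp(word, commonWords[i])) 
    { 
        // ignore word; 
        break; 
    } 
}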

Note that this example doesn't use the C++ STL, but you should.
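A rough STL equivalent of the same linear search could look like this. It is only a sketch, assuming C++11; `shouldIgnore` and the two-word list are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: linear search over std::string values,
// the same idea as the strcmp loop above.
bool shouldIgnore(const std::string& word) {
    static const std::vector<std::string> commonWords{"and", "is"};
    return std::find(commonWords.begin(), commonWords.end(), word)
           != commonWords.end();
}
```

Using std::string and std::find removes the manual length variable and the strcmp call, at the same linear cost per lookup.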


2012-04-15 00:47:05 jli

If you want to maximize performance, you should build a trie....

http://en.wikipedia.org/wiki/Trie

...of stop words....

http://en.wikipedia.org/wiki/Stop_words

There is no standard C++ trie data structure, but see this question for third-party implementations...

Trie implementation
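To give an idea of what a trie of stop words involves, here is a deliberately small sketch, assuming C++11. `StopWordTrie` is a made-up class for illustration, not one of the third-party implementations referred to above, and it is not tuned for speed.

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative trie: each node maps a character to a child node.
class StopWordTrie {
public:
    // Adds `word`, creating nodes along the path as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns true only if exactly `word` was inserted (prefixes do not count).
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->isWord;
    }

private:
    struct Node {
        bool isWord = false;
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;
};
```

A lookup walks one node per character, so its cost depends on the length of the word rather than on how many stop words are stored.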

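If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string>, which will put the stop words in a hash table.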

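#include <string>
#include <unordered_set>
using namespace std;

// Returns true if the word should be kept (i.e. it is not a stop word).
bool filter(const string& word) 
{ 
    static unordered_set<string> stopwords({"a", "an", "the"}); 
    return !stopwords.count(word); 
}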
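As a sketch of how such a check could fit into the indexing loop from the question, assuming C++11: the function name `indexWords`, the whitespace-only tokenizing, and the word-count map are all assumptions made for illustration.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical indexing helper: counts each word of `text`
// that is not in the stop-word set.
std::map<std::string, int> indexWords(const std::string& text) {
    static const std::unordered_set<std::string> stopwords{"a", "an", "the"};
    std::map<std::string, int> index;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {                      // split on whitespace only
        if (stopwords.count(word)) continue;  // skip stop words
        ++index[word];
    }
    return index;
}
```

The comparison here is case-sensitive and punctuation is not stripped, so "The" or "the," would not be filtered; real documents would probably need the words normalized first.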


2012-04-15 00:52:11
