Using an SVM to assign a single label to a whole document

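I would like to know how to train an SVM that takes a whole document as input and assigns a single label to that document. Until now I have only labeled individual words. For example, an input document may contain 6 to 10 sentences, and for training the whole document would be labeled with a single class.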

2015-04-06 Sreetha

The basic approach is as follows:

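1. Create a list of your training documents and their labels/classes.
2. Tokenize your training documents.
3. Remove stop words from the documents.
4. Compute TF-IDF values for your documents.
5. Limit the TF-IDF values to the N most common terms, e.g. N = 1000.
6. Train an SVM on the reduced TF-IDF data and your labels.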
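
As a rough sketch, the steps above can be chained in a single scikit-learn `Pipeline`. The toy documents, the variable names, and the choice of `LinearSVC` are assumptions made for illustration; `max_features=1000` stands in for N, keeping at most the 1000 terms that are most frequent across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: one string per document, one label per document.
train_docs = [
    "The fox sleeps under the tree. It woke up at dusk.",
    "A fox walks through the high grass near the lake.",
    "New York City is full of skyscrapers and traffic.",
    "Los Angeles is a large city on the west coast.",
]
train_labels = ["fox", "fox", "city", "city"]

# The vectorizer tokenizes, drops English stop words, computes TF-IDF,
# and keeps at most the 1000 most frequent terms. The SVM is trained on
# the resulting matrix and the document labels.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=1000)),
    ("svm", LinearSVC()),
])
model.fit(train_docs, train_labels)

# Each whole document receives exactly one predicted label.
predictions = model.predict(["I saw a fox yesterday near the lake."])
```

Wrapping both stages in one object means a multi-sentence document can be passed in as a plain string at prediction time, without calling the vectorizer separately.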

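You then have a classifier that maps documents in TF-IDF form to class labels. You can therefore classify a test document after converting it to the same TF-IDF representation.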
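
"The same TF-IDF representation" means reusing the vocabulary and IDF weights learned earlier rather than computing new ones for the test document. A minimal sketch, with made-up documents and variable names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents for illustration only.
train_docs = ["The fox sleeps under the tree.", "Washington is a beautiful city."]
new_docs = ["A squirrel sleeps in the city park."]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new = vectorizer.transform(new_docs)          # reuses them without refitting

# Both matrices have the same columns, so a classifier trained on X_train
# can score X_new. Words that were not in the fitted vocabulary, such as
# "squirrel", are simply ignored.
same_width = X_train.shape[1] == X_new.shape[1]
```

Calling `fit_transform` on the new document instead would produce a different column layout, and the trained SVM could not interpret it.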

Here is an example in Python with scikit-learn of an SVM that classifies documents as being about either foxes or cities:

from sklearn import svm 
from sklearn.feature_extraction.text import TfidfVectorizer 

# Training examples (already tokenized, 6x fox and 6x city) 
docs_train = [ 
    "The fox jumped over the fence .", 
    "The fox sleeps under the tree .", 
    "A fox walks through the high grass .", 
    "Didn 't see a single fox today .", 
    "I saw a fox yesterday near the lake .", 
    "You might encounter foxes at the lake .", 

    "New York City is full of skyscrapers .", 
    "Los Angeles is a city on the west coast .", 
    "I 've been to Los Angeles before .", 
    "Let 's travel to Mexico City .", 
    "There are no skyscrapers in Washington .", 
    "Washington is a beautiful city ." 
] 

# Test examples (already tokenized, 2x fox and 2x city) 
docs_test = [ 
    "There 's a fox in the garden .", 
    "Did you see the fox next to the tree ?", 
    "What 's the shortest way to Los Alamos ?", 
    "Traffic in New York is a pain" 
] 

# Labels of training examples (6x fox and 6x city) 
y_train = ["fox", "fox", "fox", "fox", "fox", "fox", 
      "city", "city", "city", "city", "city", "city"] 

# Convert training and test examples to TFIDF 
# The vectorizer also removes stopwords and converts the texts to lowercase. 
# min_df=1 keeps every term that occurs in at least one document.
vectorizer = TfidfVectorizer(max_df=1.0, max_features=10000, 
          min_df=1, stop_words='english') 

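# The vocabulary is fitted on training and test texts together for this small demo.
# With real data, fit on the training documents only so nothing leaks from the test set.
vectorizer.fit(docs_train + docs_test) 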

X_train = vectorizer.transform(docs_train) 
X_test = vectorizer.transform(docs_test) 

# Train an SVM on TFIDF data of the training documents 
clf = svm.SVC() 
clf.fit(X_train, y_train) 

# Test the SVM on TFIDF data of the test documents 
print(clf.predict(X_test))

The output is as expected (2x fox and 2x city):

['fox' 'fox' 'city' 'city']


2015-04-06 17:12:54 aleju
