2015-04-06 63 views

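Single label for a whole document using SVM

I would like to know how to train an SVM that takes a whole document as input and assigns a single label to that document. So far I have only labeled text word by word. For example, an input document may contain 6 to 10 sentences, and the whole document would be given a single class label for training.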

Answer


The basic approach is as follows:

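  1. Create a list of training documents and a list of their labels/classes.
  2. Tokenize your training documents.
  3. Remove stopwords from the documents.
  4. Compute TF-IDF values for your documents.
  5. Limit the TF-IDF features to the N most common terms, e.g. N = 1000.
  6. Train an SVM on the limited TF-IDF data and your labels.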
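Steps 3 to 5 can be sketched with scikit-learn's `TfidfVectorizer`. This is only a minimal illustration: the two toy documents and the small cap `N = 5` are assumptions chosen so that the effect is visible. Note that `max_features` keeps the terms that occur most often across the corpus, which is one possible reading of "the N most common" in step 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents (hypothetical, for illustration only)
docs = [
    "The fox jumped over the fence .",
    "New York City is full of skyscrapers .",
]

N = 5  # assumed cap on the vocabulary size; the list above suggests e.g. 1000

# stop_words="english" drops common English words (step 3);
# max_features=N keeps the N terms with the highest corpus frequency (step 5).
vectorizer = TfidfVectorizer(stop_words="english", max_features=N)

# fit_transform learns the vocabulary and returns a sparse TF-IDF matrix (step 4)
X = vectorizer.fit_transform(docs)

print(X.shape)                         # (number of documents, number of kept terms)
print(sorted(vectorizer.vocabulary_))  # the terms that survived the filtering
```

The resulting matrix `X` is what the SVM in step 6 would be trained on, together with one label per document.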

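You then have a classifier that maps documents in TF-IDF format to class labels. To classify a test document, convert it to TF-IDF with the same vectorizer and pass the result to the classifier.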

Here is an example using Python and scikit-learn of an SVM that classifies documents as being about either foxes or cities:

from sklearn import svm 
from sklearn.feature_extraction.text import TfidfVectorizer 

# Training examples (already tokenized, 6x fox and 6x city) 
docs_train = [ 
    "The fox jumped over the fence .", 
    "The fox sleeps under the tree .", 
    "A fox walks through the high grass .", 
    "Didn 't see a single fox today .", 
    "I saw a fox yesterday near the lake .", 
    "You might encounter foxes at the lake .", 

    "New York City is full of skyscrapers .", 
    "Los Angeles is a city on the west coast .", 
    "I 've been to Los Angeles before .", 
    "Let 's travel to Mexico City .", 
    "There are no skyscrapers in Washington .", 
    "Washington is a beautiful city ." 
] 

# Test examples (already tokenized, 2x fox and 2x city) 
docs_test = [ 
    "There 's a fox in the garden .", 
    "Did you see the fox next to the tree ?", 
    "What 's the shortest way to Los Alamos ?", 
    "Traffic in New York is a pain" 
] 

# Labels of training examples (6x fox and 6x city) 
y_train = ["fox", "fox", "fox", "fox", "fox", "fox", 
      "city", "city", "city", "city", "city", "city"] 

# Convert training and test examples to TFIDF 
# The vectorizer also removes stopwords and converts the texts to lowercase. 
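# min_df=1 keeps every term that appears in at least one document
# (recent scikit-learn versions reject an integer min_df of 0).
vectorizer = TfidfVectorizer(max_df=1.0, max_features=10000,
                             min_df=1, stop_words='english')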

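# Learn the vocabulary and IDF weights. Fitting on train + test keeps this toy
# example simple; on real data, fit on docs_train only so that no information
# from the test documents leaks into the features.
vectorizer.fit(docs_train + docs_test)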

X_train = vectorizer.transform(docs_train) 
X_test = vectorizer.transform(docs_test) 

# Train an SVM on TFIDF data of the training documents 
clf = svm.SVC() 
clf.fit(X_train, y_train) 

# Test the SVM on TFIDF data of the test documents 
print(clf.predict(X_test))

The output is as expected (2x fox and 2x city):

['fox' 'fox' 'city' 'city']
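As a variation, the vectorizer and the SVM can be chained in a scikit-learn `Pipeline`, so that the TF-IDF vocabulary is learned from the training documents only and new documents are converted automatically before prediction. The sketch below is not part of the original answer: the shortened training set and the choice of `SVC(kernel="linear")` (a linear kernel is a common choice for sparse text features) are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Shortened training set (a subset of the documents used above)
docs_train = [
    "The fox jumped over the fence .",
    "I saw a fox yesterday near the lake .",
    "New York City is full of skyscrapers .",
    "Washington is a beautiful city .",
]
y_train = ["fox", "fox", "city", "city"]

# The pipeline fits the vectorizer on docs_train only, then trains the SVM
# on the resulting TF-IDF matrix. The linear kernel is an assumption here.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SVC(kernel="linear"),
)
model.fit(docs_train, y_train)

# Raw text goes in; the pipeline applies the fitted TF-IDF transform itself
new_docs = [
    "There 's a fox in the garden .",
    "Traffic in New York is a pain",
]
predictions = model.predict(new_docs)
print(predictions)  # one label per input document
```

Words in `new_docs` that never occurred in the training documents are simply ignored by the fitted vectorizer, which is why a larger and more varied training set matters in practice.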