I would like to know how to train an SVM that takes an entire document as input and assigns a single label to that document. Until now I have only labeled individual words. For example, an input document may contain 6 to 10 sentences, and the whole document should be labeled with a single category for training. How do I use an SVM to assign a single label to an entire document?
The basic approach is as follows: convert each document into a TF-IDF vector and train an SVM on those vectors. You then have a classifier that maps documents in TF-IDF format to class labels, so you can classify a test document after converting it to the same TF-IDF representation.
Here is an example in Python with scikit-learn of an SVM that classifies documents as being about either foxes or cities:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
# Training examples (already tokenized, 6x fox and 6x city)
docs_train = [
    "The fox jumped over the fence .",
    "The fox sleeps under the tree .",
    "A fox walks through the high grass .",
    "Didn 't see a single fox today .",
    "I saw a fox yesterday near the lake .",
    "You might encounter foxes at the lake .",
    "New York City is full of skyscrapers .",
    "Los Angeles is a city on the west coast .",
    "I 've been to Los Angeles before .",
    "Let 's travel to Mexico City .",
    "There are no skyscrapers in Washington .",
    "Washington is a beautiful city ."
]
# Test examples (already tokenized, 2x fox and 2x city)
docs_test = [
    "There 's a fox in the garden .",
    "Did you see the fox next to the tree ?",
    "What 's the shortest way to Los Alamos ?",
    "Traffic in New York is a pain"
]
# Labels of training examples (6x fox and 6x city)
y_train = ["fox", "fox", "fox", "fox", "fox", "fox",
           "city", "city", "city", "city", "city", "city"]
# Convert training and test examples to TFIDF
# The vectorizer also removes stopwords and converts the texts to lowercase.
vectorizer = TfidfVectorizer(max_df=1.0, max_features=10000,
                             min_df=0, stop_words='english')
vectorizer.fit(docs_train + docs_test)
X_train = vectorizer.transform(docs_train)
X_test = vectorizer.transform(docs_test)
# Train an SVM on TFIDF data of the training documents
clf = svm.SVC()
clf.fit(X_train, y_train)
# Test the SVM on TFIDF data of the test documents
print(clf.predict(X_test))
The output is as expected (2x fox and 2x city):
['fox' 'fox' 'city' 'city']
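One caveat: the example above fits the vectorizer on docs_train + docs_test, so the test documents influence the TF-IDF vocabulary and statistics. A common way to avoid that (this is a sketch of an alternative, not part of the answer above) is to chain the vectorizer and classifier in a scikit-learn Pipeline, so that fit() only ever sees training data. The sketch below also uses LinearSVC, which is a frequent choice for sparse text features, instead of the default RBF-kernel SVC:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# A smaller version of the same toy data
docs_train = [
    "The fox jumped over the fence .",
    "The fox sleeps under the tree .",
    "A fox walks through the high grass .",
    "I saw a fox yesterday near the lake .",
    "New York City is full of skyscrapers .",
    "Los Angeles is a city on the west coast .",
    "Let 's travel to Mexico City .",
    "Washington is a beautiful city ."
]
y_train = ["fox"] * 4 + ["city"] * 4

docs_test = [
    "There 's a fox in the garden .",
    "Traffic in New York is a pain"
]

# The pipeline fits the vectorizer and the SVM together on the
# training data only; at predict time the test documents are
# transformed with the vocabulary learned from training.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])
pipe.fit(docs_train, y_train)
preds = pipe.predict(docs_test)
print(preds)
```

The behavior is the same as the two-step version, but there is no vocabulary leakage from the test set, and the whole pipeline can be cross-validated or grid-searched as a single estimator.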