Processing NLTK Stanford POS Tagger output

I'm using the NLTK Stanford POS Tagger with my own model to tag lines of sentences in a text file. The output of my tagger looks like this:

sentences = [(Word, WordTag), ..., (Word, WordTag)]

To process Indonesian text, I do two steps after POS tagging:

Stop word removal
Stemming

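I already have a list of stop words in a text file (stopword.txt), and I also want to stem the sentences. So far I have finished the tagging part. I have no idea how to filter the words in sentences by removing those that appear in stopword.txt, or how to stem sentences.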

So far I have tried this code to remove the words, but it still does not remove the words and their tags:

stopWords = getStopWordList('id_stopword.txt') 
filtered_sentences = [w for w in sentences if not w in stopWords] 
    filtered_sentences = [] 
    for w in sentences: 
     if w not in stopWords: 
      filtered_sentences.append(w)

2017-06-04 Fregy

I have to assume that your function getStopWordList() correctly returns a list of strings. (Have you verified that?)
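Since getStopWordList() was not posted, here is a minimal sketch of what such a loader and a quick sanity check might look like. The names load_stop_words and looks_like_word_list are hypothetical, and the one-word-per-line file layout is an assumption, not something the question confirms.

```python
def load_stop_words(path):
    """Read a stop word file (assumed: one word per line) into a set of strings."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and normalise surrounding whitespace and case.
        return {line.strip().lower() for line in handle if line.strip()}


def looks_like_word_list(words):
    """Return True if the collection is non-empty and holds only strings."""
    return len(words) > 0 and all(isinstance(w, str) for w in words)
```

Calling looks_like_word_list(getStopWordList('id_stopword.txt')) once would answer the question above. A set is used here because membership tests against it are faster than against a list.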

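The code you posted will not run because it has an indentation error. But the indented part does not matter, because you don't need it: it apparently just repeats the logic of the line before it. So I have ignored it.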

To filter the way you need, change this:

filtered_sentences = [w for w in sentences if not w in stopWords]

to this:

filtered_sentences = [(w,t) for (w,t) in sentences if not w in stopWords]
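The reason the first version removes nothing is that each element of sentences is a (word, tag) tuple, and a tuple is never equal to a plain string in the stop word list. A small sketch with made-up data shows the difference; the sample words, the tags, and the stop word set are hypothetical, and sentences is assumed to be a flat list of tuples.

```python
# Hypothetical tagger output: a flat list of (word, tag) tuples.
sentences = [("saya", "PRP"), ("dan", "CC"), ("dia", "PRP"),
             ("makan", "VB"), ("nasi", "NN")]
# Hypothetical stop words; in the question these come from getStopWordList().
stop_words = {"dan", "yang", "di"}

# Original attempt: compares whole tuples to strings, so nothing is removed.
unchanged = [w for w in sentences if w not in stop_words]

# Corrected version: unpack each tuple and test only the word.
filtered_sentences = [(w, t) for (w, t) in sentences if w not in stop_words]

print(unchanged == sentences)  # True
print(filtered_sentences)
# [('saya', 'PRP'), ('dia', 'PRP'), ('makan', 'VB'), ('nasi', 'NN')]
```

If sentences is instead a list of sentences, each one a list of tuples, the same comprehension would need to be applied to each inner list.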

2017-06-04 10:05:58 BoarGules

Yes, getStopWordList() returns a list of strings. Thanks. The remaining problem now is stemming. Can I use the code below to stem filtered_sentences? – Fregy

Post a separate question about stemming. Your code did not make it into your comment; it was probably too long. – BoarGules
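For the stemming step raised in the comments, the same tuple-unpacking pattern would likely apply. The following is only a sketch: stem_tagged and toy_stem are hypothetical names, and toy_stem is a placeholder that strips a single suffix, not a real Indonesian stemmer. Any callable that maps a word to its stem could be passed in its place.

```python
def stem_tagged(tagged_words, stem):
    """Apply stem() to each word while keeping its tag unchanged."""
    return [(stem(w), t) for (w, t) in tagged_words]


def toy_stem(word):
    # Placeholder only: drops a trailing "nya" from longer words.
    # A real stemmer for Indonesian would replace this function.
    if word.endswith("nya") and len(word) > 5:
        return word[:-3]
    return word


# Hypothetical filtered output from the stop word step.
filtered_sentences = [("bukunya", "NN"), ("makan", "VB")]
stemmed = stem_tagged(filtered_sentences, toy_stem)
print(stemmed)  # [('buku', 'NN'), ('makan', 'VB')]
```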
