Stanford NLP: training an n-gram NER

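I have recently been trying to train n-gram entities with Stanford CoreNLP. I am following this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I can only specify unigram tokens and the class each one belongs to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat dataset.

If I have misinterpreted the Stanford tutorial and it can in fact be used for n-gram training, please guide me.

What I am stuck on is the following property: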

#structure of your training file; this tells the classifier 
#that the word is in column 0 and the correct answer is in 
#column 1 
map = word=0,answer=1

Here the first column is the word (a unigram) and the second column is the entity, for example:

CHAPTER O 
I O 
Emma PERS 
Woodhouse PERS

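Now, if I need to train known entities (say, movie names) such as Hulk or Titanic as movies, this approach makes it easy. But what is the best approach if I need to train on I Know What You Did Last Summer or Baby's Day Out?

Source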

2013-03-25 Arun A K

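Dear @Arun, did you succeed in training NER for n-grams? I want to train entities such as Master of Science: Education and PhD in Electronics: Education. Can you guide me? Thanks – 2017-01-19 13:43:27

@KhalidUsman, thank you for your support. I achieved this with LingPipe, as described in the answer below. My training dataset was fairly large. Any model should work well; it depends on how good the dataset you provide is. – 2017-01-19 16:48:32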

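I waited a long time for an answer here and have not been able to work out a way to do this with Stanford CoreNLP. The task is done, however: I used the LingPipe NLP library instead. I am posting the answer here because I think others can benefit from it.

Whether you are a developer, a researcher, or anyone else planning an implementation, please check the LingPipe licensing first.

LingPipe provides several NER methods: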

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

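I have used both the dictionary approach and the statistical approach.

The first is a direct lookup method; the second is based on training.

An example of dictionary-based NER can be found here.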
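To make the "direct lookup" idea concrete, here is a minimal plain-Java sketch. It is not LingPipe's API, just an illustration of longest-match dictionary lookup over whitespace tokens. The class name, the dictionary contents and the `maxTokens` parameter are all assumptions of mine, and punctuation attached to a word (for example "Titanic.") would not match in this simplified version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of dictionary-based NER; NOT LingPipe's API.
class DictionaryNerSketch {

    // Hypothetical dictionary: lower-cased phrase -> entity type.
    static Map<String, String> movieDictionary() {
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("hulk", "MOVIE");
        dict.put("titanic", "MOVIE");
        dict.put("i know what you did last summer", "MOVIE");
        dict.put("baby's day out", "MOVIE");
        return dict;
    }

    // Scans left to right and, at each token, tries the longest phrase first
    // (up to maxTokens tokens). Matches are reported as "phrase >> TYPE".
    static List<String> findEntities(String text, Map<String, String> dict, int maxTokens) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedLength = 0;
            for (int n = Math.min(maxTokens, tokens.length - i); n >= 1; n--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                String type = dict.get(candidate.toLowerCase(Locale.ROOT));
                if (type != null) {
                    found.add(candidate + " >> " + type);
                    matchedLength = n;
                    break;
                }
            }
            // Skip past the match, or move one token forward if nothing matched.
            i += Math.max(matchedLength, 1);
        }
        return found;
    }

    public static void main(String[] args) {
        String chat = "did you watch I Know What You Did Last Summer or Titanic";
        for (String hit : findEntities(chat, movieDictionary(), 7)) {
            System.out.println(hit);
        }
    }
}
```

Because the lookup works on whole phrases, multi-word titles need no special handling, but only titles already in the dictionary can ever be found.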

The statistical approach requires a training file. I used a file in the following format:

<root> 
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX> to be trained</s> 
... 
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX> annotated </s> 
</root>
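For anyone unsure how to prepare such a file, the following is a rough sketch of how it could be generated from plain sentences and a list of known entity names. This is only my assumption about a workable preparation step, not part of LingPipe; the class and method names are hypothetical. It matches entity strings literally and case-sensitively, so it would also tag a name that appears inside a longer word.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that writes sentences in the <s>/<ENAMEX> format shown above.
class EnamexCorpusSketch {

    // Escapes the characters that would otherwise break the XML.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps each literal occurrence of a known entity in an ENAMEX tag,
    // preferring the longest entity that starts at a given position.
    static String annotate(String sentence, List<String> entities, String type) {
        StringBuilder out = new StringBuilder("<s>");
        int i = 0;
        while (i < sentence.length()) {
            String best = null;
            for (String e : entities) {
                if (!e.isEmpty() && sentence.startsWith(e, i)
                        && (best == null || e.length() > best.length())) {
                    best = e;
                }
            }
            if (best != null) {
                out.append("<ENAMEX TYPE=\"").append(type).append("\">")
                   .append(escapeXml(best)).append("</ENAMEX>");
                i += best.length();
            } else {
                out.append(escapeXml(String.valueOf(sentence.charAt(i))));
                i++;
            }
        }
        return out.append("</s>").toString();
    }

    // Wraps all annotated sentences in a single <root> element.
    static String buildCorpus(List<String> sentences, List<String> entities, String type) {
        StringBuilder out = new StringBuilder("<root>\n");
        for (String sentence : sentences) {
            out.append(annotate(sentence, entities, type)).append('\n');
        }
        return out.append("</root>").toString();
    }

    public static void main(String[] args) {
        List<String> movies = Arrays.asList("Hulk", "Baby's Day Out");
        List<String> chat = Arrays.asList(
            "have you seen Baby's Day Out yet",
            "Hulk was on tv last night");
        System.out.println(buildCorpus(chat, movies, "MOVIE"));
    }
}
```

The output can be saved as the annotated input file for training. A larger and more varied set of sentences should give the model more context to learn from.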

Then I used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    // Maximum character n-gram length used by the HMM's language models.
    static final int MAX_N_GRAM = 50;
    // Number of distinct characters assumed by the language models.
    static final int NUM_CHARS = 300;
    // Interpolation ratio; using the n-gram length is the default behavior.
    static final double LM_INTERPOLATION = MAX_N_GRAM;

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt"); // my annotated file
        File modelFile = new File("outputmodelfile.model");

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory, hmmEstimator);

        // The parser reads the ENAMEX-annotated sentences and feeds them to the estimator.
        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        // Compile the trained estimator into a model file that can be loaded later.
        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
    }
}

To test the NER, I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        // Load the model compiled by TrainEntities.
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);

        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> chunks = chunking.chunkSet();

        // Each chunk is a character span in the input plus its entity type.
        for (Chunk c : chunks) {
            System.out.println(testString + " : "
                + testString.substring(c.start(), c.end()) + " >> "
                + c.type());
        }
    }
}

Code courtesy: Google :)

Source

2013-04-15 14:15:43

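http://tech.groups.yahoo.com/group/LingPipe/message/68 provides more information about preparing the corpus. – 2013-05-10 05:50:20

I tried the same code too. Could you explain how you prepared the training set? I added it as a text file and tried to add my own entities, but it does not work... please help me. I am not sure whether I have misunderstood the training set. – lulu 2014-04-19 17:28:53

The US Airways flight attendant, on a short flight to Charlotte, NC, kept peeking at a seat in row 21 at the back of the plane, making the 9-month-old's laughter turn into a 9-month-old's smile. – lulu 2014-04-19 17:32:23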

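The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding and assume that adjacent tokens of the same class are part of the same entity. In many circumstances this is almost always true, and it keeps the models simpler. However, if you don't want to do that, you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things like this: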

Emma B-PERSON 
Woodhouse I-PERSON

Then adjacent tokens that belong to the same category but not to the same entity can be represented.
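The difference between the two encodings can be sketched in a few lines of plain Java. This is only an illustration with hypothetical class and method names, and it does not use any Stanford NER classes: `mergeIo` shows how IO labels turn a run of adjacent same-class tokens into one multi-word entity, and `iobRows` shows the token-per-line rows (tab-separated here, which is an assumption about the file layout) that an IOB-style training file would contain for a multi-word title.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; these helpers are not part of Stanford NER.
class LabelEncodingSketch {

    // IO decoding: adjacent tokens with the same non-O label form one entity.
    static List<String> mergeIo(String[] tokens, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (int i = 0; i < tokens.length; i++) {
            if (!labels[i].equals(currentLabel)) {
                // The label changed, so close any entity that was open.
                if (!currentLabel.equals("O")) {
                    entities.add(current + " >> " + currentLabel);
                }
                current.setLength(0);
                currentLabel = labels[i];
            }
            if (!labels[i].equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens[i]);
            }
        }
        if (!currentLabel.equals("O")) {
            entities.add(current + " >> " + currentLabel);
        }
        return entities;
    }

    // IOB encoding: B- marks the first token of an entity, I- the remaining tokens.
    static List<String> iobRows(String entity, String type) {
        String[] tokens = entity.trim().split("\\s+");
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            rows.add(tokens[i] + "\t" + (i == 0 ? "B-" : "I-") + type);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] tokens = {"CHAPTER", "I", "Emma", "Woodhouse"};
        String[] labels = {"O", "O", "PERS", "PERS"};
        System.out.println(mergeIo(tokens, labels));
        for (String row : iobRows("I Know What You Did Last Summer", "MOVIE")) {
            System.out.println(row);
        }
    }
}
```

With IO labels, two different movie titles that happen to sit next to each other would be merged into one entity; the B- prefix is what lets IOB keep them apart.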

Source

2013-07-10 03:40:49

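Thanks @Chris, let me try creating a new model with this encoding format. – 2013-07-11 06:19:13

@ChristopherManning How do I enable IOB encoding in NER? Thx – 2014-01-30 21:54:17

I have provided a discussion of the IOB encoding options in my answer to this question: http://stackoverflow.com/questions/21469082/how-do-i-use-iob-tags-with-stanford-ner – 2014-02-23 03:58:46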

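I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using regexner in the NLP pipeline, supplying a mapping file with the regular expressions (the component terms of each n-gram) and their corresponding labels. Note that no machine-learned NER is involved in this case. Hope this information helps someone!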
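As a rough sketch of what such a mapping file could look like, the plain-Java helper below turns phrase/label pairs into lines with the phrase's tokens separated by spaces, then a tab, then the label. The class name, the example phrases and the labels are hypothetical. Since each token in the file is treated as a regular expression, phrases containing regex metacharacters would need escaping, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that prepares lines for a regexner mapping file.
class RegexNerMappingSketch {

    // One line per phrase: tokens separated by single spaces, a tab, then the label.
    static List<String> mappingLines(Map<String, String> phraseToLabel) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : phraseToLabel.entrySet()) {
            String phrase = entry.getKey().trim().replaceAll("\\s+", " ");
            lines.add(phrase + "\t" + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Example phrases and labels are made up for illustration.
        Map<String, String> phrases = new LinkedHashMap<>();
        phrases.put("anti lock  braking system", "CAR_PART");
        phrases.put("cruise control", "CAR_FEATURE");
        for (String line : mappingLines(phrases)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines can be written to a text file and passed to the pipeline as the regexner mapping.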

Source

2016-10-04 04:40:34
