2017-02-04 70 views

I want to extract entities from a query. How do I train OpenNLP for a custom NameFinder model?

I have a custom NameFinder model.

The queries look like this:

result for roll number 1304510020.
result for roll-number 1304510020.
result for rollnumber 1304510020.
result of rollnumber 1304510020.
result of roll number 1304510020.
result of roll-number 1304510020.
roll number 1304510020 result.
rollnumber 1304510020 result.
roll-number 1304510020 result.
show result of roll number 1304510020.
show result of rollnumber 1304510020.
show result of roll-number 1304510020.
show my result for 1304510020.
result of 1304510020.

This is my training code:

package nlpParser;

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class Trainer {

    // training data set
    static String trainingPath =
            "C:\\Users\\MujeebulHasan\\Desktop\\Project\\hbtu\\hbtuaiagent\\Source Code\\parser\\training\\";

    public static void main(String[] args) throws IOException {

        String[] entities = new String[]{"rollnumber", "result"};
        String[] pathsOfTraingFile = new String[]{"rollnumber\\rollnumber.train", "result\\result.train"};
        String[] pathsOfTrainedFile = new String[]{"rollnumber\\rollnumber.bin", "result\\result.bin"};

        // Train and save one model per entity type
        for (int i = 0; i < entities.length; i++) {
            final int j = i;
            InputStreamFactory isf = new InputStreamFactory() {
                public InputStream createInputStream() throws IOException {
                    return new FileInputStream(trainingPath + pathsOfTraingFile[j]);
                }
            };
            Charset charset = Charset.forName("UTF-8");
            ObjectStream<String> lineStream = new PlainTextByLineStream(isf, charset);
            ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
            TokenNameFinderModel model;
            TokenNameFinderFactory nameFinderFactory = new TokenNameFinderFactory();
            try {
                model = NameFinderME.train("en", entities[i], sampleStream, TrainingParameters.defaultParams(),
                        nameFinderFactory);
            } finally {
                sampleStream.close();
            }
            BufferedOutputStream modelOut = null;
            try {
                modelOut = new BufferedOutputStream(new FileOutputStream(trainingPath + pathsOfTrainedFile[i]));
                model.serialize(modelOut);
            } finally {
                if (modelOut != null)
                    modelOut.close();
            }
        }
    }
}

rollnumber.train

result for roll number <START:rollnumber> 1304510020 <END> .
result for roll-number <START:rollnumber> 1304510020 <END> .
result for rollnumber <START:rollnumber> 1304510020 <END> .
result for roll <START:rollnumber> 1304510020 <END> .
result of rollnumber <START:rollnumber> 1304510020 <END> .
result of roll number <START:rollnumber> 1304510020 <END> .
result of roll-number <START:rollnumber> 1304510020 <END> .
result of roll <START:rollnumber> 1304510020 <END> .
roll number <START:rollnumber> 1304510020 <END> result.
rollnumber <START:rollnumber> 1304510020 <END> result.
roll-number <START:rollnumber> 1304510020 <END> result.
roll <START:rollnumber> 1304510020 <END> result.
show result of roll number <START:rollnumber> 1304510020 <END> .
show result of rollnumber <START:rollnumber> 1304510020 <END> .
show result of roll-number <START:rollnumber> 1304510020 <END> .
show result of roll <START:rollnumber> 1304510020 <END> .
show my result for <START:rollnumber> 1304510020 <END> .
result of <START:rollnumber> 1304510020 <END> .
result for <START:rollnumber> 1304510020 <END> .
what is my result for rollnumber <START:rollnumber> 1304510020 <END> .
what is my result of rollnumber <START:rollnumber> 1304510020 <END> .
what is my result for roll <START:rollnumber> 1304510020 <END> .

result.train

<START:result> result <END> for roll number 1304510020.
<START:result> result <END> for roll-number 1304510020.
<START:result> result <END> for rollnumber 1304510020.
<START:result> result <END> of rollnumber 1304510020.
<START:result> result <END> of roll number 1304510020.
<START:result> result <END> of roll-number 1304510020.
roll number 1304510020 <START:result> result <END> .
rollnumber 1304510020 <START:result> result <END> .
roll-number 1304510020 <START:result> result <END> .
show <START:result> result <END> of roll number 1304510020.
show <START:result> result <END> of rollnumber 1304510020.
show <START:result> result <END> of roll-number 1304510020.
show my <START:result> result <END> for 1304510020.
<START:result> result <END> of 1304510020.

I test it with this code:

package nlpParser;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class GetEntities {

    public static void main(String[] args) throws IOException {
        Scanner sc = new Scanner(System.in);
        GetEntities obj = new GetEntities();
        // Read queries until end of input or a blank line
        while (sc.hasNextLine()) {
            String query = sc.nextLine();
            if (query.trim().isEmpty()) {
                break;
            }
            obj.parse(query);
        }
        sc.close();
    }

    public void parse(String query) throws IOException {
        String[] entities = new String[]{"rollnumber", "result"};
        String[] pathsOfTrainedFile = new String[]{"rollnumber\\rollnumber.bin", "result\\result.bin"};

        // Run the query through each entity model in turn
        for (int i = 0; i < entities.length; i++) {
            // Loading the NER model; try-with-resources closes the stream
            try (InputStream inputStream = new FileInputStream(
                    "C:\\Users\\MujeebulHasan\\Desktop\\Project\\hbtu\\hbtuaiagent\\Source Code\\parser\\training\\"
                            + pathsOfTrainedFile[i])) {
                TokenNameFinderModel model = new TokenNameFinderModel(inputStream);

                // Instantiating the NameFinder class
                NameFinderME nameFinder = new NameFinderME(model);

                // Finding the names in the sentence
                System.out.println("Processing query... ");
                System.out.print("Query = " + query);
                query = query.replace(".", "");
                String[] sentence = query.split(" ");
                System.out.println();
                System.out.println("RESULT :");
                Span[] nameSpans = nameFinder.find(sentence);

                // Printing the spans of the names in the sentence
                for (Span s : nameSpans) {
                    System.out.println(s.toString());
                    System.out.println(sentence[s.getStart()]);
                }
            }
        }
    }
}

It gives the following results, which are sometimes wrong.

result of roll number 1304510020
Processing query...
Query = result of roll number 1304510020
RESULT :
Processing query...
Query = result of roll number 1304510020
RESULT :
[0..1) result
result

show result for roll number 1304510020
Processing query...
Query = show result for roll number 1304510020
RESULT :
Processing query...
Query = show result for roll number 1304510020
RESULT :
[1..2) result
result

result for rollnumber 1304510020
Processing query...
Query = result for rollnumber 1304510020
RESULT :
[3..4) rollnumber
1304510020
Processing query...
Query = result for rollnumber 1304510020
RESULT :
[0..1) result
result

result 1304510020
Processing query...
Query = result 1304510020
RESULT :
Processing query...
Query = result 1304510020
RESULT :
[0..1) result
result

1304510020 result
Processing query...
Query = 1304510020 result
RESULT :
Processing query...
Query = 1304510020 result
RESULT :
[1..2) result
result

Answer


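This happens because your training data set is too small. According to the OpenNLP documentation, the training data should contain roughly 15,000 sentences (one per line) to get a model that performs well.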

If you don't have that much data, you can simply use a regular expression in your case, which is much easier than training a model.
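As a rough illustration of the regex route, here is a minimal sketch using only the Java standard library. It assumes a roll number is always a standalone run of exactly 10 digits, as in the sample queries; the class name `RollNumberRegex`, both patterns, and the method names are hypothetical and would need adjusting to the real roll-number format.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RollNumberRegex {

    // Assumption: a roll number is exactly 10 digits, not embedded in a longer digit run.
    private static final Pattern ROLL_NUMBER = Pattern.compile("(?<!\\d)\\d{10}(?!\\d)");

    // Assumption: the "result" entity is simply the whole word "result", in any case.
    private static final Pattern RESULT_WORD =
            Pattern.compile("\\bresult\\b", Pattern.CASE_INSENSITIVE);

    // Returns the first 10-digit roll number in the query, if there is one.
    static Optional<String> extractRollNumber(String query) {
        Matcher m = ROLL_NUMBER.matcher(query);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Returns true when the query contains the word "result".
    static boolean asksForResult(String query) {
        return RESULT_WORD.matcher(query).find();
    }

    public static void main(String[] args) {
        String query = "show result of roll-number 1304510020.";
        System.out.println("asks for result: " + asksForResult(query));
        System.out.println("roll number: " + extractRollNumber(query).orElse("none"));
    }
}
```

Unlike the trained model, this does not depend on the words around the number, so queries such as "result 1304510020" and "1304510020 result" are handled the same way.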

If you are willing to build a larger training data set, you can follow this, or again use a regex to tag a very large corpus.
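Tagging a corpus with a regex could look roughly like the sketch below, which wraps each roll number in the `<START:rollnumber> ... <END>` markup used in rollnumber.train. It again assumes 10-digit roll numbers; the class `RegexCorpusTagger` and its methods are hypothetical names, and the raw sentences would come from your own corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexCorpusTagger {

    // Assumption: a roll number is a standalone run of exactly 10 digits.
    private static final Pattern ROLL_NUMBER = Pattern.compile("(?<!\\d)(\\d{10})(?!\\d)");

    // Wraps every roll number in OpenNLP name-finder markup and
    // splits sentence-final punctuation into its own token.
    static String tag(String sentence) {
        String tagged = ROLL_NUMBER.matcher(sentence)
                .replaceAll(" <START:rollnumber> $1 <END> ");
        return tagged
                .replaceAll("([.?!])\\s*$", " $1") // "result." becomes "result ."
                .replaceAll("\\s+", " ")           // collapse repeated whitespace
                .trim();
    }

    // Tags a whole list of raw sentences, one training line per sentence.
    static List<String> tagAll(List<String> sentences) {
        List<String> lines = new ArrayList<>();
        for (String sentence : sentences) {
            lines.add(tag(sentence));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "result of roll number 1304510020.",
                "roll number 1304510020 result.");
        for (String line : tagAll(raw)) {
            System.out.println(line);
        }
    }
}
```

The generated lines can be written to a .train file and fed to the same Trainer class. Varying the roll numbers and the surrounding wording in the corpus matters more than repeating the same sentence many times.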

Hope this helps!
