2017-10-12 277 views
1

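Adding training data to an existing model (bin file)

I am trying to add extra training data to the nl-personTest.bin file with OpenNLP. My problem is that when I run my code to add the extra training data, it deletes the data that is already there and keeps only my new data.

How can I add extra training data instead of replacing what is there?

I used the following code (taken from "Open NLP NER is not properly trained"):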

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNames
{
    public static void main(String[] args)
    {
        train("nl", "person", "namen.txt", "nl-ner-personTest.bin");
    }

    // Trains a new model from the given samples and writes it to modelStream.
    public static String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {
        // Closing modelOut also closes modelStream, even when training fails.
        try (OutputStream modelOut = new BufferedOutputStream(modelStream);
             ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                     new PlainTextByLineStream(inputStream, StandardCharsets.UTF_8))) {
            TokenNameFinderModel model = NameFinderME.train(lang, entity, sampleStream,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());
            model.serialize(modelOut);
            return "Model trained and written to the output stream.";
        } catch (IOException io) {
            io.printStackTrace();
        }
        return "Something goes wrong with training module.";
    }

    public static String train(String lang, String entity, String taggedCorpusFile,
                               String modelFile) {
        try {
            // Return a fresh stream on every call so the data can be re-read from the start.
            InputStreamFactory inputStream = () -> new FileInputStream(taggedCorpusFile);

            // FileOutputStream truncates modelFile, so an existing model is replaced.
            return train(lang, entity, inputStream,
                    new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something goes wrong with training module.";
    }
}

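Does anyone have an idea how to solve this?

I ask because, according to the documentation, I need at least 15K sentences for an accurate training set.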

Answer

0

I don't think OpenNLP supports extending an existing binary NLP model.
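If the binary model cannot be extended, the annotated corpus is what has to grow: new sentences are added to the training text and the model is retrained from the full file. Below is a minimal sketch of the "grow the corpus" step, using only the standard `java.nio.file` API. The class name `CorpusAppender`, the method `appendSentences`, and the sample sentences are hypothetical and exist only for illustration; the `<START:person> ... <END>` markup is the format read by `NameSampleDataStream`. It assumes the existing file already ends with a line terminator, otherwise the first appended sentence would be joined to the last existing line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class CorpusAppender {

    // Appends annotated sentences to the corpus file, one sentence per line.
    // The file is created if it does not exist; existing lines are kept.
    // Assumption: an existing file ends with a line terminator.
    public static void appendSentences(Path corpus, List<String> sentences) throws IOException {
        Files.write(corpus, sentences, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Use the path given on the command line, or a temporary file for a dry run.
        Path corpus = args.length > 0 ? Paths.get(args[0]) : Files.createTempFile("namen", ".txt");

        // Hypothetical sample sentences in the name finder training format.
        appendSentences(corpus, Arrays.asList(
                "<START:person> Jan Jansen <END> woont in Utrecht .",
                "Gisteren sprak <START:person> Anna de Vries <END> met de pers ."));

        int count = Files.readAllLines(corpus, StandardCharsets.UTF_8).size();
        System.out.println(corpus + " now holds " + count + " line(s)");
    }
}
```

After appending, the model would be retrained from the whole corpus file, which replaces the old .bin file with one that covers both the old and the new sentences.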

If you have all the training data available, collect all of it and train in one go. You can use SequenceInputStream. I modified your example to use another InputStreamFactory:

public String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) { 

    // .... 
    try { 
     ObjectStream<String> lineStream = new PlainTextByLineStream(trainingDataInputStreamFactory(Arrays.asList(
       new File("trainingdata1.txt"), 
       new File("trainingdata2.txt"), 
       new File("trainingdata3.txt") 
     )), charset); 

     // ... 
    } 

    // ... 
} 

private InputStreamFactory trainingDataInputStreamFactory(List<File> trainingFiles) { 
    return new InputStreamFactory() { 
     @Override 
     public InputStream createInputStream() throws IOException { 
      List<InputStream> inputStreams = trainingFiles.stream() 
        .map(f -> { 
         try { 
          return new FileInputStream(f); 
         } catch (FileNotFoundException e) { 
          e.printStackTrace(); 
          return null; 
         } 
        }) 
        .filter(Objects::nonNull) 
        .collect(Collectors.toList()); 

      return new SequenceInputStream(new Vector<>(inputStreams).elements()); 
     } 
    }; 
} 
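One detail to watch when concatenating files this way: if a file does not end with a line break, its last sentence is joined to the first sentence of the next file. The following sketch is a variant, not the code above. It uses only the JDK, inserts a newline after every file, and fails fast on a missing file instead of skipping it. The names `CorpusConcat`, `concat`, and `readLines` are hypothetical. As far as I know, `NameSampleDataStream` treats the resulting blank lines as document boundaries, but that is worth verifying against the OpenNLP version in use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CorpusConcat {

    // Opens all files as one stream and inserts a newline after each file,
    // so sentences from different files never share a line.
    public static InputStream concat(List<Path> files) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        try {
            for (Path file : files) {
                parts.add(Files.newInputStream(file));
                parts.add(new ByteArrayInputStream("\n".getBytes(StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            // Close whatever was opened before the failure, then report it.
            for (InputStream in : parts) {
                in.close();
            }
            throw e;
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Reads the combined stream line by line, skipping blank lines.
    public static List<String> readLines(List<Path> files) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(concat(files), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Two temporary files; the first one deliberately lacks a trailing newline.
        Path dir = Files.createTempDirectory("corpus");
        Path first = dir.resolve("trainingdata1.txt");
        Path second = dir.resolve("trainingdata2.txt");
        Files.write(first, "<START:person> Jan Jansen <END> woont in Utrecht .".getBytes(StandardCharsets.UTF_8));
        Files.write(second, "Ik zag <START:person> Anna de Vries <END> .\n".getBytes(StandardCharsets.UTF_8));

        for (String line : readLines(Arrays.asList(first, second))) {
            System.out.println(line);
        }
    }
}
```

To use this for training, `concat` could be called from `createInputStream()` of an `InputStreamFactory`, so that every call returns a freshly opened combined stream.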
+0

Thanks, @Schrieveslaach – Patrick

+1

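@Patrick, just for your information: I am developing a toolset that helps you create NLP models from annotated corpora. Please have a look [here](https://git.noc.fh-aachen.de/marc.schreiber/Towards-Effective-NLP-Application-Development), and let me know if you have any questions. ;-) – Schrieveslaach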

+0

Thanks, I will take a look at it. @Schrieveslaach – Patrick