Adding training data to an existing model (bin file)

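I am trying to add extra training data to my nl-personTest.bin file with OpenNLP. My problem is that when I run my code to add the extra training data, it removes the data that was already there and keeps only my new data.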

How can I add extra training data instead of replacing the existing data?

I am using the code below (taken from "Open NLP NER is not properly trained"):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNames {

    private static final String FAILURE = "Something goes wrong with training module.";

    public static void main(String[] args) {
        System.out.println(train("nl", "person", "namen.txt", "nl-ner-personTest.bin"));
    }

    // Trains a name finder model from the given samples and writes it to modelStream.
    public static String train(String lang, String entity, InputStreamFactory inputStream,
            FileOutputStream modelStream) {
        TokenNameFinderModel model;
        // try-with-resources closes the sample stream even when training fails
        try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(inputStream, StandardCharsets.UTF_8))) {
            // Use the lang and entity arguments instead of hard-coded values
            model = NameFinderME.train(lang, entity, sampleStream,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());
        } catch (IOException io) {
            io.printStackTrace();
            return FAILURE;
        }

        // The model is built only from the samples above; writing it replaces the old file content
        try (BufferedOutputStream modelOut = new BufferedOutputStream(modelStream)) {
            model.serialize(modelOut);
        } catch (IOException io) {
            io.printStackTrace();
            return FAILURE;
        }
        return "Model trained and saved.";
    }

    public static String train(String lang, String entity, String taggedCorpusFile,
            String modelFile) {
        try {
            // Open a fresh stream on every call, since the factory may be asked for the data again
            InputStreamFactory inputStream = () -> new FileInputStream(taggedCorpusFile);

            return train(lang, entity, inputStream, new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return FAILURE;
    }
}

Does anyone have an idea how to solve this?

Because if I want an accurate training set, I need at least 15K sentences, according to the documentation.

2017-10-12 Patrick

I don't think OpenNLP supports extending an existing binary NLP model.

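If you still have all of the training data available, collect it and train on everything in one go. You can use a SequenceInputStream for that. I modified your example to use another InputStreamFactory: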

public String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {

    // ....
    try {
        // Read all training files as one continuous stream of lines
        ObjectStream<String> lineStream = new PlainTextByLineStream(trainingDataInputStreamFactory(Arrays.asList(
                new File("trainingdata1.txt"),
                new File("trainingdata2.txt"),
                new File("trainingdata3.txt")
        )), charset);

        // ...
    }

    // ...
}

private InputStreamFactory trainingDataInputStreamFactory(List<File> trainingFiles) {
    return new InputStreamFactory() {
        @Override
        public InputStream createInputStream() throws IOException {
            // Open every training file; a missing file is logged and skipped
            List<InputStream> inputStreams = trainingFiles.stream()
                    .map(f -> {
                        try {
                            return new FileInputStream(f);
                        } catch (FileNotFoundException e) {
                            e.printStackTrace();
                            return null;
                        }
                    })
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());

            // Chain the streams so that they are read back to back
            return new SequenceInputStream(new Vector<>(inputStreams).elements());
        }
    };
}
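
One thing to watch with SequenceInputStream: it joins the files byte for byte, so if a file does not end with a line break, its last sentence runs into the first sentence of the next file. Below is a minimal sketch of one way around that, using only the JDK. The class TrainingDataConcat and the helper concatWithNewlines are names I made up for illustration, they are not part of OpenNLP, and the sample sentences are invented.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TrainingDataConcat {

    // Hypothetical helper: chains the files into one stream and adds a line
    // break after each file, so sentences from two files cannot fuse.
    public static InputStream concatWithNewlines(List<Path> trainingFiles) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        try {
            for (Path file : trainingFiles) {
                parts.add(Files.newInputStream(file)); // throws if the file is missing
                parts.add(new ByteArrayInputStream("\n".getBytes(StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            // Do not leak the streams that were already opened
            for (InputStream opened : parts) {
                opened.close();
            }
            throw e;
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws IOException {
        Path first = Files.createTempFile("train-a", ".txt");
        Path second = Files.createTempFile("train-b", ".txt");
        // Neither sample file ends with a line break
        Files.write(first, "<START:person> Jan <END> woont in Utrecht .".getBytes(StandardCharsets.UTF_8));
        Files.write(second, "<START:person> Piet <END> werkt in Delft .".getBytes(StandardCharsets.UTF_8));

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                concatWithNewlines(Arrays.asList(first, second)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // one sentence per line
            }
        }
    }
}
```

Inside createInputStream() you could return the result of such a helper instead of building the SequenceInputStream directly. Unlike the version above, this sketch fails on a missing file rather than skipping it. If a file already ends with a line break, the extra one produces an empty line; as far as I know the name finder training format reads an empty line as a document boundary, which is probably acceptable between separate files, but check that against your data.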

2017-10-14 06:38:57 Schrieveslaach

Thanks @Schrieveslaach – Patrick

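@Patrick, just for your information: I am developing a toolkit that helps you create NLP models from annotated corpora. Have a look [here](https://git.noc.fh-aachen.de/marc.schreiber/Towards-Effective-NLP-Application-Development), and let me know if you have any questions. ;-) – Schrieveslaach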

Thanks, I will take a look at it. @Schrieveslaach – Patrick
