Stanford CoreNLP: pausing and resuming the annotation pipeline

Typically, when you use a CoreNLP annotation pipeline for, say, NER, you would write the following code:

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos, lemma, ner"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 
pipeline.annotate(document);

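I would like to run the sentence-splitting part of the pipeline above, i.e. ssplit. Afterwards I want to remove sentences that are too long and then continue with the rest of the pipeline. What I have been doing is splitting sentences, filtering them by length, and then performing NER by applying the whole pipeline, i.e. tokenize, ssplit, pos, lemma, ner. So essentially I run tokenize and ssplit twice. Is there a more efficient way to do this? For example, run tokenize and ssplit, pause the pipeline to remove the overly long sentences, and then resume the pipeline with pos, lemma, and ner.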


2015-12-15 user1893354

You can create two pipeline objects, where the second one takes the later annotators. So:

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 
pipeline.annotate(document);

Followed by:

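Properties props = new Properties();
props.put("annotators", "pos, lemma, ner");
// false = do not enforce annotator requirements; tokenize and ssplit
// have already been run on this document by the first pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
pipeline.annotate(document);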

Note that, of course, some annotations (e.g., character offsets) will be unintuitive if you delete sentences in the middle.


2015-12-15 07:29:24

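Can you explain what the document variable is in each of the two pipelines? If they are both strings, wouldn't the second pipeline need tokenization as well? You can't do pos without tokens, right? – user1893354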

OK, now I see that 'pipeline.annotate(document)' mutates the document. Now I need a way to modify 'document' between the two pipelines so that sentences are filtered by length. – user1893354

That is correct. 'document' is an 'Annotation' object, which is mutated in place by the pipeline. –
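
For the filtering step between the two pipelines, here is a minimal sketch of the length filter itself. It uses only the Java standard library so it can be run without CoreNLP on the classpath. The names 'SentenceFilter', 'keepShort' and the threshold of 5 tokens are hypothetical and not part of CoreNLP; sentences are modelled as plain token lists purely for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

final class SentenceFilter {
    // Hypothetical helper (not part of CoreNLP): keeps the sentences whose
    // length, as measured by `length`, is at most `maxLength`.
    static <S> List<S> keepShort(List<S> sentences, ToIntFunction<S> length, int maxLength) {
        List<S> kept = new ArrayList<>();
        for (S sentence : sentences) {
            if (length.applyAsInt(sentence) <= maxLength) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in data: each sentence is modelled as a list of tokens.
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("Short", "sentence", "."),
            Arrays.asList("A", "much", "longer", "sentence", "that", "gets", "dropped", "."));
        // Keep only sentences with at most 5 tokens (arbitrary demo threshold).
        List<List<String>> kept = keepShort(sentences, List::size, 5);
        System.out.println(kept.size() + " of " + sentences.size() + " sentences kept");
    }
}
```

With CoreNLP, the sentence list would presumably come from 'document.get(CoreAnnotations.SentencesAnnotation.class)', the length from 'sentence.get(CoreAnnotations.TokensAnnotation.class).size()', and the filtered list would be written back with 'document.set(CoreAnnotations.SentencesAnnotation.class, kept)' before the second pipeline runs. I have not tested that wiring against a specific CoreNLP version; the document-level token list and sentence indices are not updated by this filter, so anything that relies on them may be inconsistent afterwards.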
