How do I extract text from a .doc file using Apache POI?
I am using the following code snippets to extract text from a .doc file:
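import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

// Walk the document's paragraphs and concatenate their raw text.
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
Range range = document.getRange();
int len = range.numParagraphs();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < len; i++) {
    builder.append(range.getParagraph(i).text());
}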
and
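import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Let WordExtractor split the document into paragraphs, then concatenate them.
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
WordExtractor wordExtractor = new WordExtractor(document);
String[] paragraphs = wordExtractor.getParagraphText();
StringBuilder builder = new StringBuilder();
for (String p : paragraphs) {
    builder.append(p);
}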
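However, both of them always output some strange characters. For example:
PAGEREF_Toc351848910\h10HYPERLINK\l
_Toc351848911
CITATIONPla\l1033[HYPERLINK\l"Pla"13]
So I would like to know where these come from and how to extract just the text from a .doc file. Thanks in advance.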
The *strange* text you show is a table-of-contents entry, a TOC reference, and a citation. Sorry, I don't know how to remove them. – grahamj42 2013-03-23 20:45:12
Have you tried using [WordExtractor#stripFields(String)](http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)) to remove them? – Gagravarr 2013-03-24 21:09:18
It works. Thank you very much. – thoitbk 2013-03-28 17:55:28
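For anyone landing here later: with the second snippet, the suggestion above amounts to something like `builder.append(WordExtractor.stripFields(p))` inside the loop. To show roughly what field stripping does, here is a minimal plain-Java sketch that needs no POI on the classpath. It is not POI's implementation. The class name `FieldStripper` is made up for illustration, and it assumes the usual .doc field markers in the extracted text: U+0013 (field begin), U+0014 (separator between the field code and its displayed result), and U+0015 (field end). The field code (e.g. `PAGEREF _Toc... \h`) is dropped and the displayed result is kept.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: NOT POI's implementation. FieldStripper is a hypothetical name.
final class FieldStripper {
    // Assumed .doc field markers as they appear in the raw extracted text.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /** Drops field codes, keeps the text a field displays. Handles nested fields. */
    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while we are still inside its field code.
        Deque<Boolean> inFieldCode = new ArrayDeque<>();
        int hidden = 0; // how many enclosing fields are currently in their field-code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inFieldCode.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from "code" to "displayed result".
                if (!inFieldCode.isEmpty() && inFieldCode.peek()) {
                    inFieldCode.pop();
                    inFieldCode.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator has no displayed result at all.
                if (!inFieldCode.isEmpty() && inFieldCode.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c); // ordinary text or a displayed field result
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See page \u0013 PAGEREF _Toc351848910 \\h \u001410\u0015.";
        System.out.println(stripFields(raw)); // prints: See page 10.
    }
}
```

In this sketch, stray separator or end markers outside any field are simply discarded. For real documents, the library method linked in the comment above is the safer choice.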