Java: Apache POI: Can I get clean text from MS Word (.doc) files?

When I use Apache POI, the strings I get (programmatically) from an MS Word file differ from the text I see when I open the same file in MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc"); 
InputStream inputStrm = new FileInputStream(someFile); 
HWPFDocument wordDoc = new HWPFDocument(inputStrm); 
System.out.println(wordDoc.getText());

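The output is a single line with many "invalid" characters (yes, the "boxes") and many unwanted strings, such as "FORMTEXT", "HYPERLINK \l "_Toc##########"" (where each '#' is a digit), "PAGEREF _Toc########## \h 4", and so on.

The following code "fixes" the single-line problem, but keeps all the invalid characters and unwanted text: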

File someFile = new File("some\\path\\MSWFile.doc"); 
InputStream inputStrm = new FileInputStream(someFile); 
WordExtractor wordExtractor = new WordExtractor(inputStrm); 
for(String paragraph:wordExtractor.getParagraphText()){ 
    System.out.println(paragraph); 
}

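I don't know whether I'm using the wrong method to extract the text, but this is what I came up with after reading POI's quick-guide. If I am, what is the correct approach?

If the output is correct, is there a standard way to get rid of the unwanted text, or do I have to write my own filter?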


2012-04-20 XenoRo

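There are two options: one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

The first option is to use WordExtractor, but pass each piece of text it returns through stripFields(String). This removes the text-based fields embedded in the text, such as the HYPERLINK you've seen. Your code would become: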

NPOIFSFileSystem fs = new NPOIFSFileSystem(file); 
WordExtractor extractor = new WordExtractor(fs.getRoot()); 

for(String rawText : extractor.getParagraphText()) { 
    // stripFields is a static helper on WordExtractor 
    String text = WordExtractor.stripFields(rawText); 
    System.out.println(text); 
}
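To illustrate what this kind of field stripping involves, here is a minimal standalone sketch. It is not POI's implementation. It assumes the raw text marks each field with the control characters U+0013 (field begin), U+0014 (separator between the field code and its visible result) and U+0015 (field end), which is how the binary .doc format delimits fields. The class and method names (FieldStripSketch, stripFieldCodes) are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldStripSketch {
    // Assumed field marks in the raw .doc text (see the note above)
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /**
     * Drops field codes (e.g. HYPERLINK \l "_Toc...") and the field marks,
     * keeping only the visible result text of each field.
     */
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: TRUE while we are inside its code part,
        // FALSE once its separator has been seen (result part).
        Deque<Boolean> openFields = new ArrayDeque<>();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty()) {
                    openFields.pop();
                }
            } else if (!openFields.contains(Boolean.TRUE)) {
                // Keep the character only if no enclosing field is in its code part
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \\l \"_Toc123\" \u0014Chapter 1\u0015 now";
        System.out.println(stripFieldCodes(raw)); // prints "See Chapter 1 now"
    }
}
```

In real code, prefer WordExtractor.stripFields as shown above; the sketch only shows why strings such as HYPERLINK and PAGEREF appear in the raw text and how they can be separated from the visible text.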

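The other option is to use Apache Tika. Tika provides text extraction and metadata for a wide variety of files, so the same code works for .doc, .docx, .pdf and many others. To get clean, plain text from your Word document (you can also get XHTML if you prefer), you would do something like this: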

TikaConfig tika = TikaConfig.getDefaultConfig(); 
TikaInputStream stream = TikaInputStream.get(file); 
ContentHandler handler = new BodyContentHandler(); 
Metadata metadata = new Metadata(); 
// Parse the stream opened above; the handler collects the body text 
tika.getParser().parse(stream, handler, metadata, new ParseContext()); 
String text = handler.toString();


2012-04-22 18:56:21 Gagravarr

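The second solution in this answer did not work in my tests. TIKA-1.2 returns FORMCHECKBOX and other content from .doc files. .docx files work fine, though. – Simon 2013-02-07 14:55:53

I suggest you try the latest Tika version, 1.3. If the problem persists, please [file a bug](https://issues.apache.org/jira/browse/TIKA) and upload a sample file so we can investigate! – Gagravarr 2013-02-07 15:15:01

For what it's worth, this still happens for me with Tika 1.3. – damd 2013-02-22 16:16:38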
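If field keywords such as FORMCHECKBOX still come through, as reported above, one possible workaround is a small post-filter applied to the extracted text. The sketch below is only that: the class name, the keyword list (FORMTEXT, FORMCHECKBOX, FORMDROPDOWN) and the choice of which control characters to drop are assumptions for illustration, not part of POI or Tika. A blunt keyword match like this can also remove legitimate text that happens to contain those words.

```java
import java.util.regex.Pattern;

final class ExtractedTextCleaner {
    // Assumed list of leftover form-field keywords, plus one optional trailing space
    private static final Pattern FORM_KEYWORDS =
            Pattern.compile("\\bFORM(?:TEXT|CHECKBOX|DROPDOWN)\\b ?");

    // Control characters other than tab, line feed and carriage return
    private static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    /** Removes the assumed keywords first, then any remaining control characters. */
    static String clean(String text) {
        String withoutKeywords = FORM_KEYWORDS.matcher(text).replaceAll("");
        return CONTROL_CHARS.matcher(withoutKeywords).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("Name: FORMTEXT John\u0007")); // prints "Name: John"
    }
}
```

The filter would be applied to whatever string the extractor returns, for example the result of handler.toString() in the Tika snippet above.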

This class can read both .doc and .docx files in Java. For this I use tika-app-1.2.jar:

/* 
* This class is used to read .doc and .docx files 
* 
* @author Developer 
* 
*/ 

import java.io.ByteArrayOutputStream; 
import java.io.File; 
import java.io.InputStream; 
import java.io.OutputStream; 
import java.io.OutputStreamWriter; 
import java.net.URL; 
import org.apache.tika.detect.DefaultDetector; 
import org.apache.tika.detect.Detector; 
import org.apache.tika.io.TikaInputStream; 
import org.apache.tika.metadata.Metadata; 
import org.apache.tika.parser.AutoDetectParser; 
import org.apache.tika.parser.ParseContext; 
import org.apache.tika.parser.Parser; 
import org.apache.tika.sax.BodyContentHandler; 
import org.xml.sax.ContentHandler; 

class TextExtractor { 
    private OutputStream outputstream; 
    private ParseContext context; 
    private Detector detector; 
    private Parser parser; 
    private Metadata metadata; 
    private String extractedText; 

    public TextExtractor() { 
     context = new ParseContext(); 
     detector = new DefaultDetector(); 
     parser = new AutoDetectParser(detector); 
     context.set(Parser.class, parser); 
     outputstream = new ByteArrayOutputStream(); 
     metadata = new Metadata(); 
    } 

    public void process(String filename) throws Exception { 
     URL url; 
     File file = new File(filename); 
     if (file.isFile()) { 
      url = file.toURI().toURL(); 
     } else { 
      url = new URL(filename); 
     } 
     InputStream input = TikaInputStream.get(url, metadata); 
     try { 
      ContentHandler handler = new BodyContentHandler(outputstream); 
      parser.parse(input, handler, metadata, context); 
     } finally { 
      // Close the stream even if parsing fails 
      input.close(); 
     } 
    } 

    public void getString() { 
     //Get the text into a String object 
     extractedText = outputstream.toString(); 
     //Do whatever you want with this String object. 
     System.out.println(extractedText); 
    } 

    public static void main(String args[]) throws Exception { 
     if (args.length == 1) { 
      TextExtractor textExtractor = new TextExtractor(); 
      textExtractor.process(args[0]); 
      textExtractor.getString(); 
     } else { 
      throw new IllegalArgumentException("Usage: java TextExtractor <file-or-URL>"); 
     } 
    } 
}

Compile (on Windows, replace the ':' classpath separator with ';'):

javac -cp ".:tika-app-1.2.jar" TextExtractor.java

Run:

java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc


2012-08-17 08:53:43 Vyas

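Try this; it works for me and is a pure POI solution. Note that XWPFDocument only reads .docx files (Word 2007 and later); for the older binary .doc format you will have to look for the HWPFDocument counterpart.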

InputStream inputstream = new FileInputStream(m_filepath); 
// read the file into the XWPF (.docx) document model 
XWPFDocument adoc = new XWPFDocument(inputstream); 

// get the full text 
String aString = new XWPFWordExtractor(adoc).getText();

Now, if you only want certain parts, you can get the paragraph text directly, without the text extractor, like this:

for (XWPFParagraph p : adoc.getParagraphs()) 
{ 
    System.out.println(p.getParagraphText()); 
}


2014-11-04 10:16:32
