2013-03-23 114 views
1

How to extract text from a .doc file using Apache POI?

I am using the following code snippets to extract text from a .doc file:

// Approach 1: walk the paragraphs of the document's Range
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
Range range = document.getRange();
int len = range.numParagraphs();
StringBuilder builder = new StringBuilder();

for (int i = 0; i < len; i++) {
    builder.append(range.getParagraph(i).text());
}

// Approach 2: let WordExtractor return the paragraph texts
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
WordExtractor wordExtractor = new WordExtractor(document);
String[] paragraphs = wordExtractor.getParagraphText();
StringBuilder builder = new StringBuilder();
for (String p : paragraphs) {
    builder.append(p);
}

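However, both of them always output some strange characters, for example: PAGEREF_Toc351848910\h10HYPERLINK\l_Toc351848911CITATIONPla\l1033[HYPERLINK\l"Pla"13]. So I would like to know where these come from and how to extract just the text from the .doc file. Thanks in advance.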

+1

The *weird* text you show is a table of contents entry, a TOC reference and a citation. Sorry, I don't know how to remove them. – grahamj42 2013-03-23 20:45:12

+1

Have you tried removing them with [WordExtractor#stripFields(String)](http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String))? – Gagravarr 2013-03-24 21:09:18

+0

It works. Thanks a lot. – thoitbk 2013-03-28 17:55:28
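To illustrate what the comments above describe: Word stores each field (PAGEREF, HYPERLINK, CITATION and so on) inline in the text, bracketed by the control characters 0x13 (field begin), 0x14 (separator before the displayed result) and 0x15 (field end). With POI, the linked static method can be applied to the extracted string, for example `WordExtractor.stripFields(builder.toString())`. The following is only a minimal, self-contained sketch of that kind of stripping, not POI's own implementation. The class name `FieldStripper` and the sample strings are made up for illustration, and it assumes the markers survive in the extracted text as described.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, for illustration only (not part of Apache POI).
final class FieldStripper {
    private static final char FIELD_BEGIN = 0x13;     // starts the field code
    private static final char FIELD_SEPARATOR = 0x14; // code ends, displayed result starts
    private static final char FIELD_END = 0x15;       // closes the field

    // Drops field codes (e.g. " PAGEREF _Toc1 \h ") and keeps the displayed results.
    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int hidden = 0; // number of open fields that are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inCode.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR && !inCode.isEmpty()) {
                if (inCode.pop()) {
                    hidden--;
                }
                inCode.push(Boolean.FALSE); // now inside the displayed result
            } else if (c == FIELD_END && !inCode.isEmpty()) {
                if (inCode.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary text, or a stray marker outside any field: keep it.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: a page reference field whose displayed result is "10".
        String raw = "See page " + FIELD_BEGIN + " PAGEREF _Toc351848910 \\h "
                + FIELD_SEPARATOR + "10" + FIELD_END + ".";
        System.out.println(stripFields(raw)); // prints: See page 10.
    }
}
```

Nested fields, such as a CITATION inside a HYPERLINK, are handled by the stack: text is kept only when no enclosing field is still in its code part.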

Answers

0

Thanks. I hope this gives you some ideas.

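// Imports this snippet needs. The iText package names assume iText 5;
// older iText releases use com.lowagie.text instead.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {
    try {
        Document pdfdoc = new Document();

        HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

        // Wrap the HWPFDocument in a WordExtractor to get at its text.
        WordExtractor we = new WordExtractor(doc);

        // Fixed: the original referenced an undefined variable "desc".
        OutputStream outputFile = new FileOutputStream(new File(outputPdf));

        // Create a PDF writer that writes pdfdoc to the output file.
        PdfWriter.getInstance(pdfdoc, outputFile);

        pdfdoc.open();

        Paragraph para = new Paragraph();

        // Collect all paragraphs of the Word document.
        String[] paragraphs = we.getParagraphText();

        for (int i = 0; i < paragraphs.length; i++) {
            // Append each extracted paragraph to the single PDF paragraph.
            para.add(paragraphs[i]);
            //para.add(new Chunk(Chunk.NEWLINE));
        }
        // Print all extracted paragraphs together.
        System.out.println(para);
        // Add the combined paragraph to the PDF document.
        pdfdoc.add(para);

        pdfdoc.close();
        we.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}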
+0

This seems to create a PDF document. How does it address the original question in any way? – Gagravarr 2017-02-16 11:56:15

+0

`System.out.println(para);` prints the extracted paragraphs. – 2017-02-17 04:30:15
