2011-09-20 178 views

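I am trying to parse PDF files with Apache Tika, passing the binary content in through a ByteArrayInputStream. Some PDFs parse perfectly well, but others now fail. I was able to parse these same PDF files with Tika before, but since I switched to ByteArrayInputStream I have started getting errors, so I suspect something is wrong with the byte array. This is the error I get when parsing binary files (mostly PDFs):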

org.apache.tika.exception.TikaException: Unexpected RuntimeException from [email protected] 

Here is my code:

if (page.isBinary()) { 
    handleBinary(page, curURL); 
} 

public int handleBinary(Page page, WebURL curURL) { 
    try { 
      binaryParser.parse(page.getBinaryData()); 
      page.setText(binaryParser.getText()); 
      handleMetaData(page, binaryParser.getMetaData()); 


      //System.out.println(" pdf url " +page.getWebURL().getURL()); 
      //System.out.println("Text" +page.getText()); 
    } catch (Exception e) { 
      // Report the failure instead of silently swallowing it 
      e.printStackTrace(); 
    } 
      return PROCESS_OK; 
} 

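// BinaryParser.java 
import java.io.ByteArrayInputStream; 
import java.io.InputStream; 
import java.util.HashMap; 
import java.util.Map; 

import org.apache.commons.io.IOUtils; 
import org.apache.tika.Tika; 
import org.apache.tika.metadata.Metadata; 

public class BinaryParser { 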

      private String text; 
      private Map<String, String> metaData; 

      private Tika tika; 

      public BinaryParser() { 
       tika = new Tika(); 
      } 

      public void parse(byte[] data) { 
       InputStream is = null; 
       try { 
        is = new ByteArrayInputStream(data); 
        text = null; 
        Metadata md = new Metadata(); 
        metaData = new HashMap<String, String>(); 
        text = tika.parseToString(is, md).trim(); 
        processMetaData(md); 
       } catch (Exception e) { 
        e.printStackTrace(); 
       } finally { 
        IOUtils.closeQuietly(is); 
       } 
      } 

      public String getText() { 
       return text; 
      } 

      public void setText(String text) { 
       this.text = text; 
      } 


      private void processMetaData(Metadata md){ 
       if ((getMetaData() == null) || (!getMetaData().isEmpty())) { 
        setMetaData(new HashMap<String, String>()); 
       } 
       for (String name : md.names()){ 
        getMetaData().put(name.toLowerCase(), md.get(name)); 
       } 
      } 

      public Map<String, String> getMetaData() { 
       return metaData; 
      } 

      public void setMetaData(Map<String, String> metaData) { 
       this.metaData = metaData; 
      } 

     } 

public class Page { 

     private WebURL url; 

     private String html; 

     // Data for textual content 
     private String text; 

     private String title; 

     private String keywords; 
     private String authors; 
     private String description; 
     private String contentType; 
     private String contentEncoding; 

     private byte[] binaryData; 

     private List<WebURL> urls; 

     private ByteBuffer bBuf; 

     private final static String defaultEncoding = Configurations 
       .getStringProperty("crawler.default_encoding", "UTF-8"); 

     public boolean load(final InputStream in, final int totalsize, 
       final boolean isBinary) { 
      if (totalsize > 0) { 
       this.bBuf = ByteBuffer.allocate(totalsize + 1024); 
      } else { 
       this.bBuf = ByteBuffer.allocate(PageFetcher.MAX_DOWNLOAD_SIZE); 
      } 
      final byte[] b = new byte[1024]; 
      int len; 
      double finished = 0; 
      try { 
       while ((len = in.read(b)) != -1) { 
        if (finished + b.length > this.bBuf.capacity()) { 
         break; 
        } 
        this.bBuf.put(b, 0, len); 
        finished += len; 
       } 
      } catch (final BufferOverflowException boe) { 
       System.out.println("Page size exceeds maximum allowed."); 
       return false; 
      } catch (final Exception e) { 
       System.err.println(e.getMessage()); 
       return false; 
      } 

      this.bBuf.flip(); 
      if (isBinary) { 
       binaryData = new byte[bBuf.limit()]; 
       bBuf.get(binaryData); 
      } else { 
       this.html = ""; 
       this.html += Charset.forName(defaultEncoding).decode(this.bBuf); 
       this.bBuf.clear(); 
       if (this.html.length() == 0) { 
        return false; 
       } 
      } 
      return true; 
     } 
    public boolean isBinary() { 
     return binaryData != null; 
    } 

    public byte[] getBinaryData() { 
     return binaryData; 
    } 
} 

Any suggestions as to what I am doing wrong?

Update: after upgrading to PDFBox 1.6.0, I started getting this error for some PDFs:

Parsing Error, Skipping Object 
java.io.IOException: expected='endstream' actual='' [email protected] 
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439) 
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552) 
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184) 
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088) 
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053) 

And for some other PDFs, this error:

Did not found XRef object at specified startxref position 0 
Invalid dictionary, found: '' but expected: '/' 
WARN [Crawler 2] Did not found XRef object at specified startxref position 0 

I would never expect a NullPointerException from any API unless its javadoc says so. Have you checked whether this is a bug? – Kashyap

Answer


This is a known bug in PDFBox 1.4.0. Just update to PDFBox 1.5.0 or later.

Check the release notes:

[PDFBOX-578] NullPointerException (NPE) in PDPageNode.getCount

JIRA ticket


Thanks for the answer. After updating to PDFBox 1.6.0 I now get new errors; I have updated the question. – ferhan


This may be another PDFBox issue, but for that error I would lean more toward a corrupt PDF. Do you get it with all PDFs, or only one or two? – Gagravarr
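
One way to test the corrupt-PDF theory is to look at the bytes actually handed to Tika rather than at the file as seen in a viewer. The sketch below is only a heuristic under stated assumptions: `PdfBytesCheck` and its methods are hypothetical names (not part of Tika or PDFBox), and it merely checks that the array starts with `%PDF-` and carries an `%%EOF` marker in its last 1024 bytes. A truncated download would be expected to fail the second check, which would be consistent with errors such as `expected='endstream' actual=''`.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class PdfBytesCheck {

    // Assumption: the end-of-file marker sits within the last 1024 bytes.
    private static final int TAIL_WINDOW = 1024;

    private PdfBytesCheck() {
    }

    /** True if the data begins with the "%PDF-" signature (strict: offset 0 only). */
    static boolean hasPdfHeader(byte[] data) {
        if (data == null || data.length < 5) {
            return false;
        }
        String head = new String(data, 0, 5, StandardCharsets.ISO_8859_1);
        return head.equals("%PDF-");
    }

    /** True if "%%EOF" appears somewhere in the trailing window of the data. */
    static boolean hasEofMarker(byte[] data) {
        if (data == null) {
            return false;
        }
        int start = Math.max(0, data.length - TAIL_WINDOW);
        String tail = new String(data, start, data.length - start,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /** Heuristic only: passing both checks does not prove the PDF is valid. */
    static boolean looksComplete(byte[] data) {
        return hasPdfHeader(data) && hasEofMarker(data);
    }
}
```

In the question's code this could be called on `page.getBinaryData()` just before `binaryParser.parse(...)`, logging the URL whenever `looksComplete` returns false.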


@Gagravarr I get it for a few PDFs, but I can open those PDFs, so they are not corrupt. – ferhan
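
A PDF that opens fine in a viewer can still reach the parser incomplete. In the `Page.load` method from the question, the read loop hits `break` once `finished + b.length` exceeds the buffer capacity, so the rest of the stream is silently dropped. Whether that is what happens for the failing PDFs is not confirmed in this thread. A minimal sketch of an alternative that uses only the JDK follows; `FullStreamReader`, `readFully` and `maxBytes` are made-up names for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of the crawler, Tika or PDFBox. */
final class FullStreamReader {

    private FullStreamReader() {
    }

    /**
     * Reads the whole stream into a byte array. If the content is larger
     * than maxBytes, an IOException is thrown so that callers never receive
     * a silently truncated document.
     */
    static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int len;
        while ((len = in.read(chunk)) != -1) {
            if (out.size() + len > maxBytes) {
                throw new IOException("Content exceeds " + maxBytes
                        + " bytes; refusing to return truncated data");
            }
            // Append only the bytes actually read in this iteration.
            out.write(chunk, 0, len);
        }
        return out.toByteArray();
    }
}
```

The design choice here is to fail loudly: an oversized download surfaces as an exception with a clear message instead of as a parser error further down the line.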
