Reading Nutch content data with Java

I want to read the content data stored in a Nutch segment folder. I believe the content data file is in a custom format.

I tried using Nutch's Content class, but it does not recognize the format.

2011-09-21 surajz

org.apache.nutch.segment.SegmentReader

This class has a MapReduce implementation that reads the content data in a segment directory.


2011-09-22 03:39:46 surajz

import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        // Let Hadoop consume its generic options; what is left are our own arguments
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("Usage: ContentReader <segment_dir>");
            System.exit(1);
        }
        FileSystem fs = FileSystem.get(conf);
        // The first remaining argument is the path of the segment directory
        String segment = remainingArgs[0];
        // Only part-00000 is read here; a segment may contain further part-* directories
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            Text key = new Text();            // the key of each record
            Content content = new Content();  // the fetched content, reused per record
            // Loop through the records of the sequence file
            while (reader.next(key, content)) {
                byte[] bytes = content.getContent();
                System.out.write(bytes, 0, bytes.length);
            }
            System.out.flush();
        } finally {
            reader.close();
        }
    }
}


2013-04-02 12:21:07 kitwalker

Thanks for the above! Is there any way to retrieve only a given file type (docx, pdf, etc.)? – change

String contentType = content.getContentType(); if (!contentType.equalsIgnoreCase("application/pdf")) { – kitwalker
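The content-type check in the comment above could be pulled into a small helper, roughly as sketched below. This is only an illustration: the class name ContentTypeFilter and the method matches are hypothetical and not part of Nutch. The sketch assumes the content type string may be null or may carry parameters such as "; charset=UTF-8", which a plain equalsIgnoreCase would not handle.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: compares MIME type strings.
public class ContentTypeFilter {

    /**
     * Returns true when contentType names the wanted MIME type, ignoring
     * case, surrounding whitespace, and any parameters after a ';'.
     * A null argument never matches.
     */
    public static boolean matches(String contentType, String wanted) {
        if (contentType == null || wanted == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT)
                .equals(wanted.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Sample values for illustration only
        String[] samples = {
            "application/pdf", "Application/PDF; charset=binary", "text/html", null
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample, "application/pdf"));
        }
    }
}
```

Inside the read loop of ContentReader, one would presumably skip unwanted records with something like: if (!ContentTypeFilter.matches(content.getContentType(), "application/pdf")) continue; For docx the wanted type would be application/vnd.openxmlformats-officedocument.wordprocessingml.document.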

Awesome! Thanks! What arguments does argv stand for, and in what order? – change
