2011-09-21 54 views
1

I want to read the content data in a Nutch segment folder using Java. I think the content data file is written in a custom format.

I tried using Nutch's Content class, but it does not recognize the format.

Answers

0
org.apache.nutch.segment.SegmentReader 

This class has a MapReduce implementation for reading the content data in a segment directory.

5
import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        // Let Hadoop consume its generic options first; what is left is ours
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("Usage: ContentReader <segment_dir>");
            System.exit(1);
        }
        FileSystem fs = FileSystem.get(conf);
        // The first remaining argument is the path to the segment directory
        String segment = remainingArgs[0];
        // Only the first part file (part-00000) of the content directory is read
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            Text key = new Text();           // record key (the page URL)
            Content content = new Content(); // raw fetched bytes plus metadata
            // Iterate over every record in the sequence file
            while (reader.next(key, content)) {
                byte[] bytes = content.getContent();
                System.out.write(bytes, 0, bytes.length);
            }
            System.out.flush();
        } finally {
            // Release the file handle even if reading fails
            reader.close();
        }
    }
}
+0

Thanks for the above! Is there any way to retrieve only a given file type (docx, pdf, etc.)? – change

+0

String contentType = content.getContentType(); if (!contentType.equalsIgnoreCase("application/pdf")) { – kitwalker
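
The check in the comment above could be sketched roughly as follows. This is only a minimal illustration written against plain strings, so it runs without Nutch on the classpath. The class name ContentTypeFilter and the method isType are hypothetical, not part of Nutch; inside the reader loop one would presumably pass content.getContentType() as the first argument. The sketch also assumes the type string may be null or may carry parameters such as a charset, which a bare equalsIgnoreCase would not handle.

```java
public class ContentTypeFilter {
    // Hypothetical helper (not a Nutch API): true when contentType names
    // the wanted MIME type. Ignores case, surrounding whitespace, and any
    // parameters after a semicolon, e.g. "; charset=UTF-8".
    public static boolean isType(String contentType, String wanted) {
        if (contentType == null || wanted == null) {
            return false;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().equalsIgnoreCase(wanted.trim());
    }

    public static void main(String[] args) {
        // Sample values standing in for content.getContentType()
        String[] samples = {
            "application/pdf",
            "Application/PDF; charset=binary",
            "text/html",
            null
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isType(s, "application/pdf"));
        }
    }
}
```

For docx the wanted type would be application/vnd.openxmlformats-officedocument.wordprocessingml.document, assuming the fetcher recorded that type for the page.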

+0

Awesome! Thanks! What arguments does argv stand for, and in what order? – change