Hadoop: Providing a directory as input to a MapReduce job

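I am using Cloudera Hadoop. I can run a simple MapReduce program in which I provide a single file as the input to the job.

This file lists all the other files to be processed by the mapper function.

However, I am stuck at one point.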

/folder1 
    - file1.txt 
    - file2.txt 
    - file3.txt

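How can I specify "/folder1" as the input path of the MapReduce program, so that it processes every file inside that directory?

Any ideas?

EDIT:

1) Initially, I provided inputFile.txt as the input to the MapReduce program. It worked perfectly.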

>inputFile.txt 
file1.txt 
file2.txt 
file3.txt

2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line.

hadoop jar ABC.jar /folder1 /output

2013-11-20 Javascript is GOD

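How are you submitting/creating the job?

Check the edit.....

Yes, that is how it works. What is your question?

You can use FileSystem.listStatus to get the list of files in the given directory. The code could look like this: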

// Get the FileSystem; conf must be initialized properly
// (assumed here to be a JobConf, as used by the old org.apache.hadoop.mapred API)
FileSystem fs = FileSystem.get(conf);
// Get the FileStatus list for the given directory
FileStatus[] statusList = fs.listStatus(new Path(args[0]));
if (statusList != null) {
    for (FileStatus status : statusList) {
        // Add each file to the list of inputs for the MapReduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}

2013-11-20 13:14:12 zhutoulala

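After adding the paths, how do we access them in the map task? Does it return the file contents directly?

The problem is that FileInputFormat does not read files recursively from the input path directory.

Solution: add the following line

FileInputFormat.setInputDirRecursive(job, true);

before this line in the MapReduce code: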

FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here to see in which version this was fixed.
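To show where these two calls sit in a complete driver, here is a minimal sketch using the newer org.apache.hadoop.mapreduce API. The class names DirectoryInputDriver and LineMapper, the job name, and the map-only setup are assumptions made for illustration; they are not from the original answer. It also assumes a Hadoop version that provides setInputDirRecursive.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DirectoryInputDriver {

    // Hypothetical mapper: emits (line, 1) for every line of every input file.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, ONE);
        }
    }

    // Builds the job without submitting it, so the configuration can be inspected.
    public static Job configure(Configuration conf, String inputDir, String outputDir)
            throws IOException {
        Job job = Job.getInstance(conf, "directory-input");
        job.setJarByClass(DirectoryInputDriver.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Descend into subdirectories of the input directory as well.
        FileInputFormat.setInputDirRecursive(job, true);
        // A directory path is accepted here; the files inside it become the input.
        FileInputFormat.addInputPath(job, new Path(inputDir));
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = configure(new Configuration(), args[0], args[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With a driver like this, the command from the question (hadoop jar ABC.jar /folder1 /output) passes /folder1 as args[0]. For a flat directory such as the one in the question, the setInputDirRecursive call should not be needed; it matters only when the directory contains nested subdirectories.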

2014-05-28 09:33:50 shashaDenovo

You can use HDFS wildcards to provide multiple files.

So, the solution:

hadoop jar ABC.jar /folder1/* /output

or

hadoop jar ABC.jar /folder1/*.txt /output
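To get a feel for which names each pattern selects, here is a small local stand-in that uses the JDK's own glob matcher. This is only an illustration: Hadoop expands input globs itself when it lists the input paths, and its glob syntax is similar but not guaranteed to be identical to java.nio's. The class name GlobDemo and the file notes.log are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlobDemo {

    // Returns the names that match the given glob pattern, in input order.
    public static List<String> select(String glob, List<String> names) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // notes.log is a made-up extra file to show what *.txt leaves out.
        List<String> names = Arrays.asList("file1.txt", "file2.txt", "file3.txt", "notes.log");
        System.out.println(select("*.txt", names)); // [file1.txt, file2.txt, file3.txt]
        System.out.println(select("*", names));     // [file1.txt, file2.txt, file3.txt, notes.log]
    }
}
```

One practical caveat: if a matching path also exists on the local filesystem, the shell may expand the wildcard before Hadoop sees it, so quoting the argument (for example '/folder1/*.txt') is the safer choice.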

2015-11-07 11:02:32 Dmitry

Use the MultipleInputs class.

MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat>
inputFormatClass, Class<? extends Mapper> mapperClass)

Take a look at the working code
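Since the linked code is not reproduced here, the following is a minimal sketch of how MultipleInputs.addInputPath might be called from a driver. The class names MultipleInputsDriver and PassThroughMapper, the job name, and the choice of TextInputFormat are assumptions for illustration, not the code the answer refers to.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultipleInputsDriver {

    // Hypothetical mapper: inherits Mapper's default map(), which passes
    // each (offset, line) pair through unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
    }

    // Registers every input path with its own input format and mapper class.
    public static Job configure(Configuration conf, String[] inputPaths, String outputDir)
            throws IOException {
        Job job = Job.getInstance(conf, "multiple-inputs");
        job.setJarByClass(MultipleInputsDriver.class);
        for (String input : inputPaths) {
            MultipleInputs.addInputPath(
                    job, new Path(input), TextInputFormat.class, PassThroughMapper.class);
        }
        job.setNumReduceTasks(0); // map-only, to keep the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
        return job;
    }

    public static void main(String[] args) throws Exception {
        // Assumes the last argument is the output directory and the rest are inputs.
        String[] inputs = java.util.Arrays.copyOfRange(args, 0, args.length - 1);
        Job job = configure(new Configuration(), inputs, args[args.length - 1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

MultipleInputs is mainly useful when different paths need different input formats or mappers. For a single directory of similar text files, passing the directory to FileInputFormat.addInputPath is usually simpler.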

2016-01-07 15:27:20
