
I would like to understand how to integrate a call to a MapReduce job from within a Pig script. How do I run MapReduce from a Pig script?

The link I referred to: https://wiki.apache.org/pig/NativeMapReduce

But I do not understand how it would know which code is my mapper and which is my reducer. The explanation there is not very clear.

If someone could illustrate it with an example, that would be very helpful.

Thanks in advance, cheers :)

Answers

From the pig documentation:

A = LOAD 'WordcountInput.txt'; 
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' 
    AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; 


In the above example, Pig stores the input data from A into inputDir and loads the job's output data from outputDir.

In addition, there is a jar in HDFS called wordcount.jar which has a main class that takes care of setting the mapper and reducer, the input and output, and so on.

You could also invoke the org.myorg.WordCount class, which takes care of the map-reduce job, directly with hadoop jar mymr.jar org.myorg.WordCount inputDir outputDir.
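
For concreteness, here is a minimal sketch of what the main class inside such a jar typically looks like. The thread never shows org.myorg.WordCount itself, so this is just the standard Hadoop (new-API) word count; the mapper and reducer names are illustrative, not the asker's actual code. The point is that the jar's main class, not Pig, wires up the mapper, reducer, input and output, and receives inputDir and outputDir as the arguments Pig appends in the backquoted part of the MAPREDUCE statement.

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Illustrative mapper: emits (token, 1) for every word in a line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Illustrative reducer: sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // It is this driver, not Pig, that decides which classes act as
        // mapper and reducer.
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] and args[1] are the inputDir and outputDir that Pig passes
        // via the backquoted part of the MAPREDUCE statement (or that you
        // pass yourself when running `hadoop jar`).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once a jar like this is built and copied to HDFS, the MAPREDUCE statement above can invoke it unchanged.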

@Fred Hi, one question: I was able to execute the MR job, but the BIG problem is that it first copies the input into the folder 'inputDir' and only then executes the MapReduce job (here Wordcount.jar). Copying big data like that is time-consuming and inefficient. Can you suggest an alternative that does not copy the data but still uses the MapReduce code? –

I am not sure whether STORE A INTO 'inputDir' is mandatory. If not, just skip it. If it is, just copy some small dummy data to that location, but read from your real/large input in your mapreduce program. – Frederic

Thanks @Fred, that solved my problem, although I could not avoid the store step. I do wonder whether this technique would still work if I had implemented my own Pig loader and read the data through it with the LOAD command; it would be a bonus if there were some alternative way of feeding data to MapReduce through a Pig loader. Thanks for your help!! –

By default, Pig generates its own map/reduce plans. However, Hadoop comes with default mapper/reducer implementations, and those are what Pig uses when map/reduce classes are not specified.

Furthermore, Pig uses Hadoop's properties along with its own specific ones. Try setting the properties below in your Pig script; they should be picked up by Pig as well.

SET mapred.mapper.class="<fully qualified classname for mapper>" 
SET mapred.reducer.class="<fully qualified classname for reducer>" 

The same can also be set using the -Dmapred.mapper.class option. Depending on the Hadoop version you have installed (full list here), the properties might instead be:

mapreduce.map.class 
mapreduce.reduce.class 

FYI... the Hadoop mapred API has been deprecated. Versions prior to 0.20.1 used mapred; versions after that use mapreduce.
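
To make the difference concrete, here is a hedged side-by-side sketch of the same tokenizing mapper under both APIs; the class names are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred, pre-0.20.1 style): Mapper is an
// interface, usually implemented by extending MapReduceBase, and results
// are emitted through an OutputCollector.
public class OldApiMapper extends MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            output.collect(new Text(itr.nextToken()), new IntWritable(1));
        }
    }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class and results are
// emitted through a Context.
class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            context.write(new Text(itr.nextToken()), new IntWritable(1));
        }
    }
}

The mapred.* property names above belong to the older API and the mapreduce.* names to the newer one, which is why the applicable names depend on the Hadoop version installed.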

Pig also has its own set of properties, which can be listed with the command pig -help properties.

e.g. in my pig installation, below are the properties: 

The following properties are supported: 
    Logging: 
     verbose=true|false; default is false. This property is the same as -v switch 
     brief=true|false; default is false. This property is the same as -b switch 
     debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch 
     aggregate.warning=true|false; default is true. If true, prints count of warnings 
      of each type rather than logging each warning. 
    Performance tuning: 
     pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). 
      Note that this memory is shared across all large bags used by the application. 
     pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory). 
      Specifies the fraction of heap available for the reducer to perform the join. 
     pig.exec.nocombiner=true|false; default is false. 
      Only disable combiner as a temporary workaround for problems. 
     opt.multiquery=true|false; multiquery is on by default. 
      Only disable multiquery as a temporary workaround for problems. 
     opt.fetch=true|false; fetch is on by default. 
      Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs. 
     pig.tmpfilecompression=true|false; compression is off by default. 
      Determines whether output of intermediate jobs is compressed. 
     pig.tmpfilecompression.codec=lzo|gzip; default is gzip. 
      Used in conjunction with pig.tmpfilecompression. Defines compression type. 
     pig.noSplitCombination=true|false. Split combination is on by default. 
      Determines if multiple small files are combined into a single map. 
     pig.exec.mapPartAgg=true|false. Default is false. 
      Determines if partial aggregation is done within map phase, 
      before records are sent to combiner. 
     pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10. 
      If the in-map partial aggregation does not reduce the output num records 
      by this factor, it gets disabled. 
    Miscellaneous: 
     exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch 
     pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command. 
     udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF. 
     stop.on.failure=true|false; default is false. Set to true to terminate on the first error. 
     pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host. 
      Determines the timezone used to handle datetime datatype and UDFs.
    Additionally, any Hadoop property can be specified.