Specifying additional jars in an AWS EMR custom JAR application

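I am trying to run a Hadoop job on an EMR cluster. It runs as a Java command, for which I use a jar-with-dependencies. The job pulls data from Teradata, and I assume the Teradata-related jars are also packaged in the jar-with-dependencies. However, I still get this exception: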

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver 
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171)

My pom has the following relevant dependencies:

<dependency> 
    <groupId>teradata</groupId> 
    <artifactId>terajdbc4</artifactId> 
    <version>14.10.00.17</version> 
</dependency> 

<dependency> 
    <groupId>teradata</groupId> 
    <artifactId>tdgssconfig</artifactId> 
    <version>14.10.00.17</version> 
</dependency>

I package the full jar as follows:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <compilerArgument>-Xlint:-deprecation</compilerArgument>
            </configuration>
        </plugin>

        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.2.1</version>

            <configuration>
                <descriptors>
                </descriptors>
                <archive>
                    <manifest>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>

            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

The assembly.xml file:

<assembly>
    <id>aws-emr</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <includes>
            </includes>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>

The EMR command I run:

aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate

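Is there a way to specify the Teradata jars when the map-reduce job is executed, so that they are added to the classpath?

EDIT: I confirmed that the missing class is packaged in the jar-with-dependencies.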

aws-emr$ jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep TeraDriver 
com/ncr/teradata/TeraDriver.class 
com/teradata/jdbc/TeraDriver.class
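The check above can be wrapped in a small helper so that it matches the exact class entry rather than any path containing "TeraDriver". This is only a sketch: the function names `class_to_entry` and `jar_has_class` are made up for illustration, and it assumes the JDK `jar` tool is on the PATH, as in the command above.

```shell
#!/bin/sh
# Convert a fully qualified class name to its entry path inside a jar,
# e.g. com.teradata.jdbc.TeraDriver -> com/teradata/jdbc/TeraDriver.class
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr . /)"
}

# Succeed (exit status 0) only if the jar listing contains exactly that entry.
# $1 = path to the jar, $2 = fully qualified class name.
jar_has_class() {
    jar tf "$1" | grep -qxF "$(class_to_entry "$2")"
}

# Example usage (not executed here):
#   jar_has_class target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
#       com.teradata.jdbc.TeraDriver && echo "driver class is packaged"
```

Note that this only shows the class is inside the jar file; it says nothing about whether the class is visible on the classpath at run time, which is the actual problem here.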

Source

2017-03-09 Nik

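I have not fully solved this, but I found a way to make the job work. The ideal solution would be to package the Teradata jars inside the uber jar. That packaging still happens, but the jars do not get added to the classpath, and I am not sure why.

I worked around it by building two separate jars: one for my own code and another containing all the required dependencies. I uploaded both jars to S3 and then wrote a script that does the following (pseudocode; the values in angle brackets are placeholders):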

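# download main jar
aws s3 cp <s3-path-to-myjar.jar> .

# download dependency jar into a temp directory
mkdir -p temp jars
aws s3 cp <s3-path-to-dependency-jar> temp/dependencies.jar

# unzip the dependencies jar into another directory (say `jars`);
# the pattern is quoted so that unzip, not the shell, expands it
unzip -j temp/dependencies.jar '<path-within-jar-to-unzip>/*' -d jars

# comma-separated jar list for -libjars (no trailing comma)
LIBJARS=$(find jars -name '*.jar' | paste -sd, -)

# the same list, colon-separated, for the client-side classpath
HADOOP_CLASSPATH=$(echo "${LIBJARS}" | tr ',' ':')
CLASSPATH=$HADOOP_CLASSPATH
export CLASSPATH HADOOP_CLASSPATH

# run via hadoop command; -libjars is only honored if the driver parses
# generic options (ToolRunner / GenericOptionsParser)
hadoop jar myjar.jar com.my.package.EventsPullerMR -libjars "${LIBJARS}" <arguments to the job>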

This starts the job.
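The answer does not say how the script itself is launched on the cluster. One possibility, which is an assumption and not something the answer confirms, is to upload the script to S3 and run it as a CUSTOM_JAR step through EMR's script-runner.jar. The sketch below only builds the value for `--steps`; the function name `build_script_step`, the region, and the script location are hypothetical.

```shell
#!/bin/sh
# Build a --steps value that runs a script via EMR's script-runner.jar.
# $1 = AWS region of the cluster, $2 = S3 URI of the uploaded script.
# Both arguments are placeholders chosen for illustration.
build_script_step() {
    printf 'Type=CUSTOM_JAR,Name=EventsPuller,ActionOnFailure=TERMINATE_CLUSTER,'
    printf 'Jar=s3://%s.elasticmapreduce/libs/script-runner/script-runner.jar,' "$1"
    printf 'Args=[%s]\n' "$2"
}

# Print the value that would be passed to `aws emr create-cluster --steps ...`
build_script_step us-east-1 s3://my-bucket/scripts/run-events-puller.sh
```

The printed value would replace the `--steps` argument in the create-cluster command from the question; the script then downloads the jars and calls `hadoop jar` on the master node.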

Source

2017-03-15 02:36:44 Nik
