
Apache Spark NumberFormatException with file content: my Spark application works perfectly fine in local mode, but running it on the cluster throws the following exception while parsing a date field of the form "YYYY-MM-DD HH:MM:SS":

15/02/05 16:56:04 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, kmobd-dnode2.qudosoft.de): java.lang.NumberFormatException: For input string: ".1244E.1244E22" 
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) 
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110) 
    at java.lang.Double.parseDouble(Double.java:538) 
    at java.text.DigitList.getDouble(DigitList.java:169) 
    at java.text.DecimalFormat.parse(DecimalFormat.java:2056) 
    at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162) 
    at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514) 
    at java.text.DateFormat.parse(DateFormat.java:364) 
    at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:24) 
    at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:10) 
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) 
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
    at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365) 
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211) 
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
    at org.apache.spark.scheduler.Task.run(Task.scala:56) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

What I don't understand is that the value ".1244E.1244E22" does not exist anywhere in my data. I am using Apache Spark 1.2.0 with Cloudera Manager, CDH 5.3.0 and Hadoop 2.5.0.

This is my pom.xml:

</dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.5.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>6.1.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <!-- get all project dependencies -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <!-- mainClass in the manifest makes an executable jar -->
                <archive>
                    <manifest>
                        <mainClass>de.qudosoft.bd.econda.userjourneymapper.Main</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the packaging phase -->
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

Has anyone run into a similar problem?

Answer


The problem is most likely that your parser is defined at the static/object-instance level. SimpleDateFormat is not thread-safe, so its internal state gets corrupted when competing threads use it at the same time; a minimal reconstruction of that pattern is sketched below.
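For illustration only, here is a rough sketch of what such a shared-formatter mapper could look like. The original ClassifingMapper source is not shown, so the field layout, the split character and the exact date pattern are assumptions:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.spark.api.java.function.PairFunction;

    import scala.Tuple2;

    // Hypothetical reconstruction of the problematic pattern: one SimpleDateFormat
    // held in a static field is reused by every executor thread running call().
    public class ClassifingMapper implements PairFunction<String, String, Date> {

        // NOT thread-safe: all task threads share this single instance.
        private static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        @Override
        public Tuple2<String, Date> call(String line) throws Exception {
            String[] fields = line.split(";");          // assumed field layout
            Date timestamp = FORMAT.parse(fields[1]);   // concurrent parse() corrupts internal state
            return new Tuple2<>(fields[0], timestamp);
        }
    }

In local mode there is often effectively one thread per partition set touching the formatter, so the corruption never shows up; on the cluster, several task threads in the same executor JVM hit it concurrently.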

Try moving the parser construction down to the function level, right before it is used. That is neither elegant nor efficient, but it should prove the point (see the sketch below).
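A minimal version of that workaround, keeping the assumed field layout from the sketch above:

    @Override
    public Tuple2<String, Date> call(String line) throws Exception {
        // A fresh formatter per call: wasteful, but no shared mutable state remains.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        String[] fields = line.split(";");
        Date timestamp = format.parse(fields[1]);
        return new Tuple2<>(fields[0], timestamp);
    }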

You could also try putting a mutex around the parse call and see if that helps (a sketch follows). Profile/test it both ways and see which works better for you.
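A sketch of the mutex variant, keeping the shared instance but serializing access to it:

    private static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    private static Date parseSafely(String value) throws java.text.ParseException {
        // Only one thread at a time may touch the shared formatter.
        synchronized (FORMAT) {
            return FORMAT.parse(value);
        }
    }

A ThreadLocal<SimpleDateFormat>, or the immutable and thread-safe java.time.format.DateTimeFormatter available on Java 8, is a common compromise that avoids both the per-call allocation and the lock contention.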

Good luck!
