
I am new to Spark and machine learning, so for practice I am trying to write a k-means algorithm in Spark 1.6.0 against a dataset. I followed the instructions in the example on the Apache Spark website, and while doing so I hit this java.lang.NumberFormatException in Spark Scala:

java.lang.NumberFormatException: For input string: "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253"

Here is my code and the point where I get the error:

scala> val rdd = sc.textFile("/user/rohitchopra32_gmail/Project2_Dataset") 
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:28 
scala> rdd.count() 
res0: Long = 459540                
scala> val parsedData = rdd.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() 
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[2] at map at <console>:30 

scala> parsedData.count() 
17/08/21 16:22:50 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 10, ip-172-31-58-214.ec2.internal): java.lang.NumberFormatException: For input string: "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253" 

Full stack trace:

scala> parsedData.count() 
17/08/21 16:22:50 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 10, ip-172-31-58-214.ec2.internal): java.lang.NumberFormatException: For input string: "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253" 
     at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) 
     at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110) 
     at java.lang.Double.parseDouble(Double.java:538) 
     at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) 
     at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(<console>:30) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(<console>:30) 
     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) 
     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) 
     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30) 
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
     at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:283) 
     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) 
     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
     at org.apache.spark.scheduler.Task.run(Task.scala:89) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Driver stacktrace: 
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
     at scala.Option.foreach(Option.scala:236) 
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
     at org.apache.spark.rdd.RDD.count(RDD.scala:1143) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40) 
     at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42) 
     at $iwC$$iwC$$iwC$$iwC.<init>(<console>:44) 
     at $iwC$$iwC$$iwC.<init>(<console>:46) 
     at $iwC$$iwC.<init>(<console>:48) 
     at $iwC.<init>(<console>:50) 
     at <init>(<console>:52) 
     at .<init>(<console>:56) 
     at .<clinit>(<console>) 
     at .<init>(<console>:7) 
     at .<clinit>(<console>) 
     at $print(<console>) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:497) 
     at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) 
     at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) 
     at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) 
     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) 
     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) 
     at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) 
     at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) 
     at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) 
     at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) 
     at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) 
     at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) 
     at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) 
     at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) 
     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) 
     at org.apache.spark.repl.Main$.main(Main.scala:31) 
     at org.apache.spark.repl.Main.main(Main.scala) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:497) 
     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) 
     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) 
     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) 
     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) 
     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: java.lang.NumberFormatException: For input string: "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253" 
     at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) 
     at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110) 
     at java.lang.Double.parseDouble(Double.java:538) 
     at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) 
     at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(<console>:30) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(<console>:30) 
     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) 
     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) 
     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30) 
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
     at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:283) 
     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) 
     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
     at org.apache.spark.scheduler.Task.run(Task.scala:89) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 

Here is the link to the dataset: dataset

Please help me fix this error. Thanks in advance.


You are splitting on spaces, but your data looks like a CSV delimited by commas – glefait


What are you trying to do? The expression s.split(' ').map(_.toDouble) splits each line on spaces and tries to convert every resulting element to a double. That does not match the input data, which is comma-separated and contains several data types: "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253" – Harald

Answer


Looking at your input data

2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253 

the only fields that can be parsed as doubles are the last two, the latitude 33.6894754264 and the longitude -117.543308253; those are the values you want to convert to double and feed to Vectors.dense for machine learning.
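
For example, splitting one sample line on commas in the spark-shell (a quick illustrative check, not part of the original session) gives five tokens, and only the last two of them parse as doubles:

scala> val line = "2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253"
scala> line.split(',')
res0: Array[String] = Array(2014-03-15:10:10:20, Sorrento, 8cc3b47e-bd01-4482-b500-28f2342679af, 33.6894754264, -117.543308253)
scala> line.split(',').takeRight(2).map(_.toDouble)
res1: Array[Double] = Array(33.6894754264, -117.543308253)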

If that is the case, then you should do

val parsedData = rdd.map(s => Vectors.dense(s.split(',').takeRight(2).map(_.toDouble))).cache() 

If you do Vectors.dense(s.split(' ').map(_.toDouble)) instead, the line contains no space at all, so split(' ') returns the entire line 2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253 as a single token, and trying to convert that token to a double is exactly what throws the NumberFormatException.
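
Putting it together, here is a minimal end-to-end sketch (assuming every row has the five comma-separated fields shown above and that you want to cluster on the coordinate pair; k = 4 and 20 iterations are arbitrary placeholder values):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each line: timestamp,name,uuid,latitude,longitude
val rdd = sc.textFile("/user/rohitchopra32_gmail/Project2_Dataset")

// Keep only well-formed five-field rows, then build vectors from the
// last two fields (latitude, longitude) -- the only numeric columns.
val parsedData = rdd
  .map(_.split(','))
  .filter(_.length == 5)
  .map(fields => Vectors.dense(fields.takeRight(2).map(_.toDouble)))
  .cache()

// Train k-means on the coordinate vectors.
val model = KMeans.train(parsedData, 4, 20)

model.clusterCenters.foreach(println)
println("WSSSE: " + model.computeCost(parsedData))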

Hope this answer helps.