2017-07-19 123 views
0

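IndexOutOfBounds error in Zeppelin

I've run into a problem in Zeppelin: whenever I try to run a SQL operation on a temp table (DataFrame) that I created, I always get an IndexOutOfBounds error.

Here is my code: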

import org.apache.commons.io.IOUtils 
import java.net.URL 
import java.nio.charset.Charset 
import org.apache.spark.sql.SparkSession 
//import sqlContext._ 

val realdata = sc.textFile("/root/application.txt") 

case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, sensor1: String, sensor2: String, sensor3: String, sensor4: String, sensor5: String, sensor6: String, sensor7: String, sensor8: String, batchsize: String, troepje1: String, troepje2: String) 

val mapData = realdata 
.filter(line => line.contains("data") && line.contains("INFO")) 
.map(s => s.split(" ").toList) 
.map(
s => testClass(s(0), 
s(1).split(",")(0), 
s(1).split(",")(1), 
s(3), 
s(4), 
s(5), 
s(6), 
s(7), 
s(8), 
s(15), 
s(16), 
s(17), 
s(18), 
s(19), 
s(20), 
s(21), 
s(22), 
"", 
"", 
"" 
) 
).toDF 
//mapData.count() 
//mapData.printSchema() 
mapData.registerTempTable("temp_carefloor") 

Then, in the next notebook paragraph, I try something simple like:

%sql 
select * from temp_carefloor limit 10 

I get the following error:

java.lang.IndexOutOfBoundsException: 18 
    at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) 
    at scala.collection.immutable.List.apply(List.scala:84) 
    at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:84) 
    at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:72) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) 
    at org.apache.spark.scheduler.Task.run(Task.scala:99) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:748) 

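I'm sure it has something to do with the way my data is laid out, but I just can't figure out what I'm doing wrong, and I'm really banging my head against the wall here. I really hope someone can help me.

Edit: here is an excerpt of the useful data I'm trying to extract.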

2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false] 
2016-03-10 07:18:58,992 INFO [http-nio-8080-exec-7] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false] 
2016-03-10 07:18:59,907 INFO [http-nio-8080-exec-4] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false] 
2016-03-10 07:19:10,418 INFO [http-nio-8080-exec-9] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [true, true, false, false, false, false, false, false] 

You can see the complete flat file here: http://upload.grecom.nl/uploads/jeffrey/application.txt

+1

There is definitely a problem with your data. Could you provide a sample so we can take a look? –

+0

I've edited my question so you can see the data and the complete flat file. Thanks for that. – Jdeboer

+1

The first thing I noticed is that when you split your line on " ", you end up treating 'tile:' and '=' as fields, because they are surrounded by spaces. I think that is a problem for you? –

Answers

2

So, as we discussed in the comments, the problem is in how the data is split: you can't split the data on " ".
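To make the point concrete, here is a minimal sketch in plain Scala (no Spark needed) that splits one line from the question's excerpt on " " and looks at what comes out. The `shortLine` value is a hypothetical example, not taken from the real file; it only illustrates that any line containing both "data" and "INFO" passes the filter, whatever its shape. The 18 in the exception message is presumably the index that was requested, i.e. `s(18)`, which would mean at least one filtered line has fewer than 19 tokens. Because Spark transformations are lazy, the failing `map` only runs when the `%sql` paragraph triggers the job, which would explain why the error shows up there.

```scala
// Sample line copied from the excerpt in the question.
val sample = "2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false] "

// Same split as in the question.
val tokens = sample.split(" ").toList
val tokenCount = tokens.length

// Tokens 11 and 14 are the literals "tile:" and "=", not data,
// and token 12 still carries its trailing comma.
val noise = List(tokens(11), tokens(14))
val tileToken = tokens(12)

// Hypothetical shorter line (an assumption, not from the real file)
// that still contains both "data" and "INFO".
val shortLine = "2016-03-10 07:18:58,985 INFO no sensor data here"
val shortTokens = shortLine.split(" ").toList

// lift returns None instead of throwing when the index is out of range,
// whereas shortTokens(18) would throw IndexOutOfBoundsException.
val safe = shortTokens.lift(18)
```

Counting tokens per line this way (or using `lift` instead of direct indexing) is one way to locate the lines that do not fit the expected layout.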

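One solution is to split the data with a regular expression like this: " data = |tile: |[|]| |,"

You have to include every delimiter that you don't want to end up in the extracted fields, as I did in the regex (even substrings such as " data = ").

Hope this helps. Best regards.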
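A rough sketch of how such a regex split might look in plain Scala follows. Several things here are assumptions rather than part of the answer: the square brackets are escaped as `\\[` and `\\]`, because an unescaped `[|]` is parsed as a character class matching a literal `|`; empty strings produced by adjacent delimiters are filtered out; and `FloorUpdate` and `parse` are hypothetical names, with a trimmed-down record instead of the question's 20-field `testClass`. The expected field count of 21 is derived from the sample line only.

```scala
// Hypothetical, reduced record type for illustration.
case class FloorUpdate(date: String, time: String, level: String,
                       floor: String, tile: String, sensors: List[Boolean])

// The answer's delimiter list, with the square brackets escaped.
val delimiters = " data = |tile: |\\[|\\]| |,"

// Returns None for any line that does not have the expected shape,
// instead of throwing IndexOutOfBoundsException.
def parse(line: String): Option[FloorUpdate] = {
  // Adjacent delimiters yield empty strings, so drop them.
  val f = line.split(delimiters).filter(_.nonEmpty).toList
  val sensors = f.drop(13)
  val wellFormed = f.length == 21 &&
    sensors.forall(s => s == "true" || s == "false")
  if (wellFormed)
    Some(FloorUpdate(f(0), f(1), f(3), f(8), f(12), sensors.map(_.toBoolean)))
  else None
}

// Sample line copied from the excerpt in the question.
val sample = "2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false] "

val parsed = parse(sample)
// Hypothetical malformed line: contains "data" and "INFO" but no sensor array.
val rejected = parse("2016-03-10 07:18:58,985 INFO no sensor data here")
```

In the notebook, something like `realdata.flatMap(line => parse(line))` could presumably replace the two `map` steps, so that lines which do not match are skipped rather than failing the whole job.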

+0

Thanks, I'll try rewriting it. – Jdeboer

+1

@Jdeboer once you have the right expression, just update the answer with the correction –

+0

I will definitely do that. I'm just struggling to find the right regex. – Jdeboer