2017-04-07

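Spark throws java.lang.NullPointerException when mapping RDD values with a Java phonetic matching library and null values

I have an RDD that I built from a DataFrame using map: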

import org.apache.spark.sql.Row

case class Record(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String)

val rdd = df.map {
  case Row(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String) =>
    Record(id_1, fnam_1, lnam_1, id_2, fnam_2, lnam_2)
}

I then run a filter on this RDD using a Java phonetic matching library, as shown below:

import edu.ualr.oyster.utilities.DoubleMetaphone 

def matchFirstName(rec: Record) = {
  val s1 = Option(rec.fnam_1).getOrElse("")
  val s2 = Option(rec.fnam_2).getOrElse("")
  if (s1.isEmpty || s2.isEmpty)
    false
  else
    new DoubleMetaphone().compareDoubleMetaphone(s1, s2)
}

val rdd_filtered = rdd.filter(matchFirstName(_)) 

When I run this, I get an NPE:

17/04/06 19:06:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 160, my.work.cluster.com): java.lang.NullPointerException 
    at edu.ualr.oyster.utilities.DoubleMetaphone.compareDoubleMetaphone(DoubleMetaphone.java:1020) 
    at funpackage.EntityResolution$.phoneticMatching(EntityResolution.scala:106) 
    at esurance.EntityResolution$.esurance$EntityResolution$$matchNames$1(EntityResolution.scala:118) 
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137) 
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137) 
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) 
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
    at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) 
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) 
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:88) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:745) 

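I tried the phonetic matching on a single pair of strings in my project and it worked without problems. I have also used the same library from Spark SQL, wrapped in a user-defined function, without any issue. I suspect the problem is caused by some of my values being missing (null), but I tried to handle that with Option. Any idea why this is failing?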

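While the Scala isn't very idiomatic, I don't see anything obvious that would produce an NPE. Have you checked which values of `s1` and `s2` trigger the exception? Have you looked at line 1020 of `DoubleMetaphone` to see what happens there? For example, does `DoubleMetaphone` do something with empty strings that produces a `null`? I'm not saying that is what is happening, but I think you have plenty of avenues to investigate. – Vidya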

Answer

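I did not dig into the edu.ualr.oyster library to confirm whether it is what causes the exception, but that seems to be the case. I switched to the org.apache.commons.codec.language library (which provides the same Double Metaphone functionality) and the program runs in Spark without problems.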
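For anyone making the same switch, here is a minimal sketch of what the replacement might look like. It is not the exact code from my job. It assumes commons-codec is on the classpath, and the object name `PhoneticMatch` and the helper `clean` are made up for illustration. The null and blank guard stays in because, as far as I can tell, commons-codec's `doubleMetaphone` returns null for null or blank input, so whitespace-only names are better filtered out up front.

```scala
import org.apache.commons.codec.language.DoubleMetaphone

// Hypothetical helper object; the name is illustrative only.
object PhoneticMatch {

  // Turn null, empty, and whitespace-only names into None.
  private def clean(s: String): Option[String] =
    Option(s).map(_.trim).filter(_.nonEmpty)

  // True only when both names are present and their primary
  // Double Metaphone encodings are equal.
  def matchFirstName(fnam1: String, fnam2: String): Boolean = {
    // Created per call to keep the sketch simple.
    val encoder = new DoubleMetaphone()
    (clean(fnam1), clean(fnam2)) match {
      case (Some(a), Some(b)) => encoder.isDoubleMetaphoneEqual(a, b)
      case _                  => false
    }
  }
}
```

With this in place, the filter from the question would become something like `rdd.filter(rec => PhoneticMatch.matchFirstName(rec.fnam_1, rec.fnam_2))`.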