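I have a dataset, and I want to extract the review/text values of the records whose review/time lies between x and y, for example (1183334400 < time < 1185926400), using an RDD filter in Spark with Scala.
Here is a sample of my data: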
product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the two pairs of stockings .
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: AUIZ1GNBTG5OB
review/profileName: dgodoy
review/helpfulness: 1/1
review/score: 2.0
review/time: 1287014400
review/summary: sizes recomended in the size chart are not real
review/text: sizes are much smaller than what is recomended in the chart. I tried to put it and sheer it!.
My Spark Scala code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)
    val conf: Configuration = new Configuration
    // Split the input into records on this delimiter instead of on newlines
    conf.set("textinputformat.record.delimiter", "product/title:")
    val input1 = sc.newAPIHadoopFile("data/Electronics.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    val lines = input1.map { text => text._2 }
    // Attempted filter: keep records whose review/time is between startdate and enddate.
    // This line is pseudocode and does not compile; startdate and enddate are not defined.
    val filt = lines.filter(text=>(text.toString.contains(tt => tt in (startdate until enddate))))
    filt.saveAsTextFile("data/filter1")
  }
}
However, my code does not work correctly.
How can I filter these records?
I don't see the delimiter string "product/productId:" in the input file. – ipoteka
What output do you expect, and what problem are you facing? – maasg
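One possible approach, offered as a minimal sketch rather than a definitive answer: split the file into records on "product/productId:" (the line that opens every record in the sample), parse review/time as a Long, and keep review/text only when the time lies strictly inside the window. The object name `ReviewTimeFilter` and the helpers `field` and `textIfInRange` are hypothetical names introduced for illustration. The sketch assumes every field sits on its own line as `key: value`, as in the sample, and that Spark and the Hadoop client are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object ReviewTimeFilter {
  // Hypothetical helper: value of the first line in `record` that starts with `key`.
  def field(record: String, key: String): Option[String] =
    record.split("\n").map(_.trim).find(_.startsWith(key)).map(_.drop(key.length).trim)

  // Hypothetical helper: review/text of the record if start < review/time < end.
  // Records with a missing or non-numeric review/time are dropped.
  def textIfInRange(record: String, start: Long, end: Long): Option[String] =
    for {
      time <- field(record, "review/time:").flatMap(t => Try(t.toLong).toOption)
      if start < time && time < end
      text <- field(record, "review/text:")
    } yield text

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("golabi1").setMaster("local"))
    val hadoopConf = new Configuration()
    // Assumption: every record begins with "product/productId:", so split there.
    hadoopConf.set("textinputformat.record.delimiter", "product/productId:")

    // Window taken from the example in the question.
    val startdate = 1183334400L
    val enddate = 1185926400L

    val texts = sc
      .newAPIHadoopFile("data/Electronics.txt", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], hadoopConf)
      .map { case (_, text) => text.toString } // copy to String; Hadoop reuses Text objects
      .flatMap(record => textIfInRange(record, startdate, enddate).iterator)

    texts.saveAsTextFile("data/filter1")
    sc.stop()
  }
}
```

Note that none of the three sample records has a review/time inside the example window (1174435200 is below it, 1352505600 and 1287014400 are above it), so on the sample alone the output would be empty. The text before the first delimiter forms an empty record, which `textIfInRange` discards because it has no review/time line.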