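Unable to read text file from local file path - Spark CSV reader
We are using the Spark CSV reader to read a CSV file and convert it to a DataFrame. We run the job in yarn-client mode; it works fine in local mode.
We submit the Spark job from an edge node.
However, when we place the file on a local file path instead of in HDFS, we get a file-not-found exception.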
Code:
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("inferSchema", "true")
.load("file:/filepath/file.csv")
We also tried file:///, but we still get the same error.
Error log:
2016-12-24 16:05:40,044 WARN [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
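A small sketch, not part of the original post, of why switching between file:/ and file:/// would not be expected to change the outcome: both spellings parse to the same local path. The object name FileUriCheck and the helper localPath are hypothetical; the path is the one shown in the error log. The trace above shows the exception raised inside an executor task (Executor$TaskRunner) through RawLocalFileSystem, so the path was looked up on that executor host's local disk, which suggests the file may simply be absent on that node.

```scala
import java.net.URI

object FileUriCheck {
  // Return the filesystem path component that a file: URI names.
  def localPath(uri: String): String = new URI(uri).getPath

  def main(args: Array[String]): Unit = {
    val single = localPath("file:/shared/sample1.csv")
    val triple = localPath("file:///shared/sample1.csv")
    println(single)
    println(triple)
    // Both spellings name the same path, so the URI form is not the variable here.
    println(single == triple)
  }
}
```

If the two paths match, the remaining question is whether that path exists on every node that runs an executor, not how the URI is written.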
Does the file exist at that location? – mrsrinivas
@mrsrinivas: Yes, it is available. That is why the job works fine when I run it in local mode on the YARN cluster; it only fails in yarn-client mode. – Shankar
Under normal circumstances it should work the way you tried it. However, if the aim is just to get it working, try [SparkFiles](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/SparkFiles.html). In your case, something like this:
import org.apache.spark.SparkFiles
// addFile is an instance method, so call it on the active SparkContext
sqlContext.sparkContext.addFile("file:/filepath/file.csv")
println(SparkFiles.getRootDirectory())
println(SparkFiles.get("file.csv"))
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("inferSchema", "true")
.load(SparkFiles.get("file.csv"))
–