
I am new to Hadoop and have just started trying to connect to HDFS from Scala and Spark, but I don't know what is wrong with my configuration; I get an HDFS connection error from Scala. Please help me fix it and understand what is going on.

Hadoop Version is 2.7.3 
Scala Version is 2.12.1 
Spark Version is 2.1.1 

pom.xml (dependencies)

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.7.3</version>
    </dependency>

Scala code:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SparkHDFS {

    def getDataFromHdfs(): Unit = {
    // connect to the local HDFS namenode and open the file
    val hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), new Configuration)
    val file = new Path("rdd/insurance.csv")
    val stream = hdfs.open(file)
    println(stream.readLine())
    }

    def main(arr: Array[String]): Unit = {
    getDataFromHdfs()
    }
}

Exception on the console:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). 
log4j:WARN Please initialize the log4j system properly. 
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated 
    at java.util.ServiceLoader.fail(ServiceLoader.java:232) 
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185) 
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) 
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) 
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480) 
    at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2400) 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) 
    at com.sample.sparkscala.SparkHDFS$.getDataFromHdfs(SparkHDFS.scala:11) 
    at com.sample.sparkscala.SparkHDFS$.main(SparkHDFS.scala:18) 
    at com.sample.sparkscala.SparkHDFS.main(SparkHDFS.scala) 
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration$DeprecationDelta 
    at org.apache.hadoop.hdfs.HdfsConfiguration.addDeprecatedKeys(HdfsConfiguration.java:66) 
    at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:31) 
    at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:116) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
    at java.lang.Class.newInstance(Class.java:442) 
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) 
    ... 12 more 
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration$DeprecationDelta 
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
    ... 21 more 

Are you running your Scala code from Eclipse? Try building a jar and submitting it with the "hadoop jar ..." command; that way it will pick up all the Hadoop-related library jars from the classpath. –

Answer


That is not how you read a file in Spark. Spark has built-in support for CSV files.

Follow one of the tutorials on reading HDFS files from Spark.

Here is how it looks in Spark:

val df = spark.read 
     .format("csv") 
     .option("header", "true")        // first line is the header row
     .option("mode", "DROPMALFORMED") // drop rows that fail to parse
     .csv("hdfs://localhost:9000/rdd/insurance.csv") 

You also need to use Scala 2.11.x rather than 2.12.x, since the spark-core_2.11 artifact is built against Scala 2.11.

Hope this simple example helps you read the CSV file!


Is "spark" an object of the Spark context? If it is, I get a compile-time error saying "read" is not a member of org.apache.spark.SparkContext. –


Could you point me to where I can learn more about connecting to HDFS from Spark? –


spark is an object of SparkSession: val spark = SparkSession.builder().master("local").appName("test").getOrCreate() –
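
Putting the answer and the comments together, here is a minimal self-contained sketch of the SparkSession approach. It assumes the spark-sql_2.11 artifact (which provides SparkSession and spark.read) is also on the classpath, and that the file actually lives at hdfs://localhost:9000/rdd/insurance.csv; adjust the path and master as needed. The object name ReadCsvFromHdfs is just for illustration.

import org.apache.spark.sql.SparkSession

object ReadCsvFromHdfs {
    def main(args: Array[String]): Unit = {
    // spark is a SparkSession, not a SparkContext, so .read is available
    val spark = SparkSession.builder()
        .master("local[*]")
        .appName("read-insurance-csv")
        .getOrCreate()

    val df = spark.read
        .format("csv")
        .option("header", "true")        // first line is the header row
        .option("mode", "DROPMALFORMED") // drop rows that fail to parse
        .csv("hdfs://localhost:9000/rdd/insurance.csv")

    df.printSchema()
    df.show(5)

    spark.stop()
    }
}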