
Spark will not run the final `saveAsNewAPIHadoopFile` method in yarn-cluster mode

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from HDFS and also saved back into HDFS.

When I run the application in yarn-client mode, everything seems to work fine.

But when I try to run it as a yarn-cluster application, the process does not seem to run the final saveAsNewAPIHadoopFile action on my transformed and ready-to-save RDD!

Here is a snapshot of my Spark UI, where you can see that all the other jobs are processed:

[Screenshot: Spark UI jobs overview]

and the corresponding stages:

[Screenshot: Spark UI stages overview]

Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:

// RDD of row keys/cells, already transformed and sorted for HFile output
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ... 

try { 
    // Kerberos login + HBase connection (custom helper class, see sketch below)
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userprincipal", "/etc/security/keytabs/user.keytab"); 
    Configuration baseConf = c.getConfiguration(); 
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST); 
    baseConf.set("zookeeper.znode.parent", "/hbase-secure"); 

    Job job = Job.getInstance(baseConf, "Test Bulk Load"); 
    HTable table = new HTable(baseConf, "map_data"); 

    HBaseAdmin admin = new HBaseAdmin(baseConf); 
    // Configure partitioning/sorting so the HFiles match the table's regions
    HFileOutputFormat2.configureIncrementalLoad(job, table); 
    Configuration conf = job.getConfiguration(); 

    // Write the prepared cells as HFiles to HDFS -- this is the step that
    // never seems to run in yarn-cluster mode
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf); 
    System.out.println("Finished!!!!!"); 
} catch (IOException e) { 
    e.printStackTrace(); 
    System.out.println(e.getMessage()); 
} 
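For context: HBaseKerberos.createHBaseConnectionKerberized is a custom helper whose implementation is not part of the question. A minimal sketch of what such a helper could look like, assuming the standard UserGroupInformation keytab login (this is an assumption, not the asker's actual code):

import java.io.IOException; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.Connection; 
import org.apache.hadoop.hbase.client.ConnectionFactory; 
import org.apache.hadoop.security.UserGroupInformation; 

public class HBaseKerberos { 

    // Sketch only: logs in from a keytab and returns a kerberized HBase
    // connection. Note that the keytab file must exist locally on the
    // machine where this code runs -- which turns out to be the crux of
    // the Kerberos answer below.
    public static Connection createHBaseConnectionKerberized(String principal, String keytabPath) throws IOException { 
        Configuration conf = HBaseConfiguration.create(); 
        conf.set("hadoop.security.authentication", "kerberos"); 
        conf.set("hbase.security.authentication", "kerberos"); 
        UserGroupInformation.setConfiguration(conf); 
        UserGroupInformation.loginUserFromKeytab(principal, keytabPath); 
        return ConnectionFactory.createConnection(conf); 
    } 
} 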

I submit the application via:

spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar

When I look at the output directory in HDFS, it is still empty after running the application! I am using Spark 1.6.3 on the HDP 2.5 platform.

So I have two questions here: Where does this behavior come from (is it maybe a memory problem)? And what is the difference between yarn-client and yarn-cluster mode (I haven't fully understood it yet, and the documentation isn't clear on it)? Thanks for your help!

Answers


I found out that this problem is related to a Kerberos issue! When running the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server runs. Therefore, the userprincipal used with the file /etc/security/keytabs/user.keytab exists on this machine.

When running the application in yarn-cluster mode, the driver process is started on a random one of my Hadoop nodes. Since I had forgotten to copy the keytab file to the other nodes, the driver process of course could not find the keytab file at that local path!

So, to make Spark work on a kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the required keytab file of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster:

scp /etc/security/keytabs/user.keytab [email protected]:/etc/security/keytabs/user.keytab 
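To copy it to every node in one go, a simple loop helps (the hostnames below are placeholders for your actual cluster nodes):

for host in worker1 worker2 worker3; do 
    scp /etc/security/keytabs/user.keytab root@$host:/etc/security/keytabs/user.keytab 
done 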

After that, you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on each node of the cluster.
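Depending on your Spark version, there is also an alternative worth checking: spark-submit on YARN has --principal and --keytab options (available since Spark 1.4) that ship the keytab into the application's staging directory, so the driver can log in wherever it is placed. Whether this is acceptable depends on your security requirements:

spark-submit --master yarn --deploy-mode cluster --principal userprincipal --keytab /etc/security/keytabs/user.keytab --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar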


It looks like the job does not even start. Spark checks the available resources before starting a job, and I think the available resources are not sufficient. So try to reduce the driver and executor memory as well as the driver and executor cores in your configuration. Here you can read how to calculate proper resource values for executors and the driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
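For example, a more conservative submission might look like this (the numbers are illustrative placeholders only, not a recommendation; derive real values from your node sizes as described in the linked post):

spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 2 --driver-memory 4g --executor-cores 2 --executor-memory 4g /home/myuser/app.jar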

Your job works in client mode because in client mode the driver can use all the resources available on the node it runs on. In cluster mode, however, the resources for the driver are limited.

The difference between cluster and client mode (see the command comparison after this list):
Client:

Driver runs on a dedicated server (master node) inside a dedicated process. This means it has all available resources at its disposal to execute work. 
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all worker nodes (big advantage). 
Because the master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program. 
If the driver process dies, you need an external monitoring system to restart it. 

Cluster:

Driver runs on one of the cluster's worker nodes. The worker is chosen by the master leader. 
Driver runs as a dedicated, standalone process inside the worker. 
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured). 
The driver program can be monitored from the master node using the --supervise flag and be restarted in case it dies. 
When working in cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers. 
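To make the difference concrete, here is the same submission in both modes; only the --deploy-mode flag changes (in client mode the driver runs inside the spark-submit process on the machine you submit from):

spark-submit --master yarn --deploy-mode client --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar 
spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar 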

But the Spark UI shows the other jobs and stages, I can follow their status via the x/y progress bars, and my YARN ResourceManager UI also shows the reserved resources while the application is running.


Thank you for the explanation of cluster and client mode, very easy to understand!


Could you provide a snapshot of the YARN UI while your job is in the running state? Do you see the job logs?
