簡答
您需要的外部IP地址訪問GCS和默認bdutil設置設置GCS作爲默認的文件系統的Hadoop,包括控制文件。相反,使用./bdutil -F hdfs ... deploy
將HDFS用作默認值。
使用外部IP地址時不應擔心安全問題,除非您在GCE網絡配置中爲防火牆規則添加了過多的寬容規則。
編輯:目前似乎有一個錯誤,我們將spark.eventLog.dir
設置爲GCS路徑,即使default_fs是hdfs。我提交了https://github.com/GoogleCloudPlatform/bdutil/issues/35來跟蹤這個。與此同時,只需在您的主設備上手動編輯/home/hadoop/spark-install/conf/spark-defaults.conf
(您可能需要sudo -u hadoop vim.tiny /home/hadoop/spark-install/conf/spark-defaults.conf
才能擁有編輯權限)才能將spark.eventLog.dir
設置爲hdfs:///spark-eventlog-base
或HDFS中的其他設置,然後運行以使其工作。
長的答案
默認情況下,bdutil還配置谷歌雲存儲作爲「默認文件系統Hadoop的」,這意味着由Spark和YARN使用的控制文件需要訪問谷歌雲存儲。此外,爲了訪問Google雲端存儲,還需要外部IP。
手動配置網絡內SSH後,我確實設法部分複製您的案例;在啓動過程中我居然看到以下內容:
15/06/26 17:23:05 INFO yarn.Client: Preparing resources for our AM container
15/06/26 17:23:05 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/26 17:23:26 WARN http.HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:625)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1557)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1512)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.getItemInfo(CacheSupplementedGoogleCloudStorage.java:516)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1016)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.exists(GoogleCloudStorageFileSystem.java:382)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configureBuckets(GoogleHadoopFileSystemBase.java:1639)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.configureBuckets(GoogleHadoopFileSystem.java:71)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1587)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:776)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:739)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:216)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:381)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1016)
at $line3.$read$$iwC$$iwC.<init>(<console>:9)
at $line3.$read$$iwC.<init>(<console>:18)
at $line3.$read.<init>(<console>:20)
at $line3.$read$.<init>(<console>:24)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:973)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
正如預期的那樣,簡單地通過調用org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start
它試圖聯繫谷歌雲存儲,並且不能因爲沒有沒有外部IP地址訪問GCS。
爲了避免這種情況,您可以在創建集羣時簡單地使用-F hdfs
以將HDFS用作默認文件系統;在這種情況下,即使沒有外部IP地址,一切都應該在羣集內工作。在該模式下,通過指定完整的gs://bucket/object
路徑作爲您的Hadoop參數,您可以在您分配了外部IP地址時甚至繼續使用GCS。但是請注意,在這種情況下,只要您刪除了外部IP地址,您將無法使用GCS,除非您還通過代理配置了代理服務器和所有數據; GCS配置爲fs.gs.proxy.address
。
一般而言,除非您在Google Compute Engine的「默認」網絡防火牆規則中打開了新的允許規則,否則不必擔心安全問題。
謝謝丹尼斯!這正是我想要發現的。 –
只是FYI,當我運行它時,我發現https://github.com/GoogleCloudPlatform/bdutil/issues/35,即使使用-F hdfs,當您刪除IP地址時,它也不能很好地工作。但似乎解決方法相當容易應用,我在上面的「編輯」聲明中添加了我的答案;您只需將'spark.eventLog.dir'指向新的HDFS路徑而不是gs://。到目前爲止,這似乎在沒有外部IP地址的情況下在我的測試用例中正常工作。 –