Reading a file in HDFS from PySpark

I am trying to read a file in my HDFS. Here is my Hadoop file structure.

$ hadoop fs -ls -R /
drwxr-xr-x - hduser supergroup   0 2016-03-06 17:28 /inputFiles 
drwxr-xr-x - hduser supergroup   0 2016-03-06 17:31 /inputFiles/CountOfMonteCristo 
-rw-r--r-- 1 hduser supergroup 2685300 2016-03-06 17:31 /inputFiles/CountOfMonteCristo/BookText.txt

Here is my PySpark code:

from pyspark import SparkContext, SparkConf 

conf = SparkConf().setAppName("myFirstApp").setMaster("local") 
sc = SparkContext(conf=conf) 

textFile = sc.textFile("hdfs://inputFiles/CountOfMonteCristo/BookText.txt") 
textFile.first()

The error I get is:

Py4JJavaError: An error occurred while calling o64.partitions. 
: java.lang.IllegalArgumentException: java.net.UnknownHostException: inputFiles

Is this because I set up my SparkContext incorrectly? I am running this in an Ubuntu 14.04 virtual machine.

I'm not sure what I'm doing wrong here...

Source

2016-03-07 user1357015

If no configuration is provided, you can access HDFS files by their full path (namenodehost is your localhost if HDFS is located in your local environment).

hdfs://namenodehost/inputFiles/CountOfMonteCristo/BookText.txt
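As a minimal sketch of this approach, the full URI can be assembled from a host name and then passed to sc.textFile. The helper names namenode_uri and first_line are hypothetical, and "localhost" with no explicit port is only an assumption for a single-machine setup; use whatever host and port your NameNode actually listens on.

```python
def namenode_uri(path, host="localhost", port=None):
    """Build hdfs://host[:port]/path.

    host and port are placeholders (assumptions), not values from the question.
    """
    authority = host if port is None else f"{host}:{port}"
    return f"hdfs://{authority}/{path.lstrip('/')}"


def first_line(uri, app_name="myFirstApp"):
    """Return the first line of a text file.

    Requires pyspark and a reachable HDFS, so the import is done lazily.
    """
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster("local")
    sc = SparkContext(conf=conf)
    try:
        return sc.textFile(uri).first()
    finally:
        sc.stop()


uri = namenode_uri("/inputFiles/CountOfMonteCristo/BookText.txt")
print(uri)  # hdfs://localhost/inputFiles/CountOfMonteCristo/BookText.txt
```

Calling first_line(uri) would then perform the actual read, provided the NameNode is running at that address.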

Source

2016-03-07 14:43:22

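Is there a way to set the NameNode host so that it isn't hard-coded in the Python file? What is the best way to handle this? Perhaps with some kind of configuration file that can be shared across multiple applications?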
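One possible way to keep the host out of the script, sketched under assumptions: read it from an environment variable and fall back to a scheme-less path. The variable name HDFS_NAMENODE and the helper resolve_hdfs_path are hypothetical. The fallback relies on Hadoop's fs.defaultFS property (normally set in core-site.xml) being visible to Spark, for example through HADOOP_CONF_DIR; if it is not set, a scheme-less path is resolved against the local filesystem instead.

```python
import os


def resolve_hdfs_path(path, env=os.environ):
    """Return a path suitable for sc.textFile without hard-coding the host.

    HDFS_NAMENODE is a hypothetical variable holding "host" or "host:port".
    """
    namenode = env.get("HDFS_NAMENODE")
    if namenode:
        # Explicit authority taken from the environment.
        return f"hdfs://{namenode}/{path.lstrip('/')}"
    # No override: leave resolution to the Hadoop configuration (fs.defaultFS).
    return "/" + path.lstrip("/")


print(resolve_hdfs_path("/inputFiles/CountOfMonteCristo/BookText.txt", env={}))
print(resolve_hdfs_path("/inputFiles/CountOfMonteCristo/BookText.txt",
                        env={"HDFS_NAMENODE": "namenodehost"}))
```

Because fs.defaultFS lives in the Hadoop configuration directory, the same setting can be shared by several applications without any of them naming the host.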

Since you don't provide an authority, the URI should look like this:

hdfs:///inputFiles/CountOfMonteCristo/BookText.txt

Otherwise inputFiles is interpreted as a hostname. With a correct configuration you shouldn't need the scheme at all and can use:

/inputFiles/CountOfMonteCristo/BookText.txt

instead.
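The effect of the missing third slash can be seen with Python's standard URL parser. This is only an illustration of generic URI parsing, on the assumption that Hadoop's own parser splits these two strings the same way, which matches the UnknownHostException in the question.

```python
from urllib.parse import urlsplit

# Two slashes: the first path segment is taken as the authority (host).
two = urlsplit("hdfs://inputFiles/CountOfMonteCristo/BookText.txt")
# Three slashes: the authority is empty and the whole path is kept.
three = urlsplit("hdfs:///inputFiles/CountOfMonteCristo/BookText.txt")

print(repr(two.netloc), two.path)
# 'inputFiles' /CountOfMonteCristo/BookText.txt
print(repr(three.netloc), three.path)
# '' /inputFiles/CountOfMonteCristo/BookText.txt
```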

Source

2016-03-07 05:19:38 zero323
