I believe that /usr/bin/python3 isn't picking up the PYTHONHASHSEED environment variable that you are defining in the cluster configuration under the spark-env scope.
You ought to use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
  {
    "classification": "spark-defaults",
    "properties": {
      // [...]
    }
  },
  {
    "classification": "spark-env",
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34",
          "PYTHONHASHSEED": "123"
        }
      }
    ],
    "properties": {
      // [...]
    }
  }
]
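As an aside, the same classification list can be supplied when the cluster is launched programmatically. Below is a minimal sketch (not part of the original setup) using boto3's run_job_flow; the region, cluster name, instance types, counts and IAM roles are placeholder assumptions, and note that boto3 expects capitalized keys (Classification, Properties) rather than the lowercase ones used in the JSON file:

# Sketch only: launching an EMR cluster with the spark-env classification above.
# Region, instance types, counts, and roles are assumptions, not prescriptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="hashseed-test",              # hypothetical cluster name
    ReleaseLabel="emr-4.8.2",          # matches the release used in this answer
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",   # placeholder instance type
        "SlaveInstanceType": "m3.xlarge",    # placeholder instance type
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": "python34",
                        "PYTHONHASHSEED": "123",
                    },
                }
            ],
            "Properties": {},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR roles, assumed to exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])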
Now, let's test it. I defined a bash script that invokes both pythons:
#!/bin/bash
echo "using python34"
for i in `seq 1 10`;
do
    python -c "print(hash('foo'))";
done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
do
    /usr/bin/python3 -c "print(hash('foo'))";
done
Verdict:
[hadoop@ip-*** ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
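To see directly whether an interpreter picked up the variable, a tiny sketch (not part of the original test) is to ask it; run it once with python34 and once with /usr/bin/python3 and compare:

import os

# Reports whether this interpreter sees the seed exported under spark-env.
print("PYTHONHASHSEED =", os.environ.get("PYTHONHASHSEED"))
# The hash is stable across runs only when the interpreter is seeded.
print("hash('foo')    =", hash("foo"))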
PS1: I am using the AMI release emr-4.8.2.
PS2: Snippet inspired by this answer.
EDIT: I have tested the following with pyspark.
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
I also created a simple application (simple_app.py):
from pyspark import SparkContext

sc = SparkContext(appName="simple-app")

# Note: this list comprehension runs on the driver only, not on the executors.
numbers = [hash('foo') for i in range(10)]
print(numbers)
This also seems to work perfectly:
[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (partial):
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see, it also returns the same hash each time.
EDIT 2: From the comments, it seems that you are trying to compute hashes on the executors and not on the driver, thus you'll need to set spark.executorEnv.PYTHONHASHSEED inside your spark application configuration so it can be propagated to the executors (it's one way to do it).
Note: in YARN client mode, setting environment variables for the executors is done with spark.executorEnv.[EnvironmentVariableName].
Thus the following minimalist example with simple_app.py:
from pyspark import SparkContext, SparkConf

# Propagate a fixed hash seed to the executors via spark.executorEnv.
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(appName="simple-app", conf=conf)

# The hashes are now computed inside executor tasks, not on the driver.
numbers = sc.parallelize(['foo'] * 10).map(lambda x: hash(x)).collect()
print(numbers)
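The same setting can also be supplied without touching the code, e.g. spark-submit --conf spark.executorEnv.PYTHONHASHSEED=123.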
Now let's test it again. Here is the truncated output:
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
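If you want to confirm that the variable really reached the executors, and not just that the hashes agree, a small sketch along the same lines is to read it back from inside the tasks (the 4 partitions below are an arbitrary choice):

import os

from pyspark import SparkContext, SparkConf

conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(appName="env-check", conf=conf)

# Each partition reports the PYTHONHASHSEED its executor's interpreter sees.
seeds = sc.parallelize(range(4), 4) \
          .map(lambda _: os.environ.get("PYTHONHASHSEED")) \
          .collect()
print(seeds)  # expected: ['123', '123', '123', '123']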
I believe that this covers it all.
Thanks for your answer, but unfortunately it doesn't seem to work. There is a problem with your script: the configuration only sets the Spark Python version to python34; the default shell python still points to Python 2.x. If you replace python with /usr/bin/python34, you will see a different hash each time. –
Your example still only runs on the driver node, in a single Python instance. If you create a parallelized collection and run it through spark-submit, you will see different hash values (or at least I did on a 3-node cluster). If you replace the numbers = ... line with: numbers = sc.parallelize(['foo'] * 10).map(lambda x: hash(x)).collect() –
Thanks, this is a great solution and does what I need. –