
I am new to PySpark. I installed Anaconda on Ubuntu with "bash Anaconda2-4.0.0-Linux-x86_64.sh", and also installed pyspark. Everything works fine in the terminal, but I want to work in Jupyter. How should I integrate Jupyter notebooks and pyspark on Ubuntu 12.04? I created a profile in my Ubuntu terminal as follows:

[email protected]:~$ ipython profile create pyspark 
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py' 
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py' 

[email protected]:~$ export ANACONDA_ROOT=~/anaconda2 
[email protected]:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython 
[email protected]:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python 

[email protected]:~$ cd spark-1.5.2-bin-hadoop2.6/ 
[email protected]:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS="notebook" ./bin/pyspark 
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec 6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information. 

IPython 4.1.2 -- An enhanced Interactive Python. 
?   -> Introduction and overview of IPython's features. 
%quickref -> Quick reference. 
help  -> Python's own help system. 
object? -> Details about 'object', use 'object??' for extra details. 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
16/04/24 15:27:42 INFO SparkContext: Running Spark version 1.5.2 
16/04/24 15:27:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 

16/04/24 15:27:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:33514 with 530.3 MB RAM, BlockManagerId(driver, localhost, 33514) 
16/04/24 15:27:53 INFO BlockManagerMaster: Registered BlockManager 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2 
      /_/ 

Using Python version 2.7.11 (default, Dec 6 2015 18:08:32) 
SparkContext available as sc, HiveContext available as sqlContext. 

In [1]: sc 
Out[1]: <pyspark.context.SparkContext at 0x7fc96cc6fd10> 

In [2]: print sc.version 
1.5.2 

In [3]: 

Below are the versions of jupyter and IPython:

[email protected]:~$ jupyter --version 
4.1.0 

[email protected]:~$ ipython --version 
4.1.2 

I have tried to integrate the Jupyter notebook and pyspark, but everything has failed. I want to practise in Jupyter and have no idea how to integrate the two.

Can anyone show how to integrate the above components?

Check this [link jupyter and pyspark](http://stackoverflow.com/questions/33064031/link-spark-with-ipython-notebook/33065359#33065359) – Alberto Bonsanto

@AlbertoBonsanto ... very nice ... the problem is finally solved and I have started practising on pyspark. The link you gave cleared my hurdle. – Wanderer

Answers


EDIT October 2017

With Spark 2.2 and findspark this works fine; there is no need for those env vars:

import findspark 
findspark.init('/opt/spark') 
import pyspark 
sc = pyspark.SparkContext() 
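
To confirm the findspark route worked, it helps to run a trivial job end-to-end. A minimal sketch (the /opt/spark path comes from the snippet above and is an assumption; adjust it to your own install):

import findspark
findspark.init('/opt/spark')  # path to your Spark install (assumption)

import pyspark
sc = pyspark.SparkContext()

# a tiny job proves the driver can actually reach the JVM backend
print(sc.parallelize(range(100)).sum())  # expect 4950
print(sc.version)
sc.stop()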

OLD

The quickest way I found is to run:

export PYSPARK_DRIVER_PYTHON=ipython 
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 
pyspark 

or the jupyter equivalent (PYSPARK_DRIVER_PYTHON=jupyter). This should open an ipython notebook with pyspark enabled. You might also want to check out the Beaker notebook.
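
Once the notebook opens, the first cell can verify that the kernel really went through pyspark. A minimal check, assuming a Spark 1.x launch like the one above, where sc is pre-created for you:

# run in the first notebook cell; `sc` already exists because the
# kernel was started via bin/pyspark
print(sc.version)
print(sc.parallelize([1, 2, 3]).map(lambda x: x * x).collect())  # [1, 4, 9]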

Even easier, run from the command line: `IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark`. Found [here](http://npatta01.github.io/2015/08/01/pyspark_jupyter/) – citynorman

`IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark` seems to have been removed in Spark 2.0+ – Neal


Using nano or vim, add these two lines to the pyspark file:

PYSPARK_DRIVER_PYTHON="jupyter" 
PYSPARK_DRIVER_PYTHON_OPTS="notebook" 

Simply run the command:

PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
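
Whichever of the last two variants you use, the notebook that opens pre-creates the usual entry points. A quick first-cell check (assumes Spark 2.x, where a SparkSession named spark exists alongside sc; on 1.x you get sqlContext instead):

# first cell of the freshly opened notebook
print(sc.version)  # SparkContext is available on any Spark version
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])  # Spark 2.x only
df.show()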