2016-12-05 123 views
1

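Testing Spark with pytest - cannot run Spark in local mode

I want to run the wordcount test with pytest, following the article "Unit testing Apache Spark with py.test". The problem is that I cannot start the Spark context. This is the code I use to create the Spark context: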

@pytest.fixture(scope="session") 
def spark_context(request): 
    """ fixture for creating a spark context 
    Args: 
     request: pytest.FixtureRequest object 
    """ 
    conf = (SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")) 
    sc = SparkContext(conf=conf) 
    request.addfinalizer(lambda: sc.stop()) 

    quiet_py4j() 
    return sc 

I run this code with the following commands:

#first way 
pytest spark_context_fixture.py 

#second way 
python spark_context_fixture.py 

Output:

platform linux2 -- Python 2.7.5, pytest-3.0.4, py-1.4.31, pluggy-0.4.0 
rootdir: /home/mgr/test, inifile: 
collected 0 items 

Then I want to run the wordcount test with pytest:

pytestmark = pytest.mark.usefixtures("spark_context") 

def test_do_word_counts(spark_context): 
    """ test word counting 
    Args: 
     spark_context: test fixture SparkContext 
    """ 
    test_input = [ 
     ' hello spark ', 
     ' hello again spark spark' 
    ] 

    input_rdd = spark_context.parallelize(test_input, 1) 
    results = wordcount.do_word_counts(input_rdd) 

    expected_results = {'hello':2, 'spark':3, 'again':1} 
    assert results == expected_results 

But the output is:

________ ERROR at setup of test_do_word_counts _________ 
file /home/mgrabowski/test/wordcount_test.py, line 5 
    def test_do_word_counts(spark_context): 
E  fixture 'spark_context' not found 
>  available fixtures: cache, capfd, capsys, doctest_namespace, monkeypatch, pytestconfig, record_xml_property, recwarn, tmpdir, tmpdir_factory 
>  use 'pytest --fixtures [testpath]' for help on them. 

Does anyone know what causes this problem?

+0

Do you have Spark installed on your machine? – Yaron

+0

Yes, I have Spark 1.6 installed. I can run pyspark from the command line, so that part seems fine. –

Answers

3

I did some research and finally found a solution. I use Spark 1.6.

First, I added two lines to my .bashrc file.

export SPARK_HOME=/usr/hdp/2.5.0.0-1245/spark 
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
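
As an alternative to editing .bashrc, the same two paths can be added from Python before pyspark is imported. This is only a minimal sketch, assuming SPARK_HOME is already exported. The helper name add_spark_to_sys_path is hypothetical and not part of Spark or pytest:

```python
import glob
import os
import sys


def add_spark_to_sys_path(spark_home=None):
    """Hypothetical helper: put Spark's Python sources and bundled py4j zip on sys.path."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")

    spark_python = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it with a glob
    # instead of hard-coding py4j-0.9-src.zip.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip")))

    added = []
    for path in [spark_python] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    # Return only the entries that were newly added, which makes the call idempotent.
    return added
```

Calling it at the top of conftest.py, before any `from pyspark import ...` line, should have the same effect as the PYTHONPATH export for that test session.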

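Then I created the file "conftest.py". The file name matters: pytest automatically loads fixtures from files named conftest.py, so do not rename it, otherwise you will see the "fixture 'spark_context' not found" error again. If you run Spark in local mode and do not use YARN, conftest.py should look like this: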

import logging 
import pytest 

from pyspark import HiveContext 
from pyspark import SparkConf 
from pyspark import SparkContext 
from pyspark.streaming import StreamingContext 

def quiet_py4j(): 
    logger = logging.getLogger('py4j') 
    logger.setLevel(logging.WARN) 

@pytest.fixture(scope="session") 
def spark_context(request): 
    conf = (SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")) 
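    sc = SparkContext(conf=conf) 
    # Register the finalizer only after the context exists, so a failed 
    # startup does not raise a NameError during teardown. 
    request.addfinalizer(lambda: sc.stop()) 

    quiet_py4j() 
    return sc 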

@pytest.fixture(scope="session") 
def hive_context(spark_context): 
    return HiveContext(spark_context) 

@pytest.fixture(scope="session") 
def streaming_context(spark_context): 
    return StreamingContext(spark_context, 1) 

Now you can run the tests with a plain pytest command. pytest should start Spark and stop it when the test session ends.

If you use YARN, you can change conftest.py to:

import logging 
import pytest 

from pyspark import HiveContext 
from pyspark import SparkConf 
from pyspark import SparkContext 
from pyspark.streaming import StreamingContext 

def quiet_py4j(): 
    """ turn down spark logging for the test context """ 
    logger = logging.getLogger('py4j') 
    logger.setLevel(logging.WARN) 

@pytest.fixture(scope="session", 
      params=[pytest.mark.spark_local('local'), 
        pytest.mark.spark_yarn('yarn')]) 
def spark_context(request): 
    if request.param == 'local': 
     conf = (SparkConf() 
       .setMaster("local[2]") 
       .setAppName("pytest-pyspark-local-testing") 
       ) 
    elif request.param == 'yarn': 
     conf = (SparkConf() 
       .setMaster("yarn-client") 
       .setAppName("pytest-pyspark-yarn-testing") 
       .set("spark.executor.memory", "1g") 
       .set("spark.executor.instances", 2) 
       ) 
    sc = SparkContext(conf=conf) 
    # Register the finalizer only after the context exists. 
    request.addfinalizer(lambda: sc.stop()) 
    return sc 

@pytest.fixture(scope="session") 
def hive_context(spark_context): 
    return HiveContext(spark_context) 

@pytest.fixture(scope="session") 
def streaming_context(spark_context): 
    return StreamingContext(spark_context, 1) 

Now you can run the tests in local mode with py.test -m spark_local and in YARN mode with py.test -m spark_yarn.
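
The markers spark_local and spark_yarn are custom names. Newer pytest releases warn about markers that are not registered, so it may help to declare them. A minimal sketch that could be added to conftest.py, using pytest's pytest_configure hook and config.addinivalue_line. The description strings are my own wording, not from the original setup:

```python
def pytest_configure(config):
    """pytest hook: register the two custom markers used to select a Spark mode."""
    config.addinivalue_line(
        "markers", "spark_local: run the test against a local[2] SparkContext")
    config.addinivalue_line(
        "markers", "spark_yarn: run the test against a YARN-backed SparkContext")
```

Also note that the `pytest.mark.spark_local('local')` parameter form matches the pytest 3.0 used here. As far as I know, later pytest versions express the same thing as `pytest.param('local', marks=pytest.mark.spark_local)`.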

Wordcount example

Create three files in the same folder: conftest.py (shown above), wordcount.py:

def do_word_counts(lines): 
    counts = (lines.flatMap(lambda x: x.split()) 
        .map(lambda x: (x, 1)) 
        .reduceByKey(lambda x, y: x+y) 
      ) 
    results = {word: count for word, count in counts.collect()} 
    return results 

And wordcount_test.py:

import pytest 
import wordcount 

pytestmark = pytest.mark.usefixtures("spark_context") 

def test_do_word_counts(spark_context): 
    test_input = [ 
     ' hello spark ', 
     ' hello again spark spark' 
    ] 

    input_rdd = spark_context.parallelize(test_input, 1) 
    results = wordcount.do_word_counts(input_rdd) 

    expected_results = {'hello':2, 'spark':3, 'again':1} 
    assert results == expected_results 

Now you can run the tests by invoking pytest.
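
To check the expected dictionary without starting Spark, the same flatMap / map / reduceByKey logic can be mimicked in plain Python. This is only an illustrative sketch. The function do_word_counts_local is a hypothetical stand-in and not part of the files above:

```python
from collections import Counter


def do_word_counts_local(lines):
    """Plain-Python stand-in for wordcount.do_word_counts (hypothetical, no Spark needed)."""
    counts = Counter()
    for line in lines:
        # str.split() with no arguments drops leading and trailing whitespace,
        # which is why ' hello spark ' contributes exactly two words.
        counts.update(line.split())
    return dict(counts)


test_input = [' hello spark ', ' hello again spark spark']
# Same expectation as in wordcount_test.py.
assert do_word_counts_local(test_input) == {'hello': 2, 'spark': 3, 'again': 1}
```

If this passes but the pytest run fails, the problem is more likely in the Spark setup (SPARK_HOME, PYTHONPATH, conftest.py) than in the counting logic.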

+0

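This is great. Thank you. One question: if I have a bigger project and want to organize my Spark tests across several folders, how do I make conftest.py work, given that it seems to matter that it sits in the same folder? –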
