
I'm working through a PySpark machine learning tutorial, and I ran into a problem when trying to print the dataset table.

I'm following this tutorial here.

The problem appears when I reach the "Correlations and Data Preparation" section.

I'm trying to run this code:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

# map the binary string values to numbers
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

# drop the unused columns and convert the binary columns to numeric
CV_data = CV_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()

# same preprocessing for the final test set
final_test_data = final_test_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(final_test_data['Churn'])) \
    .withColumn('International plan', toNum(final_test_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()

This is the error message printed in the terminal:

17/06/20 17:58:53 WARN BlockManager: Putting block rdd_38_0 failed due to an exception 
17/06/20 17:58:53 WARN BlockManager: Block rdd_38_0 could not be removed as it was not found on disk or in memory 
17/06/20 17:58:53 WARN BlockManager: Putting block rdd_53_0 failed due to an exception 
17/06/20 17:58:53 WARN BlockManager: Block rdd_53_0 could not be removed as it was not found on disk or in memory 
17/06/20 17:58:53 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 16) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 1, in <lambda>
KeyError: False

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) 
    .... 

The rest of the error message can be viewed in this document here.

Does anyone know what the problem is?

Thanks in advance.

1 Answer

[SOLVED]

I solved it after referring to this thread from 2 months back.

The main problem is the one @user6910411 mentioned above: it's a data type error.
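
In other words, the Churn column in this dataset apparently holds actual booleans rather than the strings 'Yes'/'No'/'True'/'False', so the UDF's lambda looks up the boolean False in binary_map and raises KeyError: False. A minimal sketch of a direct fix, assuming that is indeed the cause, would be to give the map boolean keys as well (or to cast the boolean column without a UDF):

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

# assumption: Churn arrives as a real boolean, so the map needs boolean keys too
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0,
              True: 1.0, False: 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

# alternative: a boolean column can be cast to double directly, no UDF needed
# CV_data = CV_data.withColumn('Churn', CV_data['Churn'].cast(DoubleType()))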

Since I didn't need to print all the data as numbers, I excluded the last 3 lines of the tutorial site's code for both the CV_data and final_test_data variables:

Excluded from CV_data:

.withColumn('Churn', toNum(CV_data['Churn'])) \ 
.withColumn('International plan', toNum(CV_data['International plan'])) \ 
.withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache() 

Excluded from final_test_data:

.withColumn('Churn', toNum(final_test_data['Churn'])) \ 
.withColumn('International plan', toNum(final_test_data['International plan'])) \ 
.withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache() 
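
For reference, the trimmed statements then keep only the drops; this is just the tutorial code from above minus the withColumn conversions:

# preprocessing with the UDF conversions left out
CV_data = CV_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge').cache()

final_test_data = final_test_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge').cache()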

The table now prints:

>>> pd.DataFrame(CV_data.take(5), columns=CV_data.columns).transpose() 
17/06/21 13:49:54 WARN Executor: 1 block locks were not released by TID = 11: 
[rdd_16_0] 
                            0      1      2      3      4
Account length            128    107    137     84     75
International plan         No     No     No    Yes    Yes
Voice mail plan           Yes    Yes     No     No     No
Number vmail messages      25     26      0      0      0
Total day minutes       265.1  161.6  243.4  299.4  166.7
Total day calls           110    123    114     71    113
Total eve minutes       197.4  195.5  121.2   61.9  148.3
Total eve calls            99    103    110     88    122
Total night minutes     244.7  254.4  162.6  196.9  186.9
Total night calls          91    103    104     89    121
Total intl minutes         10   13.7   12.2    6.6   10.1
Total intl calls            3      3      5      7      3
Customer service calls      1      1      0      2      3
Churn                   False  False  False  False  False