In Scala Spark I can easily add a column to an existing DataFrame, taking it from another DataFrame:

val newDf = df.withColumn("date_min", anotherDf("date_min"))

Doing the same in PySpark results in an AnalysisException.

Here is what I am doing:
minDf.show(5)
maxDf.show(5)
+--------------------+
| date_min|
+--------------------+
|2016-11-01 10:50:...|
|2016-11-01 11:46:...|
|2016-11-01 19:23:...|
|2016-11-01 17:01:...|
|2016-11-01 09:00:...|
+--------------------+
only showing top 5 rows
+--------------------+
| date_max|
+--------------------+
|2016-11-01 10:50:...|
|2016-11-01 11:46:...|
|2016-11-01 19:23:...|
|2016-11-01 17:01:...|
|2016-11-01 09:00:...|
+--------------------+
only showing top 5 rows
and then the line that causes the error:

newDf = minDf.withColumn("date_max", maxDf["date_max"])
AnalysisExceptionTraceback (most recent call last)
<ipython-input-13-7e19c841fa51> in <module>()
2 maxDf.show(5)
3
----> 4 newDf = minDf.withColumn("date_max", maxDf["date_max"])
/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1491 """
1492 assert isinstance(col, Column), "col should be Column"
-> 1493 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
1494
1495 @ignore_unicode_prefix
/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'resolved attribute(s) date_max#67 missing from date_min#66 in operator !Project [date_min#66, date_max#67 AS date_max#106];;\n!Project [date_min#66, date_max#67 AS date_max#106]\n+- Project [date_min#66]\n +- Project [cast((cast(date_min#6L as double)/cast(1000 as double)) as timestamp) AS date_min#66, cast((cast(date_max#7L as double)/cast(1000 as double)) as timestamp) AS date_max#67]\n +- SubqueryAlias df, `df`\n +- LogicalRDD [idvisiteur#5, date_min#6L, date_max#7L, sales_sum#8, sales_count#9L]\n'
You could try a join, if the DFs have the same length –