2017-06-21 92 views

pyspark: AnalysisException when joining two DataFrames

I have created two DataFrames from Spark SQL:

df1 = sqlContext.sql(""" ...""") 
df2 = sqlContext.sql(""" ...""") 

I am trying to join these two DataFrames on the column my_id like below:

from pyspark.sql.functions import col 

combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner') 

Then I got the following error. Any idea what I am missing? Thanks!

AnalysisException       Traceback (most recent call last) 
<ipython-input-11-45f5313387cc> in <module>() 
     3 from pyspark.sql.functions import col 
     4 
----> 5 combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner') 
     6 combined_df.take(10) 

/usr/local/spark-latest/python/pyspark/sql/dataframe.py in join(self, other, on, how) 
    770     how = "inner" 
    771    assert isinstance(how, basestring), "how should be basestring" 
--> 772    jdf = self._jdf.join(other._jdf, on, how) 
    773   return DataFrame(jdf, self.sql_ctx) 
    774 

/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args) 
    1131   answer = self.gateway_client.send_command(command) 
    1132   return_value = get_return_value(
-> 1133    answer, self.gateway_client, self.target_id, self.name) 
    1134 
    1135   for temp_arg in temp_args: 

/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw) 
    67            e.java_exception.getStackTrace())) 
    68    if s.startswith('org.apache.spark.sql.AnalysisException: '): 
---> 69     raise AnalysisException(s.split(': ', 1)[1], stackTrace) 
    70    if s.startswith('org.apache.spark.sql.catalyst.analysis'): 
    71     raise AnalysisException(s.split(': ', 1)[1], stackTrace) 

AnalysisException: "cannot resolve '`df1.my_id`' given input columns: [... 

Answers


I think the problem with your code is that you are trying to pass "df1.my_id" as a column name instead of just col('my_id'). That is why the error says cannot resolve df1.my_id given input columns.

You can do this without importing col:

combined_df = df1.join(df2, df1.my_id == df2.my_id, 'inner') 

Not sure about pyspark, but this should work if both DataFrames have the same field name:

combineDf = df1.join(df2, 'my_id', 'outer') 

Hope this helps!