Spark SQL: joining two DataFrames/Datasets that have the same column name
I have the following two datasets:
controlSetDF : has columns loan_id, merchant_id, loan_type, created_date, as_of_date
accountDF : has columns merchant_id, id, name, status, merchant_risk_status
I am joining them with the Spark Java API, and I need only specific columns in the final dataset:
private String[] control_set_columns = {"loan_id", "merchant_id", "loan_type"};
private String[] sf_account_columns = {"id as account_id", "name as account_name", "merchant_risk_status"};
controlSetDF.selectExpr(control_set_columns)
.join(accountDF.selectExpr(sf_account_columns),controlSetDF.col("merchant_id").equalTo(accountDF.col("merchant_id")),
"left_outer");
but I get the following error:
org.apache.spark.sql.AnalysisException: resolved attribute(s) merchant_id#3L missing from account_name#131,loan_type#105,account_id#130,merchant_id#104L,loan_id#103,merchant_risk_status#2 in operator !Join LeftOuter, (merchant_id#104L = merchant_id#3L);;!Join LeftOuter, (merchant_id#104L = merchant_id#3L)
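This seems to be a problem because both DataFrames already have a merchant_id column.
Note: if I don't use .selectExpr() it works fine, but then the result shows all columns from both the first and the second dataset.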
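One reading of the error message: the attribute reported as missing, merchant_id#3L, is the one referenced by accountDF.col("merchant_id"), and the right-hand side of the join is accountDF.selectExpr(sf_account_columns), which no longer contains merchant_id. A possible way around it is to join first and project afterwards. The sketch below assumes Spark 2.x or later (where a Java DataFrame is Dataset<Row>); the class name JoinThenSelect, the method joinAndSelect, and the aliases "c" and "a" are made up for illustration and are not from the original post.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JoinThenSelect {

    /**
     * Left-outer joins controlSetDF to accountDF on merchant_id and only then
     * projects the columns that should appear in the final dataset.
     */
    public static Dataset<Row> joinAndSelect(Dataset<Row> controlSetDF,
                                             Dataset<Row> accountDF) {
        // Hypothetical aliases so that the two merchant_id columns can be
        // told apart by a qualified name.
        Dataset<Row> c = controlSetDF.alias("c");
        Dataset<Row> a = accountDF.alias("a");

        return c
                // Both sides still carry merchant_id here, so the join
                // condition can be resolved.
                .join(a, col("c.merchant_id").equalTo(col("a.merchant_id")), "left_outer")
                // Project after the join; a.merchant_id is simply not selected.
                .selectExpr(
                        "c.loan_id",
                        "c.merchant_id",
                        "c.loan_type",
                        "a.id as account_id",
                        "a.name as account_name",
                        "a.merchant_risk_status");
    }
}
```

An alternative that keeps the original shape would be to add "merchant_id" to sf_account_columns so the right-hand side still has the join key, and then drop the duplicate column after the join.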
Thanks @Silvio. This worked. – NewQueries