2

I want to ask if anyone has an idea how I can specify multiple conditions in PySpark when I use .join(). (PySpark: join with multiple conditions)

Example, with Hive:

query= "select a.NUMCNT,b.NUMCNT as RNUMCNT ,a.POLE,b.POLE as RPOLE,a.ACTIVITE,b.ACTIVITE as RACTIVITE FROM rapexp201412 b \ 
    join rapexp201412 a where (a.NUMCNT=b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE =b.POLE )\ 

But in PySpark I don't know how to write this, since the following:

df_rapexp201412.join(df_aeveh,df_rapexp2014.ACTIVITE==df_rapexp2014.ACTIVITE and df_rapexp2014.POLE==df_aeveh.POLE,'inner') 

does not work!

+0

Could you please paste the DataFrame.join error message? Or try keyBy/join on RDDs instead, which supports equi-join conditions well.
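For reference, the keyBy/join route this comment mentions would look roughly like the following sketch (an assumption, not code from the thread; it presumes both DataFrames expose NUMCNT, ACTIVITE, and POLE):

# Key each RDD of Rows by the composite join key, then equi-join.
left = df_rapexp201412.rdd.keyBy(lambda r: (r.NUMCNT, r.ACTIVITE, r.POLE))
right = df_aeveh.rdd.keyBy(lambda r: (r.NUMCNT, r.ACTIVITE, r.POLE))
joined = left.join(right)  # inner join on matching (NUMCNT, ACTIVITE, POLE)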

Answers

3

Quoting from the Spark documentation:

https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join

join(other, on=None, how=None)

Joins with another DataFrame, using the given join expression.

Parameters:

other – Right side of the join.
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join.
how – str, default 'inner'. One of inner, outer, left_outer, right_outer, semijoin.

The following performs a full outer join between df1 and df2.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]


>>> cond = [df.name == df3.name, df.age == df3.age] 
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect() 
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)] 

So you need to use the "condition as a list" option, as in the last example.
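Applied to the question's own DataFrames, that would look roughly like the following sketch (it assumes df_rapexp201412 and df_aeveh both carry NUMCNT, ACTIVITE, and POLE columns, as the question's query suggests):

# Build the join condition as a list of Column expressions.
cond = [df_rapexp201412.NUMCNT == df_aeveh.NUMCNT,
        df_rapexp201412.ACTIVITE == df_aeveh.ACTIVITE,
        df_rapexp201412.POLE == df_aeveh.POLE]
result = df_rapexp201412.join(df_aeveh, cond, 'inner')

Note that Python's "and", as used in the question, cannot combine Column objects (evaluating a Column's truth value raises an error), which is why the original attempt fails; pass a list of conditions as above, or combine them with the & operator.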

1
>>> cond = [df.name == df3.name, df.age == df3.age] 
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect() 
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)] 

This did not work with PySpark 1.3.1; I was getting "AssertionError: joinExprs should be Column".

Instead, I used raw SQL to join the DataFrames, as shown below.

df.registerTempTable("df") 
df3.registerTempTable("df3") 

sqlContext.sql("select df.name, df3.age from df full outer join df3 on df.name = df3.name and df.age = df3.age").collect()
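For the question's original self-join on rapexp201412, the same workaround would look roughly like this sketch (an assumption: the data is loaded in a DataFrame named df_rapexp201412):

# Register the DataFrame so it can be queried by name in SQL.
df_rapexp201412.registerTempTable("rapexp201412")
result = sqlContext.sql(
    "select a.NUMCNT, b.NUMCNT as RNUMCNT, a.POLE, b.POLE as RPOLE, "
    "a.ACTIVITE, b.ACTIVITE as RACTIVITE "
    "from rapexp201412 a join rapexp201412 b "
    "on a.NUMCNT = b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE = b.POLE")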