2016-11-04 73 views

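PySpark DataFrame: filter or include based on a list

I want to filter a PySpark DataFrame using a list. I want to either filter out the records whose value appears in the list, or include only the records whose value appears in the list. My code below does not work: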

# define a dataframe 
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)]) 
df = sqlContext.createDataFrame(rdd, ["id", "score"]) 

# define a list of scores 
l = [10,18,20] 

# filter out records by scores by list l 
records = df.filter(df.score in l) 
# expected: (0,1), (0,1), (0,2), (1,2) 

# include only records with these scores in list l 
records = df.where(df.score in l) 
# expected: (1,10), (1,20), (3,18), (3,18), (3,18) 

It gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Answer


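The error is saying that "df.score in l" cannot be evaluated: df.score gives you a Column, and "in" is not defined for that Column type. Use "isin" instead.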
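The mechanism behind the error can be reproduced without Spark. The sketch below uses a hypothetical FakeColumn class as a stand-in for a Spark Column (it is not the real pyspark class, and the error text is shortened). It assumes only what the error message implies: comparing a column yields an expression object, and that object refuses to be converted to a bool.

```python
# Hypothetical stand-in for a Spark Column; NOT the real pyspark.sql.Column.
class FakeColumn:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Comparison builds a new expression object instead of returning True/False.
        return FakeColumn(f"({self.name} = {other})")

    def __bool__(self):
        # An unevaluated expression has no truth value, so conversion fails.
        raise ValueError("Cannot convert column into bool")


score = FakeColumn("score")
l = [10, 18, 20]

try:
    # `in` on a list compares elements with == and then needs a plain bool.
    score in l
    outcome = "no error"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
```

Python's "in" operator is evaluated immediately by the list, whereas a method such as isin can return a column expression that Spark evaluates later, row by row.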

The code should look like this:

# define a dataframe 
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)]) 
df = sqlContext.createDataFrame(rdd, ["id", "score"]) 

# define a list of scores 
l = [10,18,20] 

# filter out records by scores by list l 
records = df.filter(~df.score.isin(l)) 
# expected: (0,1), (0,1), (0,2), (1,2) 

# include only records with these scores in list l 
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18) 
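To sanity-check the expected results without a Spark session, the same two filters can be written in plain Python over the question's rows. This is only an illustration of the intended semantics, not Spark code; the names rows, scores, excluded and included are made up for the sketch.

```python
# Same data as the question, as plain (id, score) tuples.
rows = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)]
scores = {10, 18, 20}

# Counterpart of df.filter(~df.score.isin(l)): drop rows whose score is in the list.
excluded = [row for row in rows if row[1] not in scores]

# Counterpart of df.where(df.score.isin(l)): keep only rows whose score is in the list.
included = [row for row in rows if row[1] in scores]

print(excluded)  # [(0, 1), (0, 1), (0, 2), (1, 2)]
print(included)  # [(1, 10), (1, 20), (3, 18), (3, 18), (3, 18)]
```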