
Filter a PySpark dataframe to keep rows containing at least one null value (keep, not drop)

Suppose I have the following PySpark dataframe:

>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
>>> df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

How can I now select, or filter for, any row containing at least one null value, like this?:

>>> df.SOME-COMMAND-HERE.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

Possible duplicate of [How to filter out a null value from spark dataframe](http://stackoverflow.com/questions/39727742/how-to-filter-out-a-null-value-from-spark-dataframe) –


No, that is not the same thing. There, they want to **filter out** rows where one specific column contains a null value. Here, I want to **filter for** rows containing **at least one** null value. –

Answers


You can use either

df.dropna() 

or

df.na.drop() 

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other. 
Parameters: 

    how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null. 
    thresh – int, default None If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter. 
    subset – optional list of column names to consider. 

>>> df4.na.drop().show() 
+---+------+-----+ 
|age|height| name| 
+---+------+-----+ 
| 10| 80|Alice| 
+---+------+-----+ 
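
As an aside, the thresh and subset parameters give finer control. For instance, assuming the same df as in the question, thresh=2 keeps only rows with at least two non-null values, so only row C would be dropped:

>>> df.na.drop(thresh=2).show() 
+---+---------+----+ 
| c1|       c2|  c3| 
+---+---------+----+ 
|  A|Amsterdam| 3.4| 
|  B|   London|null| 
|  D|     null|11.1| 
+---+---------+----+ 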

I don't want to *drop* the rows containing null values, I want to *inspect* them. I have edited the question to make that clearer. –


Construct an appropriate raw SQL query and apply it:

# Create the data frame 
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

# Compose the appropriate raw SQL query 
sql_query_base = 'SELECT * FROM df WHERE ' 
sql_query_apps = ['{} IS NULL'.format(col_name) for col_name in df.columns] 
sql_query = sql_query_base + ' OR '.join(sql_query_apps) 
sql_query 
'SELECT * FROM df WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL' 

# Register the dataframe as a SQL table 
sqlContext.registerDataFrameAsTable(df, 'df') 

# Apply raw SQL 
sqlContext.sql(sql_query).show() 
+---+------+----+ 
| c1| c2| c3| 
+---+------+----+ 
| B|London|null| 
| C| null|null| 
| D| null|11.1| 
+---+------+----+ 
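
For reference, the same predicate can be built without raw SQL by OR-ing isNull() conditions together with functools.reduce; a minimal sketch, assuming Spark 2.x, where df.filter accepts a Column expression:

from functools import reduce 
from pyspark.sql.functions import col 

# OR together an isNull() test for every column of df 
at_least_one_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns]) 
df.filter(at_least_one_null).show() 
+---+------+----+ 
| c1|    c2|  c3| 
+---+------+----+ 
|  B|London|null| 
|  C|  null|null| 
|  D|  null|11.1| 
+---+------+----+ 

Also note that on Spark 2.x, df.createOrReplaceTempView('df') is the usual replacement for the older sqlContext.registerDataFrameAsTable(df, 'df').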

Create an intermediate dataframe with the desired rows removed from the original. Then "subtract" it from the original:

# Create the data frame 
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

# Construct an intermediate dataframe without the desired rows 
df_drop = df.dropna('any') 
df_drop.show() 
+---+---------+---+ 
| c1|  c2| c3| 
+---+---------+---+ 
| A|Amsterdam|3.4| 
+---+---------+---+ 

# Then subtract it from the original to reveal the desired rows 
df.subtract(df_drop).show() 
+---+------+----+ 
| c1| c2| c3| 
+---+------+----+ 
| B|London|null| 
| C| null|null| 
| D| null|11.1| 
+---+------+----+
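
One caveat worth noting: subtract behaves like SQL EXCEPT DISTINCT, so any duplicate rows in the original dataframe would be collapsed; on Spark 2.4+, df.exceptAll(df_drop) preserves duplicates. This approach also passes over the data twice, whereas a direct filter on the null conditions does not.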