
Filter a PySpark dataframe to keep rows containing at least one null value (keep, not drop)

Suppose I have the following PySpark dataframe:

>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
>>> df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

How can I now select, or filter for, any row containing at least one null value, like this?:

>>> df.SOME-COMMAND-HERE.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

Possible duplicate of [How to filter out a null value from spark dataframe](http://stackoverflow.com/questions/39727742/how-to-filter-out-a-null-value-from-spark-dataframe) –


No, that is not the same thing. There, they want to **filter out** rows where one specific column contains a null value. Here, I want to **filter for** rows containing **at least one** null value. –

Answers


You can use either

df.dropna() 

or

df.na.drop() 

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other. 
Parameters: 

    how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null. 
    thresh – int, default None If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter. 
    subset – optional list of column names to consider. 

>>> df4.na.drop().show() 
+---+------+-----+ 
|age|height| name| 
+---+------+-----+ 
| 10| 80|Alice| 
+---+------+-----+ 
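
As an aside, the thresh and subset parameters give finer control. For instance, assuming the same df as in the question, thresh=2 keeps only rows with at least two non-null values, so only row C would be dropped:

>>> df.na.drop(thresh=2).show() 
+---+---------+----+ 
| c1|       c2|  c3| 
+---+---------+----+ 
|  A|Amsterdam| 3.4| 
|  B|   London|null| 
|  D|     null|11.1| 
+---+---------+----+ 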

I don't want to *drop* the rows containing null values, I want to *inspect* them. I have edited the question to make that clearer. –


Construct an appropriate raw SQL query and apply it:

# Create the data frame 
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

# Compose the appropriate raw SQL query 
sql_query_base = 'SELECT * FROM df WHERE ' 
sql_query_apps = ['{} IS NULL'.format(col_name) for col_name in df.columns] 
sql_query = sql_query_base + ' OR '.join(sql_query_apps) 
sql_query 
'SELECT * FROM df WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL' 

# Register the dataframe as a SQL table 
sqlContext.registerDataFrameAsTable(df, 'df') 

# Apply raw SQL 
sqlContext.sql(sql_query).show() 
+---+------+----+ 
| c1| c2| c3| 
+---+------+----+ 
| B|London|null| 
| C| null|null| 
| D| null|11.1| 
+---+------+----+ 
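
For reference, the same predicate can be built without raw SQL by OR-ing isNull() conditions together with functools.reduce; a minimal sketch, assuming Spark 2.x, where df.filter accepts a Column expression:

from functools import reduce 
from pyspark.sql.functions import col 

# OR together an isNull() test for every column of df 
at_least_one_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns]) 
df.filter(at_least_one_null).show() 
+---+------+----+ 
| c1|    c2|  c3| 
+---+------+----+ 
|  B|London|null| 
|  C|  null|null| 
|  D|  null|11.1| 
+---+------+----+ 

Also note that on Spark 2.x, df.createOrReplaceTempView('df') is the usual replacement for the older sqlContext.registerDataFrameAsTable(df, 'df').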

Create an intermediate dataframe with the desired rows removed from the original. Then "subtract" it from the original:

# Create the data frame 
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3']) 
df.show() 
+---+---------+----+ 
| c1|  c2| c3| 
+---+---------+----+ 
| A|Amsterdam| 3.4| 
| B| London|null| 
| C|  null|null| 
| D|  null|11.1| 
+---+---------+----+ 

# Construct an intermediate dataframe without the desired rows 
df_drop = df.dropna('any') 
df_drop.show() 
+---+---------+---+ 
| c1|  c2| c3| 
+---+---------+---+ 
| A|Amsterdam|3.4| 
+---+---------+---+ 

# Then subtract it from the original to reveal the desired rows 
df.subtract(df_drop).show() 
+---+------+----+ 
| c1| c2| c3| 
+---+------+----+ 
| B|London|null| 
| C| null|null| 
| D| null|11.1| 
+---+------+----+
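
One caveat worth noting: subtract behaves like SQL EXCEPT DISTINCT, so any duplicate rows in the original dataframe would be collapsed; on Spark 2.4+, df.exceptAll(df_drop) preserves duplicates. This approach also passes over the data twice, whereas a direct filter on the null conditions does not.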