Drop a Spark DataFrame column if all of its entries are null

Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value, or, equivalently, remove all columns which contain no data at all?

Edited, as per Suresh's request:

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

Here I assumed that if the distinct count is one, the column should contain only NaN. But I want to check whether that single value actually is NaN, and if there is any other inbuilt Spark function for this, let me know.
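
For illustration, here is a minimal sketch of the kind of check I have in mind (the explicit None/NaN test is just my assumption; I don't know of a built-in shortcut for it):

for column in media.columns:
    distinct_vals = media.select(column).distinct().collect()
    if len(distinct_vals) == 1:
        value = distinct_vals[0][0]
        # Drop only when the single remaining value is null (None) or a float NaN
        if value is None or (isinstance(value, float) and value != value):
            media = media.drop(column)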

One indirect way to do this is:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)

Update: the above code drops columns that are all NaN. If you are looking for all nulls, then:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)

Will update my answer if I find some optimal way :-)
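
In the meantime, a rough sketch of one possible direction (just an idea, not benchmarked): compute the total row count once and fold the null and NaN checks into a single condition per column. isnan() is only defined for float/double columns, hence the type check.

import pyspark.sql.functions as func
from pyspark.sql.types import DoubleType, FloatType

total = sdf.count()
for c in sdf.columns:
    cond = func.col(c).isNull()
    # isnan() only applies to floating-point columns
    if isinstance(sdf.schema[c].dataType, (DoubleType, FloatType)):
        cond = cond | func.isnan(func.col(c))
    if sdf.filter(cond).count() == total:
        sdf = sdf.drop(c)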

I tried it my way; let me know if it works.

Say I have a dataframe as below (with pyspark.sql.functions imported as F):

>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+

>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+
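
Note that F.count() skips nulls but still counts NaN values. If NaN should also be treated as missing, one possible tweak (my assumption, not part of the example above) is to mask NaNs before counting:

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, FloatType

exprs = []
for c in df.columns:
    col_expr = F.col(c)
    if isinstance(df.schema[c].dataType, (DoubleType, FloatType)):
        # NaN entries become null here, so count() skips them too
        col_expr = F.when(~F.isnan(col_expr), col_expr)
    exprs.append(F.count(col_expr).alias(c))

counts = df.agg(*exprs).first()
df = df.select(*[c for c in df.columns if counts[c] > 0])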

This is a function I have in my pipeline to remove null columns. Hope it helps!

# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # A set of all the null values you can encounter
    null_set = {"none", "null" , "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the first distinct value of the column
        unique_val = df.select(col).distinct().collect()[0][0]
        # See whether that single value reads as none/nan/null
        if str(unique_val).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return(df)
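
A quick usage sketch (the sample data and the spark session here are just for illustration, not part of my pipeline):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("keep_me", IntegerType()),
    StructField("all_null", StringType()),
])
sample_df = spark.createDataFrame([(1, None), (2, None)], schema)

cleaned_df = dropNullColumns(sample_df)
cleaned_df.show()  # only `keep_me` should remain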

For me it worked in a slightly different way than @Suresh's answer:

import pyspark.sql.functions as func

nonNull_cols = [c for c in original_df.columns if original_df.filter(func.col(c).isNotNull()).count() > 0]
new_df = original_df.select(*nonNull_cols)

Or just

from pyspark.sql.functions import col

for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)

Comments
  • Possible duplicate of Difference between na().drop() and filter(col.isNotNull) (Apache Spark)
  • This is about removing columns, not rows.
  • So, do you want to remove a column even if it has a single null value, or only when all its values are null? Can you post what you have tried, along with input and output samples?
  • I think it should work. If all values of a column are null, I believe the datatype won't matter. Just try it and let us know.
  • @Abhisek Your function also drops columns that have one distinct value. Try your function with the following example data. data_2 = { 'furniture': [np.NaN ,np.NaN ,True], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ["pants", "shoes", "socks"]} df_1 = pd.DataFrame(data_2) ddf_1 = spark.createDataFrame(df_1) You will see that the furniture column will be dropped although in fact it should not be dropped.
  • This code still leaves columns containing all zeros.