Replace empty strings with None/null values in DataFrame


I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.

from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2=None)])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+

## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple

## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+

It is as simple as this:

from pyspark.sql.functions import col, when

def blank_as_null(x):
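    # Keep the value when it is not an empty string; unmatched rows become null (None)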
    return when(col(x) != "", col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))

dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+

dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+
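
If the blanks can also be whitespace-only strings (an assumption beyond the original question), a trim-based variant of the same helper should cover them:

from pyspark.sql.functions import col, trim, when

def blank_or_whitespace_as_null(x):
    # Hypothetical variant: treat '' and whitespace-only values as null
    return when(trim(col(x)) != "", col(x)).otherwise(None)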

If you want to fill multiple columns you can, for example, reduce over them:

from functools import reduce  # reduce is a builtin on Python 2, but must be imported on Python 3

to_convert = set([...]) # Some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)

or use a comprehension:

exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]

testDF.select(*exprs)

If you want to operate specifically on string fields, please check the answer by robin-loxley.

My solution is much better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:

// Replace empty Strings with null values
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.StringType

private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }

  df.select(exprs: _*)
}

You can easily rewrite the function above in Python.
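For example, a minimal sketch of that rewrite (hypothetical Python names, same approach as the Scala version) could be:

from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Rewrite StringType columns; pass every other column through unchanged
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType()))
            .otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType) else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)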

I learned this trick from @liancheng.

Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls; see here. Your second approach fails because you're confusing driver-side code with executor-side DataFrame instructions: an if-else expression would be evaluated once on the driver (and not per record); you'd want to replace it with a call to the when function.
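
A minimal sketch of that correction, using the question's testDF:

from pyspark.sql.functions import col, when

# A when/otherwise expression is evaluated per record on the executors,
# unlike a plain Python if/else, which runs once on the driver
testDF.withColumn("col1", when(col("col1") == "", None).otherwise(col("col1")))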

Simply adding on top of zero323's and soulmachine's answers: to convert all StringType fields, first collect their names.

from pyspark.sql.types import StringType

# Collect the names of all StringType columns
string_fields = []
for f in test_df.schema.fields:
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
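
From there, one way to apply the conversion (a sketch reusing blank_as_null from the accepted answer):

from functools import reduce

converted_df = reduce(
    lambda df, name: df.withColumn(name, blank_as_null(name)),
    string_fields,
    test_df)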

UDFs are not terribly efficient. The correct way to do this using a built-in method is:

from pyspark.sql.functions import col, when

df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
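
The same built-in pattern extends to several columns with a plain loop, since withColumn returns a new DataFrame (the column names below are hypothetical):

for c in ['myCol', 'myOtherCol']:
    df = df.withColumn(c, when(col(c) == '', None).otherwise(col(c)))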

This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.DataTypes

def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}
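
For comparison, a rough Python approximation of that foldLeft (a sketch built on functools.reduce; arguably not as tidy, as the answer suggests):

from functools import reduce
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def empty_strings_to_none(df):
    # Fold over the schema, rewriting one StringType column per step
    return reduce(
        lambda current, field: current.withColumn(
            field.name,
            when(length(col(field.name)) == 0, lit(None)).otherwise(col(field.name)))
        if isinstance(field.dataType, StringType) else current,
        df.schema.fields,
        df)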

Comments
  • @palsch, No, it doesn't return a list. It returns a DataFrame. I updated the question with a link to the Spark documentation.
  • @palsch it's not a general Python question! Spark DataFrames are distributed data structures generally used for heavy data analysis on big data. So your solution doesn't fit.
  • @eliasah Truth be told, a Pythonic lambda x: None if not x else x wrapped with udf would work just fine :) (a sketch appears after these comments)
  • @zero323 but he asked the OP to return a list...
  • Which of the answers is most efficient?
  • Thanks @zero323. Can your answer be extended to handle many columns automatically and efficiently? Perhaps list all the column names, generate similar code as your answer for each column, and then evaluate the code?
  • I don't see any reason why you couldn't. DataFrames are lazily evaluated and the rest is just standard Python. You'll find some options in the edit.
  • I'll accept this answer, but could you please add the bit from @RobinLoxley first? Or, if you don't mind I can edit your answer.
  • @dnlbrky It wouldn't be fair.
  • The statement .otherwise(None) is not necessary. None is always returned for unmatched conditions (see spark.apache.org/docs/latest/api/python/…)
  • I'm getting a 'str' not callable error on this. Any ideas why?
  • check your parentheses
  • hmm I copied directly from here.
  • I just tested the code and it is valid. The error is likely introduced somewhere else in the manipulation of the dataframe and the error is raised only after an "Action" like collect() or show(). Do you get the same error if you do not include my code and run df.show()?
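
A minimal sketch of the udf approach mentioned in the comments (it works, but every value round-trips through Python, so it is slower than the built-in when/otherwise):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# '' is falsy, so both empty strings and None map to None
blank_as_null_udf = udf(lambda x: None if not x else x, StringType())
testDF.withColumn("col1", blank_as_null_udf(col("col1")))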