Replace values of one column in a Spark df by dictionary key-values (PySpark)

I got stuck on a data transformation task in PySpark. I want to replace all values of one column in a df with the key-value pairs specified in a dictionary.

dict = {'A':1, 'B':2, 'C':3}

My df looks like this:

+-----------+-----------+
|       col1|       col2|
+-----------+-----------+
|          B|          A|
|          A|          A|
|          A|          A|
|          C|          B|
|          A|          A|
+-----------+-----------+
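
A minimal sketch to reproduce this df, assuming an active SparkSession named spark (not shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the example df shown above
df = spark.createDataFrame(
    [('B', 'A'), ('A', 'A'), ('A', 'A'), ('C', 'B'), ('A', 'A')],
    ['col1', 'col2'])
df.show()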

Now I want to replace all values of col1 with the key-value pairs defined in dict.

Desired Output:

+-----------+-----------+
|       col1|       col2|
+-----------+-----------+
|          2|          A|
|          1|          A|
|          1|          A|
|          3|          B|
|          1|          A|
+-----------+-----------+

I tried

df.na.replace(dict, 1).show()

but that also replaces the values in col2, which should stay untouched.

Thank you for your help. Greetings :)

Your data:

print(df)
DataFrame[col1: string, col2: string]
df.show()
+----+----+
|col1|col2|
+----+----+
|   B|   A|
|   A|   A|
|   A|   A|
|   C|   B|
|   A|   A|
+----+----+

diz = {"A": 1, "B": 2, "C": 3}

Convert the values of your dictionary from integer to string, so that you don't get an error for replacing values of different types:

diz = {k: str(v) for k, v in diz.items()}

print(diz)
{'A': '1', 'C': '3', 'B': '2'}

Replace the values of col1:

df2 = df.na.replace(diz, 1, "col1")
print(df2)
DataFrame[col1: string, col2: string]

df2.show()
+----+----+
|col1|col2|
+----+----+
|   2|   A|
|   1|   A|
|   1|   A|
|   3|   B|
|   1|   A|
+----+----+

If you need to cast your values from string to integer:

from pyspark.sql.types import IntegerType

df3 = df2.select(df2["col1"].cast(IntegerType()), df2["col2"])
print(df3)
DataFrame[col1: int, col2: string]

df3.show()
+----+----+
|col1|col2|
+----+----+
|   2|   A|
|   1|   A|
|   1|   A| 
|   3|   B|
|   1|   A|
+----+----+

You can also create a simple UDF with a lambda function that looks up the dictionary values and updates your DataFrame column.

+----+----+
|col1|col2|
+----+----+
|   B|   A|
|   A|   A|
|   A|   A|
|   A|   A|
|   C|   B|
|   A|   A|
+----+----+

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

mapping = {'A': 1, 'B': 2, 'C': 3}  # a plain dict; avoid naming it dict, which shadows the built-in

# look up each value of col1 in the dictionary; unmatched keys become null
user_func = udf(lambda x: mapping.get(x), IntegerType())
newdf = df.withColumn('col1', user_func(df.col1))

>>> newdf.show()
+----+----+
|col1|col2|
+----+----+
|   2|   A|
|   1|   A|
|   1|   A|
|   1|   A|
|   3|   B|
|   1|   A|
+----+----+

I hope this also works!
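
As a follow-up on the design choice: a Python UDF pushes every row through the Python interpreter. If you want to stay on the JVM side, the same dictionary lookup can be written as a literal map column; a sketch, assuming the mapping dictionary from above:

from itertools import chain
from pyspark.sql.functions import create_map, lit

# build a map literal from the dictionary: map('A', 1, 'B', 2, 'C', 3)
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

# look up col1 in the map; keys without a match become null
newdf = df.withColumn('col1', mapping_expr.getItem(df.col1))
newdf.show()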

Before replacing the values of col1 in my df, I had to automate the generation of my dictionary (given the many keys). This was done as follows:

keys = sorted(df.select('col1').rdd.flatMap(lambda x: x).distinct().collect())

keys
['A', 'B', 'C']

import numpy

maxval = len(keys)
values = list(numpy.array(list(range(maxval))) + 1)

values
[1, 2, 3]

Making sure (as titiro89 mentions above) that the type of the 'new' values is the same as the type of the 'old' values (string in this case):

dct = {k: str(v) for k, v in zip(keys, values)}
print(dct)

{'A': '1', 'B': '2', 'C': '3'}
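
As a side note, the keys/values two-step above can be collapsed into a single comprehension; a small equivalent sketch:

# number the sorted keys starting at 1 and stringify in one step
dct = {k: str(i) for i, k in enumerate(keys, start=1)}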

df2 = df.replace(dct, 1, "col1")

Comments
  • I believe that your problem is a use case for Spark broadcast variables; see the sketch after these comments. Check out spark.apache.org/docs/2.4.0/…
  • What if there is a list of values against each key? How would I achieve that?
  • I think that the question in your comment should be asked as a separate Stack Overflow question, so you could provide specific examples of what you mean and receive a more accurate and complete answer
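
A minimal sketch of the broadcast-variable approach from the first comment, assuming a live SparkContext named sc and the mapping dictionary from the UDF answer:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

mapping = {'A': 1, 'B': 2, 'C': 3}
bc = sc.broadcast(mapping)  # ship the dict to the executors once

# look up each value through the broadcast copy; unmatched keys become null
lookup = udf(lambda x: bc.value.get(x), IntegerType())
df.withColumn('col1', lookup(df.col1)).show()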