Scala: How can I replace values in DataFrames using Scala?


For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks

Edit:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

This is my DataFrame. I'm trying to change "Tesla" in the make column to "S".
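For the original 0.2 case, the same pattern the answers below use applies; here is a minimal sketch with when/otherwise from org.apache.spark.sql.functions, where df and the column name value are placeholders for illustration:

    import org.apache.spark.sql.functions._

    // replace every 0.2 in the hypothetical "value" column with 0; other rows keep their value
    val updated = df.withColumn("value", when(col("value") === 0.2, 0.0).otherwise(col("value")))
    updated.show()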

Note: As mentioned by Olivier Girardot, this answer is not optimized; the withColumn solution is the one to use (see Azeroth2b's answer).

Cannot delete this answer as it has been accepted.


Here is my take on this one:

import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._

val dataframe = rdd.toDF()

  dataframe.foreach(println)

dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]

You can actually call map directly on the DataFrame.

So you basically check column 1 for the String "tesla". If it is "tesla", use the value "S" for make; otherwise keep the current value of column 1.

Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it; I am not that familiar with the Spark umbrella yet.
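If you want the result as a DataFrame rather than printed rows, the mapped RDD[Row] can be turned back into one with the original schema. A minimal sketch (Spark 1.x API, reusing dataframe and sqlContext from above):

    val newRdd = dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })
    // rebuild a DataFrame from the transformed rows, reusing the original schema
    val newDf = sqlContext.createDataFrame(newRdd, dataframe.schema)
    newDf.show()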


Spark 1.6.2, Java code (sorry); this will change every instance of Tesla to S for the entire DataFrame without passing through an RDD:

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make") 
                    );

Edited to add @marshall245's otherwise(...) to ensure non-Tesla values aren't converted to NULL.


Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) call, the remainder of the column becomes null.

import org.apache.spark.sql.functions._

val newsdf = sdf.withColumn("make",
  when(col("make") === "Tesla", "S")
    .otherwise(col("make")))

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+


This can also be achieved in DataFrames with user-defined functions (UDFs).

import org.apache.spark.sql.functions._

val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

// UDF that maps "Tesla" to "S" and leaves every other value untouched
val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show

You can also use the DataFrameNaFunctions API. Using the DataFrame df defined earlier:

    val newDf = df.na.fill("e", Seq("blank"))

DataFrames are immutable structures. Each time you perform a transformation which you need to store, you'll need to assign the transformed DataFrame to a new value.

df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()

This uses replace from the DataFrameNaFunctions class, with the signature [T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.

To run this function you need an active Spark context and a DataFrame with headers on (named columns).
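A self-contained sketch of that call; df2, the Name column, and its values are placeholders inferred from the snippet above:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // hypothetical data; only the Name column matters for the replace call
    val df2 = sc.parallelize(Seq(("John", 30), ("Cindy", 25), ("Raj", 40))).toDF("Name", "Age")

    // values matching a key in the map are replaced; everything else is left untouched
    df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()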

How can I get better performance with DataFrame UDFs? If the functionality exists in the available built-in functions, using these will perform better. We use the built-in functions and the withColumn() API to add new columns. We could have also used withColumnRenamed() to replace an existing column after the transformation.
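To illustrate that advice, the UDF from the earlier answer can be rewritten with the built-in when/otherwise, which Catalyst can optimize; a sketch reusing df1 from above:

    import org.apache.spark.sql.functions._

    // same result as makeSIfTesla, but expressed with built-in functions
    df1.withColumn("make", when(col("make") === "Tesla", "S").otherwise(col("make"))).show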

Spark's withColumn() function is used to change the value of an existing DataFrame column, convert its datatype, or create a new column; renaming is done with withColumnRenamed().
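A short sketch of those operations; df and the column names follow the example data in this thread:

    import org.apache.spark.sql.functions._

    val transformed = df
      .withColumn("make", when(col("make") === "Tesla", "S")
                            .otherwise(col("make")))   // change values in an existing column
      .withColumn("year", col("year").cast("string"))  // convert a column's datatype
      .withColumn("decade", lit("2010s"))              // create a new column (constant here)
      .withColumnRenamed("blank", "notes")             // renaming is withColumnRenamed's job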

The DataFrame API provides the DataFrameNaFunctions class with a fill() function to replace null values in a DataFrame. This function has several overloaded signatures that take different data types as parameters.
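For example, assuming the df with the columns used throughout this thread, two of those overloads look like this:

    // String overload scoped to specific columns
    val filledText = df.na.fill("no comment", Seq("comment", "blank"))
    // Map overload for filling several columns of different types at once
    val filledAll  = df.na.fill(Map("comment" -> "no comment", "year" -> 0))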


Comments
  • By converting to an RDD with .rdd and using map to change 0.2 to 0?
  • What is the map command to change 0.2 to 0?
  • And how can i focus on a specific column?
  • Give us an example of your data and what you have tried so far.
  • +----+-----+-----+--------------------+-----+
    |year| make|model|             comment|blank|
    +----+-----+-----+--------------------+-----+
    |2012|Tesla|    S|          No comment|     |
    |1997| Ford| E350|Go get one now th...|     |
    |2015|Chevy| Volt|                null| null|
    +----+-----+-----+--------------------+-----+
    This is my DataFrame; I'm trying to change Tesla in the make column to S. I have just started learning Scala. Really appreciate your help!
  • Thanks for your help. I have one more question: your solution can print out the strings I want, but what if I want to change the value within the dataframe itself? When I do dataframe.show() the value is still Tesla.
  • DataFrames are based on RDDs, which are immutable. Try val newDF = dataframe.map(row => { val row1 = row.getAs[String](1); val make = if (row1.toLowerCase == "tesla") "S" else row1; Row(row(0), make, row(2)) }); that should construct a new DataFrame.
  • Thanks! It works! Feels so good! I set a new data frame and added a new column.
  • Hi! First, thanks for solving my problem. Can I convert a DataFrame to an RDD just with .rdd? Is there any risk, like changing the schema? Thanks again!
  • This will break Spark's Catalyst optimisations and is therefore not best practice; the withColumn approach is best suited for this.
  • Hey, what if I want to change a column with a value from another dataframe's column (both dataframes have an id column)? I can't seem to manage it in Java Spark.
  • This is probably better served with a select ... join on id; given that, it sounds like a new question. Hope that gets you started.
  • Why edit this one and make it the same answer as @marshall245's?
  • Where can I find the docs for the withColumn function? I actually have more conditions and more columns whose values I need to change. I found docs.azuredatabricks.net/spark/1.6/sparkr/functions/… but it's not helping. Can anyone help?