Create an array column from other columns after processing the column values

Let's say I have a Spark DataFrame that includes the categorical columns (School, Type, Group):

------------------------------------------------------------
StudentID  |  School |   Type        |  Group               
------------------------------------------------------------
1          |  ABC    |   Elementary  |  Music-Arts          
2          |  ABC    |   Elementary  |  Football            
3          |  DEF    |   Secondary   |  Basketball-Cricket  
4          |  DEF    |   Secondary   |  Cricket             
------------------------------------------------------------

I need to add one more column to the DataFrame, as below:

--------------------------------------------------------------------------------------
StudentID  |  School |   Type        |  Group               |  Combined Array
---------------------------------------------------------------------------------------
1          |  ABC    |   Elementary  |  Music-Arts          | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2          |  ABC    |   Elementary  |  Football            | ["School: ABC", "Type: Elementary", "Group: Football"]
3          |  DEF    |   Secondary   |  Basketball-Cricket  | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4          |  DEF    |   Secondary   |  Cricket             | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------

The extra column is a combination of all the categorical columns, but includes different processing on the 'Group' column: its values need to be split on '-'.

All the categorical columns, including 'Group', are given in a list. The name of the column to split on ('Group') is also passed in as a String. The DataFrame has other columns which are not used.

I am looking for the best performance solution.

If it were a simple array, it could be done with a single 'withColumn' transformation:

val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))

However, because of the additional processing on the 'Group' column, the solution here doesn't seem straightforward.

Using regex replacement at the start of each field and on the '-' in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map { colName =>
  regexp_replace(regexp_replace(df1(colName), "(^)", s"$colName: "), "(-)", s", $colName: ")
}: _*))
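
Note that this regex approach keeps the split 'Group' values inside a single array element ("Group: Music, Group: Arts") rather than producing one element per value. A minimal sketch that yields separate elements, assuming Spark 2.4+ (for flatten) and that the column values never contain '|':

import org.apache.spark.sql.functions._

val columns = List("School", "Type", "Group")
val splitCol = "Group" // name of the column whose values are split on '-'

// Build one array per source column, then flatten them into a single array.
// For the split column, prefix the first value, turn each '-' into a
// "|<label>: " boundary, and split on '|' to get separate elements.
val perColumn = columns.map {
  case c if c == splitCol =>
    split(regexp_replace(concat(lit(s"$c: "), col(c)), "-", s"|$c: "), "\\|")
  case c =>
    array(concat(lit(s"$c: "), col(c)))
}
val df2 = df1.withColumn("CombinedArray", flatten(array(perColumn: _*)))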

Using spark.sql(), check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
  .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type      |group             |combined_array                                        |
+------+----------+------------------+------------------------------------------------------+
|ABC   |Elementary|Music-Arts        |[School:ABC, type:Elementary, group:Music-Arts]       |
|ABC   |Elementary|Football          |[School:ABC, type:Elementary, group:Football]         |
|DEF   |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF   |Secondary |Cricket           |[School:DEF, type:Secondary, group:Cricket]           |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 |    |-- element: string (containsNull = true)

Update:

Dynamically constructing the SQL columns:

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array  from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type      |Group             |combined_array                                     |
+------+----------+------------------+---------------------------------------------------+
|ABC   |Elementary|Music-Arts        |[SchoolABC, TypeElementary, GroupMusic-Arts]       |
|ABC   |Elementary|Football          |[SchoolABC, TypeElementary, GroupFootball]         |
|DEF   |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF   |Secondary |Cricket           |[SchoolDEF, TypeSecondary, GroupCricket]           |
+------+----------+------------------+---------------------------------------------------+


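Note that the generated concat('School',School) has no separator between the label and the value, which is why the output shows SchoolABC rather than School: ABC. A small tweak to the builder (assuming the "label: value" format from the question) adds one:

val arr = df.columns
  .map(x => s"concat('$x: ', $x)")
  .mkString("array(", ",", ") as combined_array")
// array(concat('School: ', School),concat('Type: ', Type),concat('Group: ', Group)) as combined_array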

Update2:

scala>  val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala>  val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group             |type      |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1        |ABC   |Music-Arts        |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts]                      |
|2        |ABC   |Football          |Elementary|[School:ABC, type:Elementary, Group:Football]                              |
|3        |DEF   |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket]               |
|4        |DEF   |Cricket           |Secondary |[School:DEF, type:Secondary, Group:Cricket]                                |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


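Note that concat_ws joins the split values into one comma-separated string, so the combined array above still has three elements per row. On Spark 2.4+, the SQL higher-order function transform can keep each split value as its own element; a sketch against the same student view:

spark.sql("""
  select StudentID, School, Type, Group,
         concat(array(concat('School: ', School), concat('Type: ', Type)),
                transform(split(Group, '-'), x -> concat('Group: ', x))) as combined_array
  from student
""").show(false)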

You need to first add an empty column then map it like so (in Java):

// needs: org.apache.spark.api.java.function.MapFunction, RowFactory, DataTypes,
// and org.apache.spark.sql.catalyst.encoders.RowEncoder
StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
        .map((MapFunction<Row, Row>) row ->
            RowFactory.create(...values...) // add existing values and the new value here
        , RowEncoder.apply(newSchema));     // map expects an Encoder, not a bare schema

It should be fairly similar in Scala.
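
A rough Scala sketch of the same row-map idea (column names taken from the question; RowEncoder supplies the Encoder that map requires):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val newSchema = StructType(df1.schema.fields :+
  StructField("CombinedArray", ArrayType(StringType)))

val df2 = df1.map { row =>
  // prefix each value with its column label; split 'Group' on '-'
  val combined = Seq(s"School: ${row.getAs[String]("School")}",
                     s"Type: ${row.getAs[String]("Type")}") ++
    row.getAs[String]("Group").split("-").map(g => s"Group: $g")
  Row.fromSeq(row.toSeq :+ combined)
}(RowEncoder(newSchema))

Mapping over rows like this is usually slower than the built-in column functions, so it mainly makes sense when the per-row logic cannot be expressed with them.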

Comments
  • Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but I am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))
  • The column will be fed to CountVectorizer, so each entry of the array (category: value) will be identified differently. For instance, the same value may be present across different categories.
  • Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)
  • In case you do not want to wrangle so much with String-concats in order to put identifying prefixes on the different types of information (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and apply 3 CountVectorizers, one for each of the "XYZArrays", and a final VectorAssembler to put it all together. This version has the benefit that you can define different minimum frequencies for each of the CountVectorizers.
  • This would probably work. I will need to modify it a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.
  • Reason for unaccepting the answer? It works as per the expected output you mentioned.
  • I will accept it once I am able to work on your code to get the exact solution. The split needs to be done for one column only: 'Group', not for all the columns.
  • The below code would be the accurate answer:

    var df2 = df.withColumn("CombinedArray", array(columns.map { colName =>
      colName match {
        case "Group" => regexp_replace(regexp_replace(df(colName), "(^)", s"$colName: "), "(-)", s", $colName: ")
        case _       => regexp_replace(df(colName), "(^)", s"$colName: ")
      }
    }: _*))
  • This solution doesn't address the core issue where the 'Group' values need to be split dynamically
  • Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.
  • @John Subas.. could you please check Update2?