PySpark: get the distinct elements of list values

I have an RDD in this form:

rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])

but I want to transform it like below:

newrdd = [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]

That is, I need the distinct elements of each value list. reduceByKey() doesn't help here.

How can I achieve this?

Since Spark 2.4 you can use the PySpark SQL function array_distinct:

from pyspark.sql.functions import array_distinct, col

df = rdd.toDF(["category", "values"])
df.withColumn("foo", array_distinct(col("values"))).show()
+--------+-------------------+----------------+
|category|             values|             foo|
+--------+-------------------+----------------+
|       A| [1, 2, 4, 1, 2, 5]|    [1, 2, 4, 5]|
|       B|[2, 3, 2, 1, 5, 10]|[2, 3, 1, 5, 10]|
|       C|[3, 2, 5, 10, 5, 2]|   [3, 2, 5, 10]|
+--------+-------------------+----------------+

It has the advantage of not converting the JVM objects to Python objects, so it is more efficient than any Python UDF. Note that it is a DataFrame function, so you must first convert the RDD to a DataFrame, which is the recommended approach in most cases anyway.
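
If you need the result back as an RDD of (key, list) tuples like the asker's newrdd, you can convert after applying array_distinct. A minimal sketch, reusing the df built above:

from pyspark.sql.functions import array_distinct, col

# replace the array column with its deduplicated version,
# then turn the Row objects back into plain (key, list) tuples
result = (df.withColumn("values", array_distinct(col("values")))
            .rdd
            .map(lambda row: (row["category"], row["values"])))

result.collect()
# [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]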

Here is a direct way to get the result in Python. Note that RDDs are immutable, so the map below returns a new RDD rather than modifying the original.

Setup Spark Session/Context

from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder \
            .master("local") \
            .appName("SO Solution") \
            .getOrCreate()

sc = spark.sparkContext

Solution Code

rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])

newrdd = rdd.map(lambda x: (x[0], list(set(x[1]))))

newrdd.collect()

Output

[('A', [1, 2, 4, 5]), ('B', [1, 2, 3, 5, 10]), ('C', [10, 2, 3, 5])]
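
Note that set() does not preserve element order, which is why this output is ordered differently from the asker's expected newrdd. If the order of first occurrence matters, one option (my addition, not part of the original answer) is dict.fromkeys, which keeps insertion order on Python 3.7+:

# dict.fromkeys deduplicates while keeping first-occurrence order (Python 3.7+)
ordered = rdd.map(lambda x: (x[0], list(dict.fromkeys(x[1]))))
ordered.collect()
# [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]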

You can convert the array to a set to get the distinct values. Here is how; note that I have switched the syntax to Scala.

    val spark : SparkSession = SparkSession.builder
      .appName("Test")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    val df = spark.createDataset(List(("A", Array(1, 2, 4, 1, 2, 5)), ("B", Array(2, 3, 2, 1, 5, 10)), ("C", Array(3, 2, 5, 10, 5, 2))))
    df.show()

    val dfDistinct = df.map(r => (r._1, r._2.toSet))
    dfDistinct.show()
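
For comparison, the pre-2.4 PySpark way to do this map-to-set conversion on a DataFrame would be a Python UDF. A rough sketch (this is exactly the kind of UDF that array_distinct outperforms, since every array is serialized to a Python worker and back):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType

# deduplicate each array in a Python worker; element order is not guaranteed
dedup = udf(lambda xs: list(set(xs)), ArrayType(IntegerType()))

df.withColumn("foo", dedup(col("values"))).show()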

If you just need the distinct values of a single DataFrame column, this should help: df.select('column1').distinct().collect(). Note that .collect() has no built-in limit on how many values it can return, so it might be slow; use .show() instead, or add .limit(20) before .collect() to manage this.
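
Applied to the example DataFrame built earlier in this thread:

# distinct values of one column; row order in the result is not deterministic
df.select("category").distinct().show()

# or pull them into a Python list on the driver
categories = [row["category"] for row in df.select("category").distinct().collect()]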

old_rdd = [('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])]
new_rdd = [(letter, set(numbers)) for letter, numbers in old_rdd]

Like this?

Or list(set(numbers)) if you really need them to be lists?
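
On the actual RDD rather than a plain Python list, the same pattern is a one-liner with mapValues, which applies the function to the value of each pair and, unlike map, preserves any existing partitioner:

# mapValues touches only the value of each (key, value) pair
newrdd = rdd.mapValues(lambda numbers: list(set(numbers)))
newrdd.collect()  # same result as the map() version above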

Related: to get the distinct values of a column in PySpark, use the distinct() function (dropDuplicates() works too). countDistinct() returns the number of distinct elements in one or more columns, while count() returns the total number of elements in a column. For example, on a df with department and salary columns:

df2 = df.select(countDistinct("department", "salary"))
df2.show(truncate=False)
print("Distinct Count of Department & Salary: " + str(df2.collect()[0][0]))
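
Against the example DataFrame from this thread (an illustration, not part of the quoted snippet):

from pyspark.sql.functions import countDistinct

# three distinct keys in the example data: A, B, and C
df.select(countDistinct("category")).show()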
