Keep only the duplicate rows of a DataFrame with respect to some fields

I have this Spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE|     2|null|    08:58:00|    23:29:00|
|TDR|  QWA|     3|null|    08:57:00|    23:28:00|
|ALT| TEST|     4|null|    08:56:00|    23:27:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
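
(For anyone who wants to reproduce this: the example data can be created roughly as below, assuming an active SparkSession named spark; the exact schema is my guess from the printed output.)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('ID', StringType()),
    StructField('ID2', StringType()),
    StructField('Number', IntegerType()),
    StructField('Name', StringType()),
    StructField('Opening_Hour', StringType()),
    StructField('Closing_Hour', StringType()),
])
df = spark.createDataFrame([
    ('ALT', 'QWA', 6, None, '08:59:00', '23:30:00'),
    ('ALT', 'AUTRE', 2, None, '08:58:00', '23:29:00'),
    ('TDR', 'QWA', 3, None, '08:57:00', '23:28:00'),
    ('ALT', 'TEST', 4, None, '08:56:00', '23:27:00'),
    ('ALT', 'QWA', 6, None, '08:55:00', '23:26:00'),
    ('ALT', 'QWA', 2, None, '08:54:00', '23:25:00'),
    ('ALT', 'QWA', 2, None, '08:53:00', '23:24:00'),
], schema)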

I want to get a new DataFrame containing only the rows that are not unique with respect to the three fields "ID", "ID2", and "Number".

That is, I want this DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

Or perhaps a DataFrame with all of the duplicate rows:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

One way to do this is to use a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number") combination, and then select only the rows where that count is greater than 1.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('ID', 'ID2', 'Number')
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     2|null|    08:53:00|    23:24:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#|ALT|QWA|     6|null|    08:55:00|    23:26:00|
#+---+---+------+----+------------+------------+

I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).

If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could do so by using another Window to order the rows.

For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees exactly one row per group. Note that because the ordering here uses the same columns as the partition, which of the duplicate rows survives is arbitrary; a variant that picks a specific row is shown after the output below.

w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#+---+---+------+----+------------+------------+
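
If it matters which row is kept, say you want the one with the latest Opening_Hour from each duplicated group (just an illustrative choice, nothing in the question asks for it), order the second Window by that column instead:

w3 = Window.partitionBy('ID', 'ID2', 'Number').orderBy(f.col('Opening_Hour').desc())
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w3).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()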

To extend on pault's really great answer: I often need to subset a DataFrame to only the entries that are repeated x times, and since I do this so often, I turned it into a function that I import together with lots of other helper functions at the beginning of my scripts:

import pyspark.sql.functions as f
from pyspark.sql import Window
def get_entries_with_frequency(df, cols, num):
    # accept a single column name as well as a list of column names
    if isinstance(cols, str):
        cols = [cols]
    w = Window.partitionBy(cols)
    # count the rows in each group and keep only the groups with exactly `num` rows
    return df.select('*', f.count(cols[0]).over(w).alias('dupeCount'))\
             .where("dupeCount = {}".format(num))\
             .drop('dupeCount')
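
For example, on the DataFrame from the question, the rows whose ("ID", "ID2", "Number") combination occurs exactly twice (which here is every duplicated row) can be pulled out with:

get_entries_with_frequency(df, ['ID', 'ID2', 'Number'], 2).show()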

Here is a way to do it without a Window.

A DataFrame with the duplicates: drop_duplicates keeps one row per ("ID", "ID2", "Number") combination, and exceptAll (available since Spark 2.4) subtracts those kept rows from the original, leaving the remaining occurrences.

df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+
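
Another Window-free option, if one representative row per duplicated key is enough, is a plain groupBy aggregation that keeps, say, the first value of the remaining columns (just a sketch, any aggregate would do):

import pyspark.sql.functions as f

df.groupBy('ID', 'ID2', 'Number')\
  .agg(f.count(f.lit(1)).alias('dupeCount'),
       f.first('Opening_Hour').alias('Opening_Hour'),
       f.first('Closing_Hour').alias('Closing_Hour'))\
  .where('dupeCount > 1')\
  .drop('dupeCount')\
  .show()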

A DataFrame with all duplicates (using a left_anti join): groupBy/count finds the ("ID", "ID2", "Number") combinations that occur exactly once, and the left_anti join drops those rows, keeping every row whose combination appears more than once.

df.join(df.groupBy('ID', 'ID2', 'Number')\
          .count().where('count = 1').drop('count'),
        on=['ID', 'ID2', 'Number'],
        how='left_anti').show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:54:00|    23:25:00|
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:59:00|    23:30:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+
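
The same result can also be reached the other way around: keep the combinations that occur more than once and use a left_semi join (a minor variant of the approach above):

dupes = df.groupBy('ID', 'ID2', 'Number').count().where('count > 1').drop('count')
df.join(dupes, on=['ID', 'ID2', 'Number'], how='left_semi').show()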
