Pandas: Get duplicated indexes


Given a dataframe, I want to find the duplicated indexes whose rows do not have identical values in the columns, and see which values differ.

Specifically, I have this dataframe:

# First download the data (shell command, not Python):
#   wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed

import pandas as pd
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.

To simplify even further, suppose we only have the index and the repeat type:

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

The output I'd like to see is all duplicate indexes and their repeat types, like this:

genome_location1    MIR3
genome_location1    AluJb

EDIT: added toy example
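
For reference, the toy frame can be built directly (a minimal sketch; the column name 'type' matches the answers below, and the index labels are the genome locations):

import pandas as pd

# toy version of the real data: index = genome location, one column
# holding the repeat type (column name 'type' as used in the answers)
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'],
)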

df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.
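
For instance, on the toy frame built above, both routes give the same two rows (a minimal sketch; assumes pandas 0.13+ per the note below):

# filter: keep every group whose index label occurs more than once
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

# masking + transform equivalent: mark rows by the size of their group
df[df.groupby(level=0)['type'].transform(len) > 1]['type']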

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue (and a related issue with transform on Series) was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try it on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applicable to nonunique indexes; they were just poorly implemented at first.


Also useful and very succinct:

df[df.index.duplicated()]

Note that this omits the first occurrence of each duplicated index, so to see all the duplicated rows you'll want this:

df[df.index.duplicated(keep=False)]
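
On the toy frame the difference looks like this (a sketch; expected rows shown in comments):

df[df.index.duplicated()]            # genome_location1  AluJb  (first occurrence dropped)
df[df.index.duplicated(keep=False)]  # genome_location1  MIR3 and genome_location1  AluJb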


Even faster and better (returns a sorted list of the index labels that appear more than once):

df.index.get_duplicates()

To drop the duplicated rows instead (keeping only the first occurrence of each index), use the duplicated method on the pandas Index itself:

df3 = df3.loc[~df3.index.duplicated(keep='first')]

While all the other methods work, the currently accepted answer is by far the least performant for the provided example.

>>> df[df.groupby(level=0).transform(len)['type'] > 1]
                   type
genome_location1   MIR3
genome_location1  AluJb
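
On a recent pandas, the same mask can be built with the cythonized 'size' transform instead of calling len on each group, which is usually faster (a sketch; assumes a pandas version that supports string transforms):

# 'size' computes group sizes without a Python-level len call per group
df[df.groupby(level=0)['type'].transform('size') > 1]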


As of 9/21/18, pandas emits FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:

df.index[df.index.duplicated()].unique()
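
Those labels can then be used to pull every row at the duplicated locations (a sketch on the toy frame from above):

dup_labels = df.index[df.index.duplicated()].unique()
df.loc[dup_labels]   # both genome_location1 rows, MIR3 and AluJb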


With pandas version 0.17, you can set keep=False in the duplicated function to get all the duplicate items:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]:
   0
0  a
1  b
2  c
3  d
4  a
5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]:
   0
0  a
1  b
4  a
5  b



Comments
  • Hi, usually it's good practice to simplify the question as much as possible and create a toy example with input and desired output. Such questions are answered much faster and will be useful for future readers.
  • This one is not working for me; I've even tried df.groupby(level=0).filter(lambda x: True) and I receive Exception: Reindexing only valid with uniquely valued Index objects.
  • Good catch! This particular use encounters a bug that was fixed for v0.13, which obviously many users do not have. Answer updated.
  • Thanks! I'm still on 0.12 and will stick with it until v0.13 is fully released, because I'm sharing a codebase and virtualenv messes everything up for me. I'll switch to this once we upgrade! I've been using pandas for a year but I'm still wrapping my mind around groupbys.
  • When looking at efficiency, I doubt this is faster than df.set_index('type').index.duplicates, since it has to iterate through every single group instead of looking at it "from outside the groups".
  • I didn't know about index.duplicates. Add it as an answer; that's definitely better.
  • Good answer, but it comes with a FutureWarning for pandas 0.23.4: 'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead.
  • or, very similarly, with filter like so: df.groupby(level=0).filter(lambda x: len(x) > 1)['type']. Will be faster than transforming and masking.
  • @DanAllan great, maybe you could add another answer and the OP will accept it?
  • Thanks! I'll accept this answer since it's guaranteed to work with the current release of pandas.
  • I think that's the right choice. Useful conversation, all around.