python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it's designed for exact duplicates, so there's no mention of criteria for selecting which rows get kept.

I'm guessing there's probably an easy way to do this---maybe as easy as sorting the dataframe before dropping duplicates---but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

This keeps the last occurrence, which is not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
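A variant of the same idxmax idea that avoids apply entirely: compute the winning index label per group, then pull those rows back out with .loc. A minimal, self-contained sketch using the question's data:

```python
import pandas as pd

# The example frame from the question
df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# idxmax() returns, per group in A, the index label of the row where B
# is largest; df.loc then selects those whole rows.
result = df.loc[df.groupby('A')['B'].idxmax()]
print(result)
#    A   B
# 1  1  20
# 3  2  40
# 4  3  10
```

Unlike apply with a lambda, this stays vectorized and keeps the original index labels.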

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by the key column and take the max of the column you need: df.groupby('A', as_index=False).max()
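One caveat with the groupby-max route: if the frame has columns beyond A and B, each column is maximized independently and you cannot keep whole rows. A sketch of a mask-based alternative that does keep whole rows (the extra column C here is hypothetical, added for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 10],
                   'C': ['x', 'y', 'z', 'w', 'v']})

# transform('max') broadcasts each group's max of B back onto every row,
# so a simple comparison keeps the full winning rows -- including extra
# columns like C, and including ties for the maximum.
result = df[df['B'] == df.groupby('A')['B'].transform('max')]
print(result)
#    A   B  C
# 1  1  20  y
# 3  2  40  w
# 4  3  10  v
```

Note that ties are all kept here; chain .drop_duplicates('A') afterwards if exactly one row per group is required.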

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
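Keep in mind that keep='last' keeps whichever duplicate happens to appear last, not the largest one; for the original question you would sort by B first. A short sketch showing the difference:

```python
import pandas as pd

# Here the larger B values come FIRST within each duplicate group
df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [20, 10, 40, 30, 10]})

# Without sorting, keep='last' would pick B=10 and B=30.
# Sorting by B ascending first makes the last occurrence the maximum.
result = df.sort_values('B').drop_duplicates('A', keep='last').sort_index()
print(result)
#    A   B
# 0  1  20
# 2  2  40
# 4  3  10
```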

Try this:

df.groupby(['A']).max()
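A caution worth noting (a small sketch with a hypothetical extra column C): groupby().max() maximizes every column independently, so with more than two columns the result can mix values taken from different rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [10, 20], 'C': [9, 5]})

# B's max comes from the second row, C's max from the first:
# the output row (B=20, C=9) never existed in the input.
result = df.groupby('A').max()
print(result)
#     B  C
# A
# 1  20  9
```

If you need the intact row with the highest B, prefer the idxmax or sort-then-drop approaches above.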

You can try this as well:

df.drop_duplicates(subset='A', keep='last')

I referred to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Comments
  • Note that the URL in the question appears to be dead.
  • For an idiomatic and performant way, see this solution below.
  • Small note: The cols and take_last parameters are deprecated and have been replaced by the subset and keep parameters. pandas.pydata.org/pandas-docs/version/0.17.1/generated/…
  • as @Jezzamon says, FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
  • Is there a reason not to use df.sort_values(by=['B']).drop_duplicates(subset=['A'], keep='last')? I mean this sort_values seems safe to me but I have no idea if it actually is.
  • This answer is now obsolete. See @Ted Petrou's answer below.
  • If you want to use this code with more than one column in the groupby, you can add .reset_index(drop=True): df.groupby(['A','C'], group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True). This resets the index, since its default value would be a MultiIndex composed of 'A' and 'C'.
  • This is actually a clever approach. I was wondering if it can be generalized by using some lambda function while dropping. For example, how can I drop only values less than, say, the average of those duplicate values?
  • Best solution. Thanks.
  • Glad to help. @Flavio
  • My data frame has 10 columns, and I used this code to delete duplicates from three columns. However, it deleted rows based on the rest of the columns as well. Is there any way to delete the duplicates based only on the last 4 columns?
  • D'you know the best idiom to reindex this to look like the original DataFrame? I was trying to figure that out when you ninja'd me. :^)
  • Neat. What if the dataframe contains more columns (e.g. C, D, E)? Max doesn't seem to work in that case, because we need to specify that B is the only column that needs to be maximized.