Remove reversed duplicates from a data frame

Can anyone suggest a good solution to remove reversed duplicates from a data frame?

My data looks like this, where the first and second columns are reversed duplicates:

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377
TRINITY_DN16813_c0_g1_i4    TRINITY_DN16813_c0_g1_i3    96.104  231 9   0   190 420 429 199 2.979999999999999e-104  377

I need to keep only the one row where the third column has the higher value:

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377

These are the results when I use series.isin():

TRINITY_DN28139_c0_g1_i2    TRINITY_DN28139_c0_g1_i5    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN28139_c0_g1_i5    TRINITY_DN28139_c0_g1_i2    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN25313_c0_g1_i6    TRINITY_DN25313_c0_g1_i5    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25313_c0_g1_i5    TRINITY_DN25313_c0_g1_i6    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25502_c0_g1_i3    TRINITY_DN25502_c0_g1_i4    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN25502_c0_g1_i4    TRINITY_DN25502_c0_g1_i3    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN28726_c0_g1_i2    TRINITY_DN28726_c0_g1_i1    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN28726_c0_g1_i1    TRINITY_DN28726_c0_g1_i2    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN27942_c0_g1_i7    TRINITY_DN27942_c0_g1_i6    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i1    TRINITY_DN25118_c0_g1_i2    99.964  2770    1   0   81  2850    204 2973    0.0 5110
TRINITY_DN27942_c0_g1_i6    TRINITY_DN27942_c0_g1_i7    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i2    TRINITY_DN25118_c0_g1_i1    99.964  2770    1   0   204 2973    81  2850    0.0 5110
TRINITY_DN28502_c1_g1_i9    TRINITY_DN28502_c1_g1_i7    99.963  2678    1   0   1928    4605    2021    4698    0.0 4940
TRINITY_DN28502_c1_g1_i7    TRINITY_DN28502_c1_g1_i9    99.963  2678    1   0   2021    4698    1928    4605    0.0 4940
TRINITY_DN25619_c0_g1_i1    TRINITY_DN25619_c0_g1_i8    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN25619_c0_g1_i8    TRINITY_DN25619_c0_g1_i1    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN23022_c0_g1_i5    TRINITY_DN23022_c0_g1_i1    99.962  2622    1   0   1   2622    1   2622    0.0 4837

Try this one. It's done entirely in pandas (so it should be faster). It also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Then keep only the row with the maximum column 2 value per pair:

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].str.split('-').str[0]
In [399]: dfd[1] = dfd['pair'].str.split('-').str[1]
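
Putting the pieces together, here is a minimal end-to-end sketch. The read_csv setup and the two-row sample (taken from the question's BLAST-style data) are my assumptions, not part of the original answer:

import io
import pandas as pd

# Two reversed-duplicate rows from the question (tab-separated, no header).
data = (
    "TRINITY_DN16813_c0_g1_i3\tTRINITY_DN16813_c0_g1_i4\t96.491\t228\t8\t0\t202\t429\t417\t190\t3.05e-104\t377\n"
    "TRINITY_DN16813_c0_g1_i4\tTRINITY_DN16813_c0_g1_i3\t96.104\t231\t9\t0\t190\t420\t429\t199\t2.98e-104\t377\n"
)
df = pd.read_csv(io.StringIO(data), sep="\t", header=None)

# Canonical key: both orderings of the two labels map to the same string.
df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# Per pair, keep the row whose column 2 (percent identity) is highest.
dfd = df.loc[df.groupby('pair')[2].idxmax()].drop('pair', axis=1)
print(dfd)  # only the 96.491 row remains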

Use series.isin() to find the entries that appear in both columns, then drop the duplicates:

df = df.sort_values('col3', ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]

Here col1 is the first column and col2 is the second.

Output:

0   TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.49   228 8   0   202 429 417 190 0.00    377
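
Note that df['col1'].isin(df['col2']) returns a boolean series, so drop_duplicates() keys on the True/False values rather than on the label pairs themselves, which can misbehave on larger inputs (see the comments below). A more defensive sketch keys on the unordered pair itself; the frozenset idea is my substitution, not part of the original answer:

import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'col2': ['b', 'a', 'd', 'c'],
    'col3': [96.5, 96.1, 99.0, 98.0],
})

# Sort so the higher col3 comes first, then keep the first occurrence of
# each unordered {col1, col2} pair; frozenset ignores the order.
df = df.sort_values('col3', ascending=False)
out = df[~df[['col1', 'col2']].apply(frozenset, axis=1).duplicated()]
print(out)  # keeps (c, d) with 99.0 and (a, b) with 96.5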

The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone would not work.

First, a list of label pairs is needed to compare against (forward in the code). Given that (a, b) is the same as (b, a), all instances are replaced by (a, b).

Then all duplicated labels are renamed in the order (a, b), even if the higher-scoring row is (b, a). This is necessary for the grouping step later.
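
Neither forward nor the function l is shown in the thread; a minimal reconstruction from the comments below (my assumption, not the author's exact code) might be:

# forward: each label pair stored once, in alphabetical order, so that
# (a, b) and (b, a) collapse to the same tuple.
forward = sorted({tuple(sorted(p)) for p in zip(df[0], df[1])})

# l: map a row's two labels to that canonical (a, b) ordering.
def l(x):
    return tuple(sorted((x[0], x[1])))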

In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)
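
The step that rewrites columns 0 and 1 themselves to the canonical order (the r list described in the comments) is also not shown; it is what lets the groupby([0, 1]) below see both rows of a pair. One possible equivalent (again my assumption):

# Overwrite the label columns with the canonical ordering stored in 'pair'.
df[0], df[1] = zip(*df['pair'])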

Then, to account for the value in column 2 (third from the left), the original data is grouped and the minimum of each group is kept. These will be the rows to remove.

In [297]: dfi = df.set_index(['pair',2])

In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index

In [299]: dfi['drop'] = dfi.index.isin(to_drop)

In [300]: dfr = dfi.reset_index()

Rows are dropped by the index number where the 'drop' column is True. The temporary 'drop' column is also removed.

In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)

In [302]: df_dropped
Out[302]:
                         0                         1       2    3   4   5    6    7    8    9              10   11
0  TRINITY_DN16813_c0_g1_i3  TRINITY_DN16813_c0_g1_i4  96.491  228   8   0  202  429  417  190  3.050000e-104  377

Comments
  • Is it always exactly two rows that are reverse duplicates? Can there be a third row where the col1 value is repeated but the col2 value is not? If not, check the answer below.
  • There are no more than two rows where the first and second columns are reversed duplicates, but there can be a third row where the col1 value is repeated and the col2 value is not, or the col2 value is repeated and the col1 value is not.
  • can you please post some more data so we can test and get a solution? thanks.
  • I added the results to my question when I use series.isin()
  • Very nice. +1 from me. :)
  • Thanks so much! I think 'col2' in the first line should be 'col3' to sort the values. However, the second line does not work: it doesn't drop the duplicates!
  • @aboozarsoorni Yes, it should be col3. Also, why is it not dropping duplicates? Is the data exactly the same or a little different?
  • The col3 values may be exactly the same or a little different.
  • Thanks so much! What is 'forward' in the 'r' list (the first line)?
  • I edited the answer. forward creates a list of unique label pairs, sorted alphabetically. So (TRINITY_DN16813_c0_g1_i3, TRINITY_DN16813_c0_g1_i4) and (TRINITY_DN16813_c0_g1_i4, TRINITY_DN16813_c0_g1_i3) will both be stored as just (TRINITY_DN16813_c0_g1_i3, TRINITY_DN16813_c0_g1_i4)
  • r creates a list of pairs used to modify columns 0 and 1 (the labels) if they are duplicated. So (TRINITY_DN16813_c0_g1_i3, TRINITY_DN16813_c0_g1_i4) and (TRINITY_DN16813_c0_g1_i4, TRINITY_DN16813_c0_g1_i3) will have the same labels (TRINITY_DN16813_c0_g1_i3, TRINITY_DN16813_c0_g1_i4) for columns 0 and 1, respectively
  • By the way, I just realized from looking at your data that you are doing a reciprocal blast on assembled sequences.
  • Thanks, great! The edited code works very well! Only a small change: dfd = df.sort_values(2).drop_duplicates('pair').drop('pair', axis=1)