Pandas: identifying repeated values across subsequent columns and keeping first occurrences


I know how to get rid of duplicate rows in pandas, but my problem is slightly different. Suppose I have a dataframe like this:

product  from    stop_1        stop_2  stop_3  stop_4 stop_5 stop_6  stop_7
metal    Portugal Spain        France  Ukraine Spain  France Ukraine Spain
fruit    Spain    France       Italy
dairy    Italy    Switzerland  Italy   Switzerland

This is what I want to obtain:

product  from    stop_1   stop_2  stop_3  stop_4 stop_5 stop_6  stop_7
metal    Portugal Spain   France  Ukraine 
fruit    Spain    France  Italy
dairy    Italy    Switzerland  

How could I get this?


Use mask with a row-wise duplicated:

df.mask(df.apply(lambda x: x.duplicated(), axis=1))
Out[443]: 
  product      from       stop_1  stop_2   stop_3 stop_4 stop_5 stop_6 stop_7
0   metal  Portugal        Spain  France  Ukraine    NaN    NaN    NaN    NaN
1   fruit     Spain       France   Italy      NaN    NaN    NaN    NaN    NaN
2   dairy     Italy  Switzerland     NaN      NaN    NaN    NaN    NaN    NaN
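For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same mask/duplicated idea. The frame is rebuilt by hand from the question (the names `cols`, `rows`, `dup` and `result` are mine, not from the answer), and missing stops are assumed to be NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; empty stops are assumed to be NaN.
cols = ["product", "from"] + [f"stop_{i}" for i in range(1, 8)]
rows = [
    ["metal", "Portugal", "Spain", "France", "Ukraine",
     "Spain", "France", "Ukraine", "Spain"],
    ["fruit", "Spain", "France", "Italy"] + [np.nan] * 5,
    ["dairy", "Italy", "Switzerland", "Italy", "Switzerland"] + [np.nan] * 4,
]
df = pd.DataFrame(rows, columns=cols)

# For each row, flag every value that already appeared further left.
dup = df.apply(lambda row: row.duplicated(), axis=1)

# Blank out the flagged cells; first occurrences stay in their own column.
result = df.mask(dup)
print(result)
```

Note that the comparison runs over the whole row, so `product` and `from` take part in it as well. With this data that is harmless, but it is worth keeping in mind if a product name could ever equal a place name.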



You can use drop_duplicates on each row, then reindex to restore the original columns:

In [417]: df.apply(pd.Series.drop_duplicates, axis=1).reindex(columns=df.columns)
Out[417]:
  product      from       stop_1  stop_2   stop_3  stop_4  stop_5  stop_6  stop_7
0   metal  Portugal        Spain  France  Ukraine     NaN     NaN     NaN     NaN
1   fruit     Spain       France   Italy      NaN     NaN     NaN     NaN     NaN
2   dairy     Italy  Switzerland     NaN      NaN     NaN     NaN     NaN     NaN
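A runnable sketch of the drop_duplicates/reindex approach, again with the frame rebuilt by hand from the question (variable names are mine, and empty stops are assumed to be NaN). Each row comes back with only its surviving labels, so the combined result can lose columns that ended up empty; the reindex puts them back in the original order.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; empty stops are assumed to be NaN.
cols = ["product", "from"] + [f"stop_{i}" for i in range(1, 8)]
rows = [
    ["metal", "Portugal", "Spain", "France", "Ukraine",
     "Spain", "France", "Ukraine", "Spain"],
    ["fruit", "Spain", "France", "Italy"] + [np.nan] * 5,
    ["dairy", "Italy", "Switzerland", "Italy", "Switzerland"] + [np.nan] * 4,
]
df = pd.DataFrame(rows, columns=cols)

# Drop repeated values within each row, keeping the first occurrence
# under its original column label.
deduped = df.apply(pd.Series.drop_duplicates, axis=1)

# Restore the full set of columns in the original order.
result = deduped.reindex(columns=df.columns)
print(result)
```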



Here is what I came up with:

df
Out[42]: 
  product      from       stop_1  stop_2  ...   stop_4  stop_5   stop_6 stop_7
0   metal  Portugal        Spain  France  ...    Spain  France  Ukraine  Spain
1   fruit     Spain       France   Italy  ...      NaN     NaN      NaN    NaN
2   dairy     Italy  Switzerland   Italy  ...      NaN     NaN      NaN    NaN

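# save the column names first
colnames = list(df.columns)
# keep only the unique values of each row, in order of first appearance
df1 = pd.DataFrame([row.unique() for _, row in df.iterrows()])
# restore the column names (only as many as df1 has columns)
df1.columns = colnames[0:len(df1.columns)]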

df1
Out[46]: 
  product      from       stop_1  stop_2   stop_3
0   metal  Portugal        Spain  France  Ukraine
1   fruit     Spain       France   Italy      NaN
2   dairy     Italy  Switzerland     NaN     None
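The result above has lost the stop_4 to stop_7 columns that the question's desired output still shows. A possible way to get them back, sketched under the same assumptions as before (hand-built frame, NaN for empty stops, my own variable names), is to reindex against the saved column names:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; empty stops are assumed to be NaN.
cols = ["product", "from"] + [f"stop_{i}" for i in range(1, 8)]
rows = [
    ["metal", "Portugal", "Spain", "France", "Ukraine",
     "Spain", "France", "Ukraine", "Spain"],
    ["fruit", "Spain", "France", "Italy"] + [np.nan] * 5,
    ["dairy", "Italy", "Switzerland", "Italy", "Switzerland"] + [np.nan] * 4,
]
df = pd.DataFrame(rows, columns=cols)

# Unique values per row, in order of first appearance (rows may differ in length).
unique_rows = [list(row.unique()) for _, row in df.iterrows()]
df1 = pd.DataFrame(unique_rows)

# Name the columns that exist, then add the missing ones back as empty.
df1.columns = cols[: df1.shape[1]]
df1 = df1.reindex(columns=cols)
print(df1)
```

One difference from the mask-based answer: unique() packs the surviving values to the left. For this data the result is the same, but a row such as A, B, A, C would come out as A, B, C in the first three columns rather than leaving a gap where the repeated A was.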
