Select rows in a pandas data frame based on multiple column conditions

I have a dataset as below:

id     date  time  domain       activity
1  20thdec     2  amazon  add to basket
1  21stdec     2  amazon   product view
1  21stdec     3  amazon  add to basket
1  21stdec     4  amazon  add to basket
2  21stdec     4  amazon  add to basket
2  21stdec     6  amazon  add to basket 
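
For reference, here is a minimal reconstruction of the frame above (a sketch; values taken from the table shown):

import pandas as pd

df = pd.DataFrame({
    'id':       [1, 1, 1, 1, 2, 2],
    'date':     ['20thdec', '21stdec', '21stdec', '21stdec', '21stdec', '21stdec'],
    'time':     [2, 2, 3, 4, 4, 6],
    'domain':   ['amazon'] * 6,
    'activity': ['add to basket', 'product view', 'add to basket',
                 'add to basket', 'add to basket', 'add to basket'],
})
# note: the outputs in the answers below skip index 4, suggesting the
# original file also had an empty row (hence the dropna in the first answer)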

How do I get rid of the rows whose id has only a single distinct value in the activity column? (i.e. for id = 2 the only activity is "add to basket".) I still want to preserve id 1 as-is: it has multiple (duplicate) "add to basket" values, but it has other activity as well.

I tried df.drop_duplicates, however that doesn't solve the problem.

Edit: none of the solutions below works; I need the output as below:

id     date  time  domain       activity
1  20thdec     2  amazon  add to basket
1  21stdec     2  amazon   product view
1  21stdec     3  amazon  add to basket
1  21stdec     4  amazon  add to basket

The data for id = 2 should be deleted entirely: all of its activity, irrespective of date/time, contains nothing but "add to basket". All such single-activity rows should be deleted, keeping only the rows of ids that have multiple activities listed, e.g. id = 1 has 2 levels of activity ("product view" and "add to basket").

Apologies if it caused any misinterpretation

Thanks

IIUC, use groupby + transform with nunique, calculate cumsum for the values not equal to (ne) 1, and then use drop_duplicates with the subset parameter:

df.dropna(how='all', inplace=True)   # drop fully empty rows, if any
cols = df.columns                    # remember the original columns

# number of distinct activities per id
df['Unique'] = df.groupby('id')['activity'].transform('nunique')

# rows belonging to ids with more than one distinct activity
mask = df['Unique'].ne(1)

# give each of those rows its own 'Unique' value so the
# drop_duplicates below can never collapse them
df.loc[mask, 'Unique'] = df.loc[mask, 'Unique'].cumsum()

df1 = df.drop_duplicates(subset=['activity', 'Unique'])[cols]

print(df1)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
5   2  21stdec     4  amazon  add to basket

Explanation:

print(df.groupby('id')['activity'].transform('nunique'))
0    2
1    2
2    2
3    2
5    1
6    1
Name: activity, dtype: int64

print(df['Unique'].ne(1))
0     True
1     True
2     True
3     True
5    False
6    False
Name: Unique, dtype: bool

# After the line df.loc[mask,'Unique'] = df.loc[mask,'Unique'].cumsum()
print(df['Unique'])
0    2
1    4
2    6
3    8
5    1
6    1
Name: Unique, dtype: int64
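
The cumsum is what protects id 1: after it runs, every masked row carries its own distinct Unique value, so no two of those rows can be duplicates on ['activity', 'Unique']. Only the rows left at Unique == 1 (the single-activity groups, here id 2) still share values and collapse to a single row in the final drop_duplicates.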

I think you need transform with nunique, filtered by ne(1), which keeps the groups that have more than one unique activity:

print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
4   2  21stdec     4  amazon  add to basket
5   2  21stdec     6  amazon  add to basket
6   3  21stdec     6  amazon  add to basket

df = df[df.groupby('id')['activity'].transform('nunique').ne(1)]
print (df)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
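
An equivalent form, if you prefer groupby(...).filter(...) (typically slower than transform on large frames, but arguably more readable):

# keep only the ids whose activity column has more than one distinct value
df = df.groupby('id').filter(lambda g: g['activity'].nunique() > 1)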

Other solutions remove only the groups that are fully duplicated on the id and activity columns, so unique rows are not removed:

idx = df.loc[~df.duplicated(['id','activity'], keep=False), 'id'].unique()
df = df[df['id'].isin(idx)]

Or:

df = df[~df.duplicated(['id','activity'], keep=False).groupby(df['id']).transform('all')]

print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
6   3  21stdec     6  amazon  add to basket
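
Broken into steps, the last one-liner reads:

# True where the (id, activity) pair occurs more than once in the frame
dup = df.duplicated(['id', 'activity'], keep=False)

# True for ids where every row is such a duplicate
all_dup = dup.groupby(df['id']).transform('all')

# keep the ids that have at least one non-duplicated row
df = df[~all_dup]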

You can specify the subset parameter in drop_duplicates:

dataset.drop_duplicates(subset=['id', 'activity'])
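
Note that this keeps the first row of every (id, activity) pair, so id 2 is reduced to a single row rather than removed entirely; for the edited requirement (drop single-activity ids completely) the nunique filter above is the one that produces the requested output.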

As far as I understood, you only want to drop the duplicates where id == 2. You can still use drop_duplicates, but you have to apply it with subset='activity' to only the rows of the dataframe that have id == 2, and then concat the result with the rows having id == 1:

df = pd.concat([df[df['id'] == 1], df[df['id'] == 2].drop_duplicates(subset='activity')])

gives

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
4   2  21stdec     4  amazon  add to basket
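
If the real data has more than two ids, the same idea can be written without hardcoding them (a sketch with the same behaviour: single-activity ids collapse to one row, mixed-activity ids are kept intact):

single = df.groupby('id')['activity'].transform('nunique').eq(1)

df = pd.concat([
    df[~single],                                            # mixed-activity ids: keep all rows
    df[single].drop_duplicates(subset=['id', 'activity'])   # single-activity ids: one row each
]).sort_index()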

Comments
  • Can you create a minimal, complete, and verifiable example?
  • df.drop_duplicates(['date','time','domain','activity']) doesn't work?
  • Dear OP, please accept jezrael's answer as he provided the correct result.
  • I might be overthinking this problem :). But still, the question is not clear.
  • @SandeepKadapa I have edited the question to be clearer, no longer a guessing game :-)
  • @NehaSharma df[df.groupby('id')['activity'].transform('nunique').ne(1)] is working.
  • @SandeepKadapa Thanks, it does work.