Select rows in a pandas data frame based on multiple column conditions
I have a dataset as below:

```
id  date     time  domain  activity
1   20thdec  2     amazon  add to basket
1   21stdec  2     amazon  product view
1   21stdec  3     amazon  add to basket
1   21stdec  4     amazon  add to basket
2   21stdec  4     amazon  add to basket
2   21stdec  6     amazon  add to basket
```
How do I get rid of rows for an id whose activity column contains only one repeated value? For example, for id = 2 the only activity is add to basket. I still want to preserve the rows for id = 1, which has multiple (duplicate) add to basket values but has other activity as well. I tried df.drop_duplicates(), but that doesn't solve the problem.
Edit: None of the solutions below works. I need the output as below:

```
id  date     time  domain  activity
1   20thdec  2     amazon  add to basket
1   21stdec  2     amazon  product view
1   21stdec  3     amazon  add to basket
1   21stdec  4     amazon  add to basket
```
The data for id = 2 should be deleted: since all of its activity, irrespective of date/time, is nothing but add to basket, all such single-activity rows should be dropped. Only rows for ids with multiple activities should be kept, i.e. id = 1, which has two levels of activity (product view and add to basket).

Apologies if this caused any misinterpretation. Thanks.
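(For reference, a minimal sketch that reconstructs the six sample rows shown above as a DataFrame; the values are transcribed from the question's table.)

```python
import pandas as pd

# Sample data transcribed from the question's table
df = pd.DataFrame({
    'id':       [1, 1, 1, 1, 2, 2],
    'date':     ['20thdec', '21stdec', '21stdec', '21stdec', '21stdec', '21stdec'],
    'time':     [2, 2, 3, 4, 4, 6],
    'domain':   ['amazon'] * 6,
    'activity': ['add to basket', 'product view', 'add to basket',
                 'add to basket', 'add to basket', 'add to basket'],
})
```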
IIUC, use groupby + transform with 'nunique', calculate a cumsum for the values not equal to 1 (ne), and then use drop_duplicates with the subset parameter:
```python
df.dropna(how='all', inplace=True)
cols = df.columns
df['Unique'] = df.groupby('id')['activity'].transform('nunique')
mask = df['Unique'].ne(1)
df.loc[mask, 'Unique'] = df.loc[mask, 'Unique'].cumsum()
df1 = df.drop_duplicates(subset=['activity', 'Unique'])[cols]
print(df1)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
5   2  21stdec     4  amazon  add to basket
```
Explanation:

```python
print(df.groupby('id')['activity'].transform('nunique'))
0    2
1    2
2    2
3    2
5    1
6    1
Name: activity, dtype: int64

print(df['Unique'].ne(1))
0     True
1     True
2     True
3     True
5    False
6    False
Name: Unique, dtype: bool

# After the line df.loc[mask,'Unique'] = df.loc[mask,'Unique'].cumsum()
print(df['Unique'])
0    2
1    4
2    6
3    8
5    1
6    1
Name: Unique, dtype: int64
```

The cumsum gives every row of a multi-activity group a distinct Unique value, so drop_duplicates on ['activity', 'Unique'] leaves those rows intact while still collapsing each single-activity group to one row.
I think you need transform with 'nunique', filtered by ne(1), which returns only the groups that have more than one unique activity:
```python
print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
4   2  21stdec     4  amazon  add to basket
5   2  21stdec     6  amazon  add to basket
6   3  21stdec     6  amazon  add to basket

df = df[df.groupby('id')['activity'].transform('nunique').ne(1)]
print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
```
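An equivalent, arguably more readable way to express the same condition is groupby().filter(); this is a sketch, and filter() is typically slower than transform() on large frames:

```python
# Keep only the groups whose 'activity' column has more than one
# distinct value; filter() drops whole groups at once.
df = df.groupby('id').filter(lambda g: g['activity'].nunique() > 1)
```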
Other solutions remove only the groups that are duplicated across both the id and activity columns, so rows that are unique within their group (like the single row for id = 3) are not removed:
```python
idx = df.loc[~df.duplicated(['id','activity'], keep=False), 'id'].unique()
df = df[df['id'].isin(idx)]
```
Or:
```python
df = df[~df.duplicated(['id','activity'], keep=False).groupby(df['id']).transform('all')]
print (df)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
6   3  21stdec     6  amazon  add to basket
```
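To see why this works: duplicated(..., keep=False) marks every occurrence of a repeated (id, activity) pair as True, not just the later ones. A sketch on the seven-row frame printed above:

```python
# keep=False flags all occurrences of a duplicated (id, activity) pair,
# so ~mask isolates rows that are unique within their group.
mask = df.duplicated(['id', 'activity'], keep=False)
print(mask)
# 0     True   <- (1, add to basket) occurs three times
# 1    False   <- (1, product view) occurs once
# 2     True
# 3     True
# 4     True   <- (2, add to basket) occurs twice
# 5     True
# 6    False   <- (3, add to basket) occurs once
# dtype: bool
```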
You can specify a subset parameter in drop_duplicates:
```python
dataset.drop_duplicates(subset=['id', 'activity'])
```
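Applied to the question's sample frame (a sketch, assuming dataset holds those six rows), this keeps only the first occurrence of each (id, activity) pair:

```python
# subset restricts duplicate detection to the listed columns;
# keep='first' (the default) retains the earliest occurrence.
print(dataset.drop_duplicates(subset=['id', 'activity'], keep='first'))
#    id     date  time  domain       activity
# 0   1  20thdec     2  amazon  add to basket
# 1   1  21stdec     2  amazon   product view
# 4   2  21stdec     4  amazon  add to basket
```

Note that this also collapses the repeated add to basket rows for id 1, which is why the OP's edit says it does not produce the desired output.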
As far as I understood, you want to drop duplicates only where id == 2. You can still use drop_duplicates, but you have to apply it with subset='activity' to only those rows of the dataframe which have id == 2, and then concat the result with the rows having id == 1:
```python
df = pd.concat([df[df['id'] == 1], df[df['id'] == 2].drop_duplicates(subset='activity')])
```
which gives:

```
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
4   2  21stdec     4  amazon  add to basket
```
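Note that this hardcodes the two ids from the sample. A hedged generalization (assuming the rule is "collapse duplicates only in groups with a single distinct activity") could reuse the transform('nunique') idea:

```python
# Sketch, continuing from the sample frame above: keep all rows of
# multi-activity groups, collapse single-activity groups to one row each.
single = df.groupby('id')['activity'].transform('nunique').eq(1)
df = pd.concat([df[~single],
                df[single].drop_duplicates(subset=['id', 'activity'])])
```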
Comments
- Can you create a minimal, complete, and verifiable example? Doesn't `df.drop_duplicates(['date','time','domain','activity'])` work?
- Dear OP, please accept jezrael's answer, as he provided the correct result.
- I might be overthinking this problem :). But still, the question is not clear.
- @SandeepKadapa I have edited the question to be clearer. No longer a guessing game :-)
- @NehaSharma `df[df.groupby('id')['activity'].transform('nunique').ne(1)]` is working.
- @SandeepKadapa Thanks, it does work.