Filter DataFrame to Duplicated Items and Compute Groupwise Means on Result


Ok, so here is what I'm trying to do:

I have a DataFrame like this:

import pandas as pd

data = pd.DataFrame(
    {'a': [1, 1, 1, 2, 2, 3, 3, 3],
     'b': [23, 45, 62, 24, 45, 34, 25, 62]})

I managed to calculate the mean of column 'a' grouped by the column 'b' by using the following code:

data.groupby('b', as_index=False)['a'].mean()

which resulted in a DataFrame like this:

    b    a
0  23  1.0
1  24  2.0
2  25  3.0
3  34  3.0
4  45  1.5
5  62  2.0

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a DataFrame like this:

    b    a
0  45  1.5
1  62  2.0

I tried to do it by using the following line:

data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()

but it results in a single number, the mean of rows 1, 2, 4 and 7, which is obviously not what I want. Can someone please help me obtain the desired DataFrame and tell me what I'm getting wrong in my use of the filter function?

Thank you!


Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         
    b    a
0  45  1.5
1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        
0    False
1     True
2     True
3    False
4     True
5    False
6    False
7     True
Name: b, dtype: bool

>>> data[data.b.duplicated(keep=False)]                                                                                                                  
   a   b
1  1  45
2  1  62
4  2  45
7  3  62
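As a side note (not part of the original answer, just a sketch on the same data): the keep=False argument matters here. The default keep='first' leaves the first occurrence of each duplicated value unmarked, which would drop rows 1 and 2 from the groups:

>>> data['b'].duplicated()   # default keep='first'
0    False
1    False
2    False
3    False
4     True
5    False
6    False
7     True
Name: b, dtype: bool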
Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times in the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))
>>> vc
0    1
1    2
2    2
3    1
4    2
5    1
6    1
7    2
Name: b, dtype: int64

These are the element-wise counts for each element of b. To turn this into a mask (say you want means only where count == 2, which is the same as the example above, but could be extended to any integer threshold):

mask = vc == 2  # or > 5, in your case
data[mask].groupby('b', as_index=False)['a'].mean()
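If you prefer to stay inside groupby, an equivalent way to get those element-wise counts (just a sketch on the same data; transform('size') gives each row the size of its group) is:

>>> data.groupby('b')['b'].transform('size')
0    1
1    2
2    2
3    1
4    2
5    1
6    1
7    2
Name: b, dtype: int64

>>> data[data.groupby('b')['b'].transform('size') > 1].groupby('b', as_index=False)['a'].mean()
    b    a
0  45  1.5
1  62  2.0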



You can filter your dataframe via loc before the groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],
                   'b' : [23,45,62,24,45,34,25,62]})

counts = df['b'].value_counts()  # number of occurrences of each value in 'b'

# keep only rows whose 'b' value occurs more than once, then group and take the mean
res = df.loc[df['b'].isin(counts[counts > 1].index)] \
        .groupby('b', as_index=False)['a'].mean()

print(res)

    b    a
0  45  1.5
1  62  2.0
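The same pattern extends to any minimum number of occurrences; a minimal sketch, with n as a hypothetical threshold:

n = 5  # hypothetical threshold: keep values of 'b' that occur more than n times
res = df.loc[df['b'].isin(counts[counts > n].index)] \
        .groupby('b', as_index=False)['a'].mean()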



You were pretty close. The catch is that filter returns an ordinary (ungrouped) DataFrame, so chaining ['a'].mean() onto it just gives one overall mean of the surviving rows; you need to group by 'b' again after filtering:

data.groupby('b').filter(lambda g: len(g) > 1).groupby('b', as_index=False).mean()

This results in the answer you were looking for:

    b    a
0  45  1.5
1  62  2.0
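For comparison, your original one-liner collapses everything into a single number, because the filtered frame is no longer grouped when ['a'].mean() is applied (value computed from the sample data above):

>>> data.groupby('b', as_index=False).filter(lambda group: len(group) > 1)['a'].mean()
1.75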






