Advanced Pandas chaining: chain index.droplevel after groupby

I am trying to find the top 2 most frequent values in column2, grouped by column1.

Here is the dataframe:

import pandas as pd

# Group by id and take only the top 2 values.
df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})

I have done it without a single chained expression:

x = df.groupby('id')['value'].value_counts().groupby(level=0).nlargest(2).to_frame()
x.columns = ['count']
x.index = x.index.droplevel(0)
x = x.reset_index()
x

Result:

   id  value  count
0   1     30      4
1   1     20      3
2   2     40      3
3   2     10      2

Can we do this in one single chained operation?

So far I have this:

(df.groupby('id')['value']
 .value_counts()
 .groupby(level=0)
 .nlargest(2)
 .to_frame()
 .rename(columns={'value':'count'}))

Now I am stuck on how to drop the index level. How can I do all these operations in one single chain?

You could use apply and head without the second groupby:

df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1':'value'})

Output:

   id  value  count
0   1     30      4
1   1     20      3
2   2     40      3
3   2     10      2
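
The 'level_1' label in the rename above is the name pandas auto-generates for an unnamed inner index level; depending on the pandas version, that level may already be called 'value'. A minimal sketch that does not depend on the auto-generated label, assuming the same df as in the question, names both index levels explicitly with rename_axis before resetting the index:

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})

top2 = (
    df.groupby('id')['value']
      .apply(lambda s: s.value_counts().head(2))  # top 2 counts per id
      .rename_axis(['id', 'value'])               # name both index levels explicitly
      .reset_index(name='count')                  # move them to columns, call the counts 'count'
)
print(top2)
```

This should give the same four rows as the output above; only the way the 'value' column gets its name changes.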

Timings:

#This method

7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Groupby and groupby(level=0) with nlargest

12.9 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Try the below:

(df.groupby('id')['value']
 .value_counts()
 .groupby(level=0)
 .nlargest(2)
 .to_frame()
 .rename(columns={'value':'count'})
 .reset_index([1,2])
 .reset_index(drop=True))

Yet another solution:

df.groupby('id')['value'].value_counts().rename('count')\
    .groupby(level=0).nlargest(2).reset_index(level=[1, 2])\
    .reset_index(drop=True)
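
Since the question asks specifically about dropping the index level inside the chain, here is a hedged sketch using Series.droplevel (available from pandas 0.24). It assumes, as the solutions above do, that groupby(level=0).nlargest(2) prepends the group key and so yields a three-level index (id, id, value):

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})

top2 = (
    df.groupby('id')['value']
      .value_counts()             # Series indexed by (id, value)
      .groupby(level=0)
      .nlargest(2)                # assumed to add a duplicate leading 'id' level
      .droplevel(0)               # drop that duplicate level, back to (id, value)
      .reset_index(name='count')  # flatten to columns id, value, count
)
print(top2)
```

Dropping the duplicate level first matters because reset_index cannot insert two columns that are both named 'id'.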

Using the solution from @Scott Boston, I did some testing and also tried to avoid apply altogether. In my test, apply still came out about as fast as the NumPy version:

import numpy as np
import pandas as pd
from collections import Counter

np.random.seed(100)
df = pd.DataFrame({'id':np.random.randint(0,5,10000000), 
                    'value':np.random.randint(0,5,10000000)})

# df = pd.DataFrame({'id':[1,1,1,1,1,1,1,1,1,2,2,2,2,2], 
#                     'value':[20,20,20,30,30,30,30,40, 40,10, 10, 40,40,40]})


print(df.shape)
df.head()
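
Using apply:

# Note: %time on a line by itself times an empty statement, so the
# microsecond figures reported in this answer do not measure the pipelines.
# Putting %%time on the first line of the cell would time the whole cell.
%time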
df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1':'value'})

# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 6.2 µs

Without using apply at all:

%time
grouped = df.groupby('id')['value']

res = np.zeros([2,3],dtype=int)
for name, group in grouped:
  data = np.array(Counter(group.values).most_common(2))


  ids = np.ones([2,1],dtype=int) * name
  data = np.append(ids,data,axis=1)
  res = np.append(res,data,axis=0)

pd.DataFrame(res[2:], columns=['id','value','count'])
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 5.96 µs
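
Another way to avoid apply while staying in pandas, offered as a sketch and not benchmarked here: groupby(...).value_counts() already sorts the counts in descending order within each group, so taking the first two rows per id with head(2) should be enough. Unlike nlargest, head keeps the original (id, value) index, so no level has to be dropped. The small df from the question is used for illustration:

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})

top2 = (
    df.groupby('id')['value']
      .value_counts()             # counts sorted descending within each id
      .groupby(level='id')
      .head(2)                    # first 2 rows per id, index stays (id, value)
      .reset_index(name='count')
)
print(top2)
```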

Comments
  • Thanks. I am always wary of using apply, since it is said to be slow and best avoided, so I tend to use groupby instead. In this case, though, apply seems to be the winner.
  • @astro123 Yes, you are correct. I would refrain from using apply as well. However, in this case apply appears to be more performant than two groupbys.