Grouping a pandas DataFrame with predefined groups

How can I efficiently do something like a groupby when the groups are predefined and an element may belong to several groups at the same time?

Suppose I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])

   value
A      0
B      2
C      4

and I have the following predefined groups, which may overlap and differ in size:

groups = {'group 1': ['A', 'B'],
          'group 2': ['A', 'B', 'C']}

Now I want to apply a function to each of these groups. For example, I want to calculate the mean of value for each group.

I was thinking of creating an intermediate "expanded" DataFrame on which I could run a groupby:

intermediate_df = pd.DataFrame(columns=['id', 'group', 'value'])
intermediate_df['value'] = intermediate_df['value'].astype(float)

for group, members in groups.items():
    for id_ in members:
        row = pd.Series([id_, group, df.at[id_, 'value']],
                        index=['id', 'group', 'value'])
        # DataFrame.append was removed in pandas 2.0; this line requires an older pandas
        intermediate_df = intermediate_df.append(row, ignore_index=True)

  id    group  value
0  A  group 1    0.0
1  B  group 1    2.0
2  A  group 2    0.0
3  B  group 2    2.0
4  C  group 2    4.0

Then I could run

intermediate_df.groupby('group')[['value']].mean()

which would give me the desired result:

         value
group         
group 1    1.0
group 2    2.0

Of course, building the intermediate DataFrame row by row like this is very inefficient. What would be an efficient solution to my problem?


You can create your intermediate_df with pd.concat and a list comprehension:

intermediate_df = pd.concat([df.loc[v].assign(group=k) for k, v in groups.items()])

Output:

   value    group
A      0  group 1
B      2  group 1
A      0  group 2
B      2  group 2
C      4  group 2
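
To get from this intermediate frame to the per-group means, a groupby on the new group column is enough. A minimal end-to-end sketch with the question's data (the names pieces and result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'],
          'group 2': ['A', 'B', 'C']}

# One slice of df per group, each tagged with the name of its group
pieces = [df.loc[members].assign(group=name) for name, members in groups.items()]
intermediate_df = pd.concat(pieces)

# Rows that belong to several groups appear once per group, so a plain
# groupby on the tag column handles the overlap
result = intermediate_df.groupby('group')['value'].mean()
print(result)
```

Selecting ['value'] before calling mean() keeps the aggregation restricted to the numeric column.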



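Edit: for groups of unequal size, try:

pd.DataFrame.from_dict(groups, orient='index').T.stack().map(df.squeeze()).groupby(level=1).mean()

If all groups have the same size, this shorter form also works (with unequal sizes, pd.DataFrame(groups) raises a ValueError):

pd.DataFrame(groups).stack().map(df.squeeze()).groupby(level=1).mean()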

Output:

group 1    1.0
group 2    2.0
dtype: float64
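
The one-liner is easier to follow when split into steps. This is only a sketch of the same idea on the question's data: the names wide, members and values are illustrative, df['value'] stands in for df.squeeze(), and the explicit dropna() is an added safeguard in case stack() keeps the padding.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'],
          'group 2': ['A', 'B', 'C']}

# One row per group; shorter groups are padded with missing values
wide = pd.DataFrame.from_dict(groups, orient='index')

# Long format: a Series of member labels indexed by (position, group name);
# dropna() discards any padding that survived the stack
members = wide.T.stack().dropna()

# Look up each member's value in df, then average within each group
# (the group name is level 1 of the index)
values = members.map(df['value'])
result = values.groupby(level=1).mean()
print(result)
```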



Building on the previous answers, I build intermediate_df from a list comprehension and a merge:

intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members], 
                               columns=['group', 'id']).merge(df, left_on='id', right_index=True)

In my timings, this was the fastest of the approaches posted here:

import numpy as np
import pandas as pd

n = 10000
m = 1000
df = pd.DataFrame({'value': np.random.normal(size=n)}, index=np.arange(n).astype(str))
groups = {str(i): list(df.sample(5).index) for i in range(m)}

%%timeit
intermediate_df = pd.concat([df.loc[members].assign(group=group) for group, members in groups.items()])
intermediate_df.groupby('group').mean()

948 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.DataFrame(groups).stack().map(df.squeeze()).groupby(level=1).mean()

42.4 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members], 
                               columns=['group', 'id']).merge(df, left_on='id', right_index=True)
intermediate_df.groupby('group')['value'].mean()

6.13 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
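
%%timeit only works inside IPython or Jupyter. Outside of it, a comparison along the same lines can be sketched with the standard timeit module. The function names, the fixed seeds and the number of repetitions are my own choices, and absolute timings will differ between machines and pandas versions. The script also checks that the three approaches agree before timing them.

```python
import timeit

import numpy as np
import pandas as pd

n, m = 10_000, 1_000
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame({'value': rng.normal(size=n)}, index=np.arange(n).astype(str))
groups = {str(i): list(df.sample(5, random_state=i).index) for i in range(m)}


def via_concat():
    tmp = pd.concat([df.loc[members].assign(group=group)
                     for group, members in groups.items()])
    return tmp.groupby('group')['value'].mean()


def via_stack():
    # pd.DataFrame(groups) works here only because every group has 5 members
    return pd.DataFrame(groups).stack().map(df['value']).groupby(level=1).mean()


def via_merge():
    pairs = pd.DataFrame([[group, id_] for group, members in groups.items()
                          for id_ in members], columns=['group', 'id'])
    tmp = pairs.merge(df, left_on='id', right_index=True)
    return tmp.groupby('group')['value'].mean()


# The approaches should return the same means before timings are compared
reference = via_merge()
for func in (via_concat, via_stack):
    pd.testing.assert_series_equal(func(), reference, check_names=False)

for func in (via_concat, via_stack, via_merge):
    seconds = timeit.timeit(func, number=3) / 3
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```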
