## Pandas: Custom group-by function


I am looking for a custom group-by function that groups the rows such that:

- If there is a number and a 0, the result is the number.
- If there are two numbers (they will always be the same), the result is that number.
- If there is a NaN and a NaN, the result is NaN.
- If there is a number and a NaN, the result is the number.

An example to make things more clear:

```python
start_df = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3, 4, 4, 4, 5],
                         "foo": [4, 4, np.nan, 7, np.nan, np.nan, 0, 9, 9, 7],
                         "bar": [np.nan, np.nan, 0, 4, 0, 1, 6, 6, 0, 4]})
```

```
   id  foo  bar
0   1  4.0  NaN
1   1  4.0  NaN
2   2  NaN  0.0
3   2  7.0  4.0
4   3  NaN  0.0
5   3  NaN  1.0
6   4  0.0  6.0
7   4  9.0  6.0
8   4  9.0  0.0
9   5  7.0  4.0
```

After the custom group-by on `id`:

```python
result_df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                          "foo": [4, 7, np.nan, 9, 7],
                          "bar": [np.nan, 4, 1, 6, 4]})
```

```
   id  foo  bar
0   1  4.0  NaN
1   2  7.0  4.0
2   3  NaN  1.0
3   4  9.0  6.0
4   5  7.0  4.0
```

One solution that I am aware of is:

```python
start_df.groupby("id").max().reset_index()
```

But it is too slow for my case, since the DataFrame I am dealing with is huge. On the other hand, this alternative does not cover the edge case where both elements are numbers:

```python
start_df.groupby("id").sum(min_count=1).reset_index()
```

Looking forward to your help!
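For reference, a quick sketch (rebuilding the example frame from above) of why the `sum(min_count=1)` variant breaks on the duplicated-values case: equal numbers within a group get added together instead of being kept once.

```python
import numpy as np
import pandas as pd

start_df = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3, 4, 4, 4, 5],
                         "foo": [4, 4, np.nan, 7, np.nan, np.nan, 0, 9, 9, 7],
                         "bar": [np.nan, np.nan, 0, 4, 0, 1, 6, 6, 0, 4]})

summed = start_df.groupby("id").sum(min_count=1).reset_index()

# id == 1 has foo values 4 and 4: the sum is 8.0, not the desired 4.0,
# while min_count=1 does correctly keep the all-NaN group (id == 3) as NaN
print(summed.loc[summed["id"] == 1, "foo"].iloc[0])  # 8.0
```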

Maybe not what you would have thought, but this should work:

```python
start_df.groupby('id').max()
```

Use `reset_index()` if you want to bring `id` back into the columns.
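As a sanity check against the example in the question (a minimal sketch that rebuilds `start_df` and the desired `result_df` locally), `max` plus `reset_index` reproduces the expected frame exactly:

```python
import numpy as np
import pandas as pd

start_df = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3, 4, 4, 4, 5],
                         "foo": [4, 4, np.nan, 7, np.nan, np.nan, 0, 9, 9, 7],
                         "bar": [np.nan, np.nan, 0, 4, 0, 1, 6, 6, 0, 4]})
expected = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                         "foo": [4, 7, np.nan, 9, 7],
                         "bar": [np.nan, 4, 1, 6, 4]})

result = start_df.groupby("id").max().reset_index()

# max skips NaN by default, so a group yields NaN only when it is all NaN
pd.testing.assert_frame_equal(result, expected, check_dtype=False)
```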


I believe this is the solution you are looking for.

I have added another approach below. Specifying `as_index=False` in `groupby` keeps `id` as a column (the original row index is preserved by `GroupBy.nth`):

```python
>>> start_df.groupby('id', as_index=False).nth(1)
   id  foo  bar
1   1  4.0  NaN
3   2  7.0  4.0
5   3  NaN  1.0
7   4  9.0  6.0
```

OR

```python
>>> start_df.groupby(['id'], sort=False).max().reset_index()
   id  foo  bar
0   1  4.0  NaN
1   2  7.0  4.0
2   3  NaN  1.0
3   4  9.0  6.0
4   5  7.0  4.0
```
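One caveat worth noting (a small check under the example data, not part of the original answer): `nth(1)` only returns groups that actually have a second row, which is why `id == 5` does not appear in the `nth(1)` output above.

```python
import numpy as np
import pandas as pd

start_df = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3, 4, 4, 4, 5],
                         "foo": [4, 4, np.nan, 7, np.nan, np.nan, 0, 9, 9, 7],
                         "bar": [np.nan, np.nan, 0, 4, 0, 1, 6, 6, 0, 4]})

nth_rows = start_df.groupby("id", as_index=False).nth(1)

# id == 5 has a single row, so it has no "second" row and disappears
print(len(nth_rows))  # 4
```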


Here is another approach, not using `groupby`, though I can't tell if it is more efficient. The idea is to pad each `id` to the same number of rows, so that the data can be `reshape`d and reduced with `np.nanmax` over an axis. To do so, you can generate a dataframe filled with NaN for the missing rows:

```python
# count the rows of each id
s = start_df.id.value_counts()
nb_max = s.max()

# create the padding dataframe: NaN everywhere except the id column
df_nan = pd.DataFrame({col: np.nan if col != 'id'
                            else [ids for ids, val in zip(s.index, nb_max - s.values)
                                  for _ in range(val)]
                       for col in start_df.columns})

# pad, sort by id, reshape to (n_ids, nb_max, n_cols), and take nanmax over the rows
result_df = pd.DataFrame(
    np.nanmax(
        pd.concat([start_df, df_nan])[start_df.columns]
          .sort_values('id').values
          .reshape((-1, nb_max, start_df.shape[1])),
        axis=1),
    columns=start_df.columns)
```

Note: you get a warning saying some slices contain only NaN, but it works; the warning can be silenced.
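Regarding that warning: one way to silence it (a sketch using the standard `warnings` module, shown on a tiny standalone array rather than the full pipeline) is to wrap the `np.nanmax` call in a `catch_warnings` block:

```python
import warnings

import numpy as np

# second row's slice is all NaN, which normally triggers a RuntimeWarning
a = np.array([[1.0, np.nan],
              [np.nan, np.nan]])

with warnings.catch_warnings():
    # suppress "All-NaN slice encountered" from np.nanmax
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = np.nanmax(a, axis=1)
```

The all-NaN slices still come out as NaN in the result; only the warning is suppressed.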


##### Comments

- Is it always sorted? You could just take the `tail` if so.
- There is not only one column. This should be applicable for multiple columns. That is why `max` is too slow for me. I have more than 1M rows and 455 columns.
- You should update your question to reflect that you are doing this operation for multiple columns. However, if it's sorted, taking the tail should still work across multiple columns.
- Can you please elaborate on "sorted"?
- It seems all your values in the `'val1'` column are sorted within each `id`. That is, for `'id' == 2`, you have `NaN` then `7.0`. Same for `'id' == 4`: `0.0` then `9.0`.
- I posted that as a comment on another answer but I see that it is deleted now. However, this is too slow for my case. On the other hand, the problem with `start_df.groupby('id').sum(min_count=1).reset_index()` is the case when there are two numbers (it is the same number, as mentioned), and it should take one of them.
- @gorjan Would you mind editing that into your question instead?
- @ayhan Edited.
- I mentioned in the question that I am aware of that solution. Searching for the max element is overkill for my case and thus too slow.