Pandas: Custom group-by function

I am looking for a custom group-by function that aggregates the rows of each group as follows:

  • If there is a number and a 0, the result is the number.
  • If there are two numbers (they will always be equal), the result is that number.
  • If there is a NaN and a NaN, the result is NaN.
  • If there is a number and a NaN, the result is the number.

An example to make things clearer (a code sketch of these rules follows the example):

import numpy as np
import pandas as pd

start_df = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3, 4, 4, 4, 5],
                         "foo": [4, 4, np.nan, 7, np.nan, np.nan, 0, 9, 9, 7],
                         "bar": [np.nan, np.nan, 0, 4, 0, 1, 6, 6, 0, 4]})

    id  foo  bar
0   1   4.0  NaN
1   1   4.0  NaN
2   2   NaN  0.0
3   2   7.0  4.0
4   3   NaN  0.0
5   3   NaN  1.0
6   4   0.0  6.0
7   4   9.0  6.0
8   4   9.0  0.0
9   5   7.0  4.0

After the custom group-by on id:

result_df = pd.DataFrame({"id": [1,2,3,4,5], "foo": [4, 7, np.nan, 9, 7], "bar": [np.nan, 4, 1, 6, 4]})


    id  foo  bar
0   1   4.0  NaN
1   2   7.0  4.0
2   3   NaN  1.0
3   4   9.0  6.0
4   5   7.0  4.0
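
For reference, the rules above can be encoded literally as a custom aggregation. This is only a slow reference implementation to make the rules concrete (it calls a Python function per group and column); combine is just an illustrative name:

def combine(values):
    """Aggregate the values of one column within one group."""
    nums = values.dropna()
    if nums.empty:        # NaN and NaN -> NaN
        return np.nan
    return nums.max()     # a number beats 0; equal numbers collapse to one

result_df = start_df.groupby("id", as_index=False).agg(combine)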

One solution that I am aware of is:

start_df.groupby("id").max().reset_index()

But it is too slow for my case, since the DataFrame I am dealing with is huge. On the other hand, this faster alternative cannot cover the edge case where both of the elements are numbers:

start_df.groupby("id").sum(min_count=1).reset_index()

Looking forward to your help!

Maybe not what you would have expected, but this should work:

start_df.groupby('id').max()

Use reset_index if you want to bring 'id' back into the columns.
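
For reference, the grouped result on the example data looks like this, with 'id' in the index (hence the reset_index suggestion):

>>> start_df.groupby('id').max()
    foo  bar
id
1   4.0  NaN
2   7.0  4.0
3   NaN  1.0
4   9.0  6.0
5   7.0  4.0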

I believe this fits what you are looking for.

Here is another approach using groupby.GroupBy.nth; specifying as_index=False in the groupby keeps the original index:

>>> start_df.groupby('id',  as_index=False).nth(1)
   id  foo  bar
1   1  4.0  NaN
3   2  7.0  4.0
5   3  NaN  1.0
7   4  9.0  6.0

Note that nth(1) takes the second row of each group, so it drops id 5, whose group has only one row; it matches the expected output only when every id appears at least twice. Alternatively:

>>> start_df.groupby(['id'], sort=False).max().reset_index()
   id  foo  bar
0   1  4.0  NaN
1   2  7.0  4.0
2   3  NaN  1.0
3   4  9.0  6.0
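
For a quick sanity check that this reproduces the expected frame from the question (a sketch using pd.testing, with result_df as defined in the question):

out = start_df.groupby(['id'], sort=False).max().reset_index()
pd.testing.assert_frame_equal(out, result_df)  # passes on the example data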

Here is another approach that avoids groupby, although I can't tell whether it is more efficient. The idea is to pad every id to the same number of rows so the data can be reshaped into a 3D array and reduced with np.nanmax along one axis. To do the padding, you can generate a DataFrame that holds the missing rows as NaN.

# count the rows of each id and find the largest group
s = start_df.id.value_counts()
nb_max = s.max()

# build a padding frame: for each id with fewer than nb_max rows,
# add the missing rows with NaN in every column except 'id'
df_nan = pd.DataFrame({col: np.nan if col != 'id'
                                   else [ids for ids, val in zip(s.index, nb_max - s.values)
                                             for _ in range(val)]
                       for col in start_df.columns})

# concatenate, sort so each id's rows are contiguous, reshape to
# (n_ids, nb_max, n_cols) and take the NaN-aware max within each group
result_df = pd.DataFrame(np.nanmax(pd.concat([start_df, df_nan])[start_df.columns]
                                     .sort_values('id').values
                                     .reshape((-1, nb_max, start_df.shape[1])),
                                   axis=1),
                         columns=start_df.columns)

Note: np.nanmax emits a RuntimeWarning saying that some slices contain only NaNs (the groups where a whole column is NaN), but the result is still correct; the warning can be silenced as shown below.
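
One way to silence it is to run the reduction inside warnings.catch_warnings, a minimal self-contained sketch:

import warnings

import numpy as np

a = np.array([[np.nan, np.nan],
              [1.0, np.nan]])

with warnings.catch_warnings():
    # np.nanmax warns "All-NaN slice encountered" for rows that are
    # entirely NaN; ignore RuntimeWarning only inside this block
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = np.nanmax(a, axis=1)  # array([nan,  1.])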

Comments
  • Is it always sorted? You could just take the tail if so.
  • There is not only one column. This should be applicable for multiple columns. That is why max is too slow for me. I have more than 1M rows and 455 columns.
  • You should update your question to reflect that you are doing this operation for multiple columns. However, if it's sorted, taking the tail should still work across multiple columns.
  • Can you please elaborate on "sorted"?
  • It seems all your values in the 'foo' column are sorted within each id. That is, for 'id' == 2, you have NaN then 7.0. Same for 'id' == 4, 0.0 then 9.0.
  • I posted that as a comment on another answer, but I see that it is deleted now. However, this is too slow for my case. On the other hand, the problem with start_df.groupby('id').sum(min_count=1).reset_index() is the case when there are two numbers (the same number, as mentioned); it should take one of them.
  • @gorjan Would you mind editing that into your question instead?
  • @ayhan Edited.
  • I mentioned in the question that I am aware of that solution. Searching for the max element is an overkill for my case and thus too slow.