How to merge columns after groupby and selecting first valid value of other columns in a pandas dataframe?
I have a pandas dataframe of the form:
ID  col_1  col_2  col_3  Date
1          20     40     1/1/2018
1   10                   1/2/2018
1   50            60     1/3/2018
3   40     10     90     1/1/2018
4          80     80     1/1/2018
The problem is, I need to create a new dataframe with the first valid values for each column, BUT also additional columns derived from 'Date' that record the date on which each of those first values occurred in the original dataframe.
In other words:
ID  first_col_1  Date_col_1  first_col_2  Date_col_2  first_col_3  Date_col_3
1   10           1/2/2018    20           1/1/2018    40           1/1/2018
3   40           1/1/2018    10           1/1/2018    90           1/1/2018
4                1/1/2018    80           1/1/2018    80           1/1/2018
I understand getting the first valid value per column per ID is as simple as df.groupby('ID').first()
But how do I extract the relevant 'Date' information for each column?
You don't need to loop, but you do need to "melt" your dataframe before your group-by operation.
So starting with:
from io import StringIO
import pandas

f = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")
df = pandas.read_csv(f)
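To see why melting helps, it can be useful to look at the intermediate "long" frame first: melt turns each (ID, Date, column) observation into its own row, so a single group-by can then pick the first non-null value per ID/column pair. A quick sketch using the same data (the variable name `melted` is just for illustration):

```python
from io import StringIO
import pandas

f = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")
df = pandas.read_csv(f)

# One row per (ID, Date, original column); the column name lands in
# 'variable' and the cell contents land in 'first'.
melted = df.melt(id_vars=['ID', 'Date'],
                 value_vars=['col_1', 'col_2', 'col_3'],
                 value_name='first')
print(melted)
```

With 5 original rows and 3 value columns, the melted frame has 15 rows and the columns ID, Date, variable, first.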
You can then:
print(
    df.melt(id_vars=['ID', 'Date'],
            value_vars=['col_1', 'col_2', 'col_3'],
            value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
)
Which gives you:
              Date                          first
variable     col_1     col_2     col_3     col_1 col_2 col_3
ID
1         1/1/2018  1/1/2018  1/1/2018      10.0  20.0  40.0
3         1/1/2018  1/1/2018  1/1/2018      40.0  10.0  90.0
4         1/1/2018  1/1/2018  1/1/2018       NaN  80.0  80.0
The columns are multi-level, so you can put some polish on them if you want:
def flatten_columns(df, sep='_'):
    newcols = [sep.join(_) for _ in df.columns]
    # set_axis returns a new frame by default; the inplace keyword
    # was removed in pandas 2.0
    return df.set_axis(newcols, axis='columns')

print(
    df.melt(id_vars=['ID', 'Date'],
            value_vars=['col_1', 'col_2', 'col_3'],
            value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
      .sort_index(level='variable', axis='columns')
      .pipe(flatten_columns)
)
Which gives you something with not quite the same column order as your example, but it's as close as I feel like making it.
   Date_col_1  first_col_1 Date_col_2  first_col_2 Date_col_3  first_col_3
ID
1    1/1/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4    1/1/2018          NaN   1/1/2018         80.0   1/1/2018         80.0
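One caveat with this output: since Date is never null, `first()` takes each group's first Date regardless of whether the value column was null on that row, which is why Date_col_1 for ID 1 reads 1/1/2018 rather than the 1/2/2018 the question asked for. A hedged sketch of one way to align them, dropping the null-value rows before grouping (the same idea the next answer uses):

```python
from io import StringIO
import pandas

f = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")
df = pandas.read_csv(f)

result = (
    df.melt(id_vars=['ID', 'Date'],
            value_vars=['col_1', 'col_2', 'col_3'],
            value_name='first')
      .dropna(subset=['first'])        # keep only rows where the value exists,
                                       # so Date travels with its value
      .groupby(['ID', 'variable'])
      .first()
      .unstack(level='variable')
)
print(result.loc[1, ('Date', 'col_1')])   # -> 1/2/2018
```

Groups with no valid value at all (ID 4, col_1) simply disappear before the group-by, so unstack fills those cells with NaN.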
newdf = (df.melt(['ID', 'Date'])
           # keep only rows where a value exists so each Date stays paired
           # with its value; the original x.value != '' assumed empty strings
           # rather than the NaN that read_csv produces
           .loc[lambda x: x.value.notna()]
           .groupby(['ID', 'variable'])
           .first()
           .unstack()
           .sort_index(level=1, axis=1))
newdf.columns = newdf.columns.map('_'.join)
newdf

   Date_col_1  value_col_1 Date_col_2  value_col_2 Date_col_3  value_col_3
ID
1    1/2/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4        None          NaN   1/1/2018         80.0   1/1/2018         80.0
I think you have to loop over the columns, and extract the first values for each of them before concatenating. I can't see a simpler way to do that.
import pandas as pd

# Create a list to store the dataframes you want for each column
sub_df = [pd.DataFrame(df['ID'].unique(), columns=['ID'])]  # init this list with the IDs

for col in df.columns[1:-1]:  # loop over the columns (except ID and Date)
    # Determine the first valid row index for this column (grouped by ID)
    valid_rows = df.groupby('ID')[col].apply(lambda s: s.first_valid_index())
    # Extract the values and dates at those rows; reindex (replacing the
    # long-deprecated .ix) yields an all-NaN row when a group has no valid value
    new_sub_df = df[[col, 'Date']].reindex(valid_rows.values).reset_index(drop=True)
    # Append to the list of sub DataFrames
    sub_df.append(new_sub_df)

# Concatenate all these DataFrames
new_df = pd.concat(sub_df, axis=1)
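For reference, `first_valid_index` returns the index label of the first non-null entry in a Series, or None when every value is null, which is what lets the loop above recover both a value and its matching Date row. A small self-contained sketch (the series here are made up for illustration):

```python
import pandas as pd

# A series with a leading null: the first valid label is 1, not 0
s = pd.Series([float('nan'), 10, 50], index=[0, 1, 2])
print(s.first_valid_index())      # -> 1

# An all-null series: there is no valid index, so None comes back
empty = pd.Series([float('nan'), float('nan')], index=[3, 4])
print(empty.first_valid_index())  # -> None
```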
- Where did the 50 from the 1st column, the 80 in the 2nd and the 80 values go? They are not present in the new DF.
- Thanks for the interest. That's the point: we need the first values of every unique ID.
- @Melsauce But if you look at ID 4, the first value of the unique ID 4 is 80, yet you have it as …
- Corrected, @pookie. Thanks.
- @Melsauce pandas.pydata.org/pandas-docs/stable/generated/…
- Is 'variable' a buffer column?
- @Melsauce it is the default column name produced by melt