How to merge columns after groupby and selecting first valid value of other columns in a pandas dataframe?


I have a pandas dataframe of the form:

df

    ID    col_1    col_2    col_3    Date
     1              20       40      1/1/2018
     1     10                        1/2/2018
     1     50                60      1/3/2018
     3     40       10       90      1/1/2018
     4              80       80      1/1/2018

The problem is that I need to create a new dataframe with the first valid value for each column per ID, BUT also with additional columns derived from 'Date' that record when each of those first valid values appeared in the original dataframe.

In other words:

new_df

    ID    first_col_1    Date_col_1    first_col_2    Date_col_2    first_col_3    Date_col_3
    1         10          1/2/2018          20         1/1/2018         40         1/1/2018 
    3         40          1/1/2018          10         1/1/2018         90         1/1/2018 
    4                     1/1/2018          80         1/1/2018         80         1/1/2018

I understand getting the first valid value per column per ID is as simple as

df.groupby('ID').first()

But how do I extract the relevant 'Date' information for each column?
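
For reference, on the sample frame above (blanks read as NaN) that returns roughly:

print(df.groupby('ID').first())
#        col_1  col_2  col_3      Date
# ID
# 1       10.0   20.0   40.0  1/1/2018
# 3       40.0   10.0   90.0  1/1/2018
# 4        NaN   80.0   80.0  1/1/2018

Note there is only a single Date per ID rather than one per column, which is exactly what I'm trying to get around.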

You don't need to loop, but you do need to "melt" your dataframe before your group-by operation.

So starting with:

from io import StringIO
import pandas
f = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")

df = pandas.read_csv(f)
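
Before grouping, it may help to look at what the melt produces: each value column becomes (variable, value) rows, with ID and Date repeated on every row, so each value stays attached to its own date. The first few rows look like this (naming the value column 'first' to match the step below):

melted = df.melt(id_vars=['ID', 'Date'],
                 value_vars=['col_1', 'col_2', 'col_3'],
                 value_name='first')
print(melted.head())
#    ID      Date variable  first
# 0   1  1/1/2018    col_1    NaN
# 1   1  1/2/2018    col_1   10.0
# 2   1  1/3/2018    col_1   50.0
# 3   3  1/1/2018    col_1   40.0
# 4   4  1/1/2018    col_1    NaN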

You can then:

print(
    df.melt(id_vars=['ID', 'Date'], value_vars=['col_1', 'col_2', 'col_3'], value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
)

Which gives you:

              Date                     first            
variable     col_1     col_2     col_3 col_1 col_2 col_3
ID                                                      
1         1/1/2018  1/1/2018  1/1/2018  10.0  20.0  40.0
3         1/1/2018  1/1/2018  1/1/2018  40.0  10.0  90.0
4         1/1/2018  1/1/2018  1/1/2018   NaN  80.0  80.0
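
One caveat: .first() takes the first non-null entry of each column independently, so Date here is each group's first date overall, not the date of the first valid value; that's why ID 1 shows 1/1/2018 for col_1 instead of 1/2/2018. If you want the dates to track the values, as in your expected output, drop the empty rows before grouping, e.g.:

print(
    df.melt(id_vars=['ID', 'Date'], value_vars=['col_1', 'col_2', 'col_3'], value_name='first')
      .dropna(subset=['first'])  # keep only rows that actually carry a value
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
)

With that, Date_col_1 for ID 1 comes out as 1/2/2018, and both col_1 entries for ID 4 are missing, since that group never has a valid value.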

The columns are multi-level, so we can put some polish on them if you want:

def flatten_columns(df, sep='_'):
    # Join each column tuple into one flat name, e.g. ('Date', 'col_1') -> 'Date_col_1'
    newcols = [sep.join(levels) for levels in df.columns]
    return df.set_axis(newcols, axis='columns')  # the inplace= argument was removed in pandas 2.0

print(
    df.melt(id_vars=['ID', 'Date'], value_vars=['col_1', 'col_2', 'col_3'], value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
      .sort_index(level='variable', axis='columns')
      .pipe(flatten_columns)
)

Which gives you something with not quite the same column order as your example, but it's close; a reordering sketch follows the output.

   Date_col_1  first_col_1 Date_col_2  first_col_2 Date_col_3  first_col_3
ID                                                                        
1    1/1/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4    1/1/2018          NaN   1/1/2018         80.0   1/1/2018         80.0
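
If the exact column order from your example matters, assign the pipeline's result to a variable (say out, named here just for illustration) instead of printing it, and reindex the flattened columns explicitly:

# Put each first_* column ahead of its Date_* partner, as in the question
order = [f'{prefix}_{col}' for col in ['col_1', 'col_2', 'col_3']
         for prefix in ('first', 'Date')]
out = out[order]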


IIUC, using melt before groupby, filtering out the empty rows first so that each group's first Date stays paired with its first valid value:

# After read_csv the blanks are NaN, so filter with notna()
# (a != '' filter only works if the blanks are empty strings)
newdf = df.melt(['ID', 'Date']).loc[lambda x: x.value.notna()]

newdf = newdf.groupby(['ID', 'variable']).first().unstack().sort_index(level=1, axis=1)

newdf.columns = newdf.columns.map('_'.join)
newdf
   Date_col_1  value_col_1 Date_col_2  value_col_2 Date_col_3  value_col_3
ID                                                                        
1    1/2/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4        None          NaN   1/1/2018         80.0   1/1/2018         80.0


I think you have to loop over the columns, and extract the first values for each of them before concatenating. I can't see a simpler way to do that.

import pandas as pd

# Create a list to store the sub-dataframes built for each column
sub_df = [pd.DataFrame(df['ID'].unique(), columns=['ID'])]  # init this list with the IDs

for col in df.columns[1:-1]:  # loop over the value columns (everything between ID and Date)

    # Find the index of the first valid (non-NaN) row for this column within each ID group
    valid_rows = df.groupby('ID')[col].apply(lambda s: s.first_valid_index())

    # Extract the values and dates at those rows (.loc here; the old .ix indexer was removed)
    # Caveat: first_valid_index() returns None for an all-NaN group (e.g. col_1 of ID 4),
    # which .loc cannot look up, so this assumes every group has at least one valid value.
    new_sub_df = df[[col, 'Date']].loc[valid_rows].reset_index(drop=True)

    # Append to the list of sub-dataframes
    sub_df.append(new_sub_df)

# Concatenate all these dataframes side by side
new_df = pd.concat(sub_df, axis=1)


Comments
  • Where did the 50 from the 1st column, the 80 in the 2nd and the 60 and 80 values go? They are not present in the new DF.
  • Thanks for the interest. That's the point: we need only the first valid values for each unique ID.
  • @Melsauce But if you look at ID 4, the first value of the unique ID 4 is 80, yet you have it as 20 in your new_df example.
  • Corrected, @pookie. Thanks.
  • @Melsauce pandas.pydata.org/pandas-docs/stable/generated/…
  • Is 'variable' a buffer column?
  • @Melsauce it is the default column name produced by melt; you can rename it with the var_name argument.