Fill missing dates for each group and impute empty values in pandas

Tags: pandas, pandas-groupby, fillna, missing-data, imputation

For the following dataframe, how can I fill in the missing dates for each city and district group? Say the full date range runs from 2019/1/1 to 2019/6/1. Empty values should then be filled with the mean of the cells before and after them; if there is no value before or after, use bfill or ffill instead.

   city district      date  value
0     a        d  2019/1/1   9.99
1     a        d  2019/2/1  10.66
2     a        d  2019/3/1  10.56
3     a        d  2019/4/1  10.06
4     a        d  2019/5/1  10.69
5     a        d  2019/6/1  10.77
6     b        e  2019/1/1   9.72
7     b        e  2019/2/1   9.72
8     b        e  2019/4/1   9.78
9     b        e  2019/5/1   9.76
10    b        e  2019/6/1   9.66
11    c        f  2019/4/1   9.57
12    c        f  2019/5/1   9.47
13    c        f  2019/6/1   9.39

The expected result will look like this:

   city district      date  value
0     a        d  2019/1/1   9.99
1     a        d  2019/2/1  10.66
2     a        d  2019/3/1  10.56
3     a        d  2019/4/1  10.06
4     a        d  2019/5/1  10.69
5     a        d  2019/6/1  10.77
6     b        e  2019/1/1   9.72
7     b        e  2019/2/1   9.72
8     b        e  2019/3/1   9.75
9     b        e  2019/4/1   9.78
10    b        e  2019/5/1   9.76
11    b        e  2019/6/1   9.66
12    c        f  2019/1/1   9.57
13    c        f  2019/2/1   9.57
14    c        f  2019/3/1   9.57
15    c        f  2019/4/1   9.57
16    c        f  2019/5/1   9.47
17    c        f  2019/6/1   9.39

How can I do that in Pandas? Thanks a lot.
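For anyone who wants to reproduce this, the sample frame can be rebuilt as below (dates kept as strings in the question's 2019/M/1 style, matching what the snippets that follow convert with `pd.to_datetime`):

```python
import pandas as pd

# Reconstruction of the sample frame from the question.
df = pd.DataFrame({
    'city':     ['a']*6 + ['b']*5 + ['c']*3,
    'district': ['d']*6 + ['e']*5 + ['f']*3,
    'date':     ['2019/1/1', '2019/2/1', '2019/3/1', '2019/4/1',
                 '2019/5/1', '2019/6/1',
                 '2019/1/1', '2019/2/1', '2019/4/1', '2019/5/1',
                 '2019/6/1',
                 '2019/4/1', '2019/5/1', '2019/6/1'],
    'value':    [9.99, 10.66, 10.56, 10.06, 10.69, 10.77,
                 9.72, 9.72, 9.78, 9.76, 9.66,
                 9.57, 9.47, 9.39],
})
```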

Update: when I add freq='M', everything becomes NaN.

df['date']=pd.to_datetime(df['date'])
( df.set_index('date')
  .groupby(['city','district'],as_index=False)
  .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq = 'M'))
                    .interpolate()
                    .bfill()
                    .ffill())
  .rename_axis(index = [0,'date'])
  .reset_index()
  .drop(0,axis=1)
)

Output:

         date  city  district  value
0  2019-01-31   NaN       NaN    NaN
1  2019-02-28   NaN       NaN    NaN
2  2019-03-31   NaN       NaN    NaN
3  2019-04-30   NaN       NaN    NaN
4  2019-05-31   NaN       NaN    NaN
5  2019-01-31   NaN       NaN    NaN
6  2019-02-28   NaN       NaN    NaN
7  2019-03-31   NaN       NaN    NaN
8  2019-04-30   NaN       NaN    NaN
9  2019-05-31   NaN       NaN    NaN
10 2019-01-31   NaN       NaN    NaN
11 2019-02-28   NaN       NaN    NaN
12 2019-03-31   NaN       NaN    NaN
13 2019-04-30   NaN       NaN    NaN
14 2019-05-31   NaN       NaN    NaN
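The all-NaN rows come from a frequency mismatch rather than the grouping: 'M' generates month-end timestamps, while the data holds month-start dates, so reindex finds no overlapping labels and every row is NaN. 'MS' (month start) lines up:

```python
import pandas as pd

# freq='M' means month END; none of these match the data's dates.
print(pd.date_range('2019-01-01', '2019-06-01', freq='M')[0])
# prints 2019-01-31 00:00:00

# freq='MS' means month START, which is what the data contains.
print(pd.date_range('2019-01-01', '2019-06-01', freq='MS')[0])
# prints 2019-01-01 00:00:00
```

Newer pandas releases are renaming the month-end alias from 'M' to 'ME', but the mismatch with month-start data is the same either way.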

You can adjust your solution to replace missing values per group, which avoids wrong replacements when a group contains only NaN values:

df['date']=pd.to_datetime(df['date'])

rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])

df1 = (df.set_index(['city', 'date']).reindex(mux, method='ffill')
       .groupby(level=0)
       .apply(lambda x: x.bfill().ffill())
       .reset_index())
print (df1)
   city       date district  value
0     a 2019-01-01        d   9.99
1     a 2019-02-01        d  10.66
2     a 2019-03-01        d  10.56
3     a 2019-04-01        d  10.06
4     a 2019-05-01        d  10.69
5     a 2019-06-01        d  10.77
6     b 2019-01-01        e   9.72
7     b 2019-02-01        e   9.72
8     b 2019-03-01        e   9.72
9     b 2019-04-01        e   9.78
10    b 2019-05-01        e   9.76
11    b 2019-06-01        e   9.66
12    c 2019-01-01        e   9.66
13    c 2019-02-01        e   9.66
14    c 2019-03-01        e   9.66
15    c 2019-04-01        f   9.57
16    c 2019-05-01        f   9.47
17    c 2019-06-01        f   9.39

Or use a custom function with reindex and method='bfill':

df2 = (df.set_index('date')
         .groupby(['city','district'], group_keys=False)
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'), method='bfill')
                           .ffill())
         .rename_axis('date')
         .reset_index())
print (df2)
         date city district  value
0  2019-01-01    a        d   9.99
1  2019-02-01    a        d  10.66
2  2019-03-01    a        d  10.56
3  2019-04-01    a        d  10.06
4  2019-05-01    a        d  10.69
5  2019-06-01    a        d  10.77
6  2019-01-01    b        e   9.72
7  2019-02-01    b        e   9.72
8  2019-03-01    b        e   9.78
9  2019-04-01    b        e   9.78
10 2019-05-01    b        e   9.76
11 2019-06-01    b        e   9.66
12 2019-01-01    c        f   9.57
13 2019-02-01    c        f   9.57
14 2019-03-01    c        f   9.57
15 2019-04-01    c        f   9.57
16 2019-05-01    c        f   9.47
17 2019-06-01    c        f   9.39 

Solution with interpolate:

df2 = (df.set_index('date')
         .groupby(['city','district'], group_keys=False)
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
                           .interpolate()
                           .bfill()
                           .ffill())
         .rename_axis('date')
         .reset_index())
print (df2)
         date city district  value
0  2019-01-01    a        d   9.99
1  2019-02-01    a        d  10.66
2  2019-03-01    a        d  10.56
3  2019-04-01    a        d  10.06
4  2019-05-01    a        d  10.69
5  2019-06-01    a        d  10.77
6  2019-01-01    b        e   9.72
7  2019-02-01    b        e   9.72
8  2019-03-01    b        e   9.75
9  2019-04-01    b        e   9.78
10 2019-05-01    b        e   9.76
11 2019-06-01    b        e   9.66
12 2019-01-01    c        f   9.57
13 2019-02-01    c        f   9.57
14 2019-03-01    c        f   9.57
15 2019-04-01    c        f   9.57
16 2019-05-01    c        f   9.47
17 2019-06-01    c        f   9.39

EDIT1: Solution for a single column only:

df2 = (df.set_index('date')
         .groupby(['city','district'])['value']
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
                           .interpolate()
                           .bfill()
                           .ffill())
         .rename_axis(['city','district','date'])
         .reset_index())
print (df2)
   city district       date  value
0     a        d 2019-01-01   9.99
1     a        d 2019-02-01  10.66
2     a        d 2019-03-01  10.56
3     a        d 2019-04-01  10.06
4     a        d 2019-05-01  10.69
5     a        d 2019-06-01  10.77
6     b        e 2019-01-01   9.72
7     b        e 2019-02-01   9.72
8     b        e 2019-03-01   9.75
9     b        e 2019-04-01   9.78
10    b        e 2019-05-01   9.76
11    b        e 2019-06-01   9.66
12    c        f 2019-01-01   9.57
13    c        f 2019-02-01   9.57
14    c        f 2019-03-01   9.57
15    c        f 2019-04-01   9.57
16    c        f 2019-05-01   9.47
17    c        f 2019-06-01   9.39  


We could do:

df['date'] = pd.to_datetime(df['date'], format='%YYYY/%dd/%mm')  # note: these directives are invalid; '%Y/%m/%d' is what matches '2019/1/1'

( df.set_index('date')
  .groupby(['city','district'],as_index=False)
  # pd.date_range without freq= defaults to daily steps, not monthly
  .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max()))
                    .interpolate()
                    .bfill()
                    .ffill())
  .rename_axis(index = [0,'date'])
  .reset_index()
  .drop(0,axis=1)
)

Output

                  date city district  value
0  2019-01-01 00:01:00    a        d   9.99
1  2019-01-02 00:01:00    a        d  10.66
2  2019-01-03 00:01:00    a        d  10.56
3  2019-01-04 00:01:00    a        d  10.06
4  2019-01-05 00:01:00    a        d  10.69
5  2019-01-06 00:01:00    a        d  10.77
6  2019-01-01 00:01:00    b        e   9.72
7  2019-01-02 00:01:00    b        e   9.72
8  2019-01-03 00:01:00    b        e   9.75
9  2019-01-04 00:01:00    b        e   9.78
10 2019-01-05 00:01:00    b        e   9.76
11 2019-01-06 00:01:00    b        e   9.66
12 2019-01-01 00:01:00    c        f   9.57
13 2019-01-02 00:01:00    c        f   9.57
14 2019-01-03 00:01:00    c        f   9.57
15 2019-01-04 00:01:00    c        f   9.57
16 2019-01-05 00:01:00    c        f   9.47
17 2019-01-06 00:01:00    c        f   9.39
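For what it's worth, here is a corrected sketch of this approach: the format string is fixed to '%Y/%m/%d', the grid uses a month-start frequency ('MS') so it matches the data, and interpolation is applied only to the numeric column so the object-dtype district column is filled with ffill/bfill instead. The variable names rng, mux, and res are illustrative:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'city':     ['a']*6 + ['b']*5 + ['c']*3,
    'district': ['d']*6 + ['e']*5 + ['f']*3,
    'date':     ['2019/1/1', '2019/2/1', '2019/3/1', '2019/4/1',
                 '2019/5/1', '2019/6/1',
                 '2019/1/1', '2019/2/1', '2019/4/1', '2019/5/1',
                 '2019/6/1',
                 '2019/4/1', '2019/5/1', '2019/6/1'],
    'value':    [9.99, 10.66, 10.56, 10.06, 10.69, 10.77,
                 9.72, 9.72, 9.78, 9.76, 9.66,
                 9.57, 9.47, 9.39],
})

# '%Y/%m/%d' is the directive set that matches strings like '2019/1/1'.
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')

# Full monthly grid per city ('MS' = month start, matching the data).
rng = pd.date_range(df['date'].min(), df['date'].max(), freq='MS')
mux = pd.MultiIndex.from_product([df['city'].unique(), rng],
                                 names=['city', 'date'])

res = df.set_index(['city', 'date']).reindex(mux)
# Fill the label column per group, then interpolate only the numeric one.
res['district'] = res.groupby(level='city')['district'].transform(
    lambda s: s.ffill().bfill())
res['value'] = res.groupby(level='city')['value'].transform(
    lambda s: s.interpolate().bfill().ffill())
res = res.reset_index()
print(res)
```

This yields one row per city per month (18 rows here), with city c keeping district 'f' for the backfilled months instead of inheriting values from another group.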


This solution also works:

df['date']=pd.to_datetime(df['date'])

rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])


print(df.set_index(['city', 'date']).reindex(mux).groupby(level=0)\
        .bfill()\
        .ffill()\
        .reset_index())

Output:

   city       date district  value
0     a 2019-01-01        d   9.99
1     a 2019-02-01        d  10.66
2     a 2019-03-01        d  10.56
3     a 2019-04-01        d  10.06
4     a 2019-05-01        d  10.69
5     a 2019-06-01        d  10.77
6     b 2019-01-01        e   9.72
7     b 2019-02-01        e   9.72
8     b 2019-03-01        e   9.78
9     b 2019-04-01        e   9.78
10    b 2019-05-01        e   9.76
11    b 2019-06-01        e   9.66
12    c 2019-01-01        f   9.57
13    c 2019-02-01        f   9.57
14    c 2019-03-01        f   9.57
15    c 2019-04-01        f   9.57
16    c 2019-05-01        f   9.47
17    c 2019-06-01        f   9.39


Comments
  • Thanks a lot; in your second solution, is it possible to add .interpolate()?
  • @ahbon - yes, you only need to remove the method parameter from reindex so the missing values stay NaN
  • Another question: what if I want to apply the lambda function to only the value1 column, when df has other columns value2, value3, value4, etc.? I have tried df.set_index('date').groupby(['city','district'], group_keys=False)['value1'].apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'), method='bfill').ffill()).rename_axis('date').reset_index(). It doesn't work.
  • @ahbon - the solution was edited to process only the value column; all other columns are removed from the output.
  • Great, this is much quicker.
  • Could you please add output?
  • Can we use ( df.set_index('date') .groupby(['city','district'],as_index=False)['value'] .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max())) .interpolate() .bfill() .ffill()) .rename_axis(index = [0,'date']) .reset_index() .drop(0,axis=1) )? My real data has more columns than just value, and everything becomes NaN after applying your code.
  • Sorry, just noticed one issue: my expected dates increase by month, but your output increases by day.