Split date range rows into years (ungroup) - Python Pandas

I have a dataframe like this:

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2023      1    2
    .......

I would like to split every row whose range spans more than one year (see the last row, which starts in 2020 and ends in 2023) into one row per year, keeping the same value in column A and splitting the value in column B proportionally among the new rows:

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2020      1    2/4
    01.01.2021  31.12.2021      1    2/4
    01.01.2022  31.12.2022      1    2/4
    01.01.2023  31.12.2023      1    2/4
    .......

Any idea?

Here is my solution. See the comments below:

import io
import pandas as pd

# TEST DATA:
text="""     start         end      A      B 
        01.01.2020  30.06.2020      2      3 
        01.01.2020  31.12.2020      3      1 
        01.04.2020  30.04.2020      6      2 
        01.01.2021  31.12.2021      2      3 
        01.07.2020  31.12.2020      8      2
        31.12.2020  20.01.2021     12     12
        31.12.2020  01.01.2021     22     22
        30.12.2020  01.01.2021     32     32
        10.05.2020  28.09.2023     44     44
        27.11.2020  31.12.2023     88     88
        31.12.2020  31.12.2023    100    100
        01.01.2020  31.12.2021    200    200
      """

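# NOTE: the dates are written day-first. Without dayfirst=True, ambiguous values such as
# 01.04.2020 are read month-first (January 4), which is what the output below shows.
# Pass dayfirst=True to read_csv to parse them as day.month.year.
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])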
#print("\n----\n df:",df)

#----------------------------------------
# SOLUTION:

def split_years(r):
    """
        Split row 'r' when "start" and "end" fall in different calendar years.
        The new rows repeat the value of 'A'; 'B' is divided evenly by the number of new rows.
        Return: a DataFrame with one row per year (None if both dates are in the same year).
    """
    t1,t2 = r["start"], r["end"]
    ys= t2.year - t1.year
    kk= 0 if t1.is_year_end else 1
    if ys>0:
        l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
        l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
        return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
    print("year difference <= 0!")
    return None


# Create two groups: one for rows where 'start' and 'end' are in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups 
print("\n---- grps:\n",grps)

# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)

# Extract the rows to be split:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)

# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr,"\n")

# Insert the "one year" data frame at the front of the list, and concatenate everything:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)

# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)

Outputs:

---- grps:
 {False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}

---- df2:
         start        end    A    B
5  2020-12-31 2021-01-20   12   12
6  2020-12-31 2021-01-01   22   22
7  2020-12-30 2021-01-01   32   32
8  2020-10-05 2023-09-28   44   44
9  2020-11-27 2023-12-31   88   88
10 2020-12-31 2023-12-31  100  100
11 2020-01-01 2021-12-31  200  200

---- ldfs:
       start        end   A    B
0 2020-12-31 2020-12-31  12  6.0
1 2021-01-01 2021-01-20  12  6.0 

       start        end   A     B
0 2020-12-31 2020-12-31  22  11.0
1 2021-01-01 2021-01-01  22  11.0 

       start        end   A     B
0 2020-12-30 2020-12-31  32  16.0
1 2021-01-01 2021-01-01  32  16.0 

       start        end   A     B
0 2020-10-05 2020-12-31  44  11.0
1 2021-01-01 2021-12-31  44  11.0
2 2022-01-01 2022-12-31  44  11.0
3 2023-01-01 2023-09-28  44  11.0 

       start        end   A     B
0 2020-11-27 2020-12-31  88  22.0
1 2021-01-01 2021-12-31  88  22.0
2 2022-01-01 2022-12-31  88  22.0
3 2023-01-01 2023-12-31  88  22.0 

       start        end    A     B
0 2020-12-31 2020-12-31  100  25.0
1 2021-01-01 2021-12-31  100  25.0
2 2022-01-01 2022-12-31  100  25.0
3 2023-01-01 2023-12-31  100  25.0 

       start        end    A      B
0 2020-01-01 2020-12-31  200  100.0
1 2021-01-01 2021-12-31  200  100.0 


---- df_rslt:
         start        end    A      B
0  2020-01-01 2020-06-30    2    3.0
1  2020-01-01 2020-12-31    3    1.0
2  2020-01-01 2020-12-31  200  100.0
3  2020-01-04 2020-04-30    6    2.0
4  2020-01-07 2020-12-31    8    2.0
5  2020-10-05 2020-12-31   44   11.0
6  2020-11-27 2020-12-31   88   22.0
7  2020-12-30 2020-12-31   32   16.0
8  2020-12-31 2020-12-31   12    6.0
9  2020-12-31 2020-12-31  100   25.0
10 2020-12-31 2020-12-31   22   11.0
11 2021-01-01 2021-12-31  100   25.0
12 2021-01-01 2021-12-31   88   22.0
13 2021-01-01 2021-12-31   44   11.0
14 2021-01-01 2021-01-01   32   16.0
15 2021-01-01 2021-01-01   22   11.0
16 2021-01-01 2021-01-20   12    6.0
17 2021-01-01 2021-12-31    2    3.0
18 2021-01-01 2021-12-31  200  100.0
19 2022-01-01 2022-12-31   88   22.0
20 2022-01-01 2022-12-31  100   25.0
21 2022-01-01 2022-12-31   44   11.0
22 2023-01-01 2023-09-28   44   11.0
23 2023-01-01 2023-12-31   88   22.0
24 2023-01-01 2023-12-31  100   25.0
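
For comparison, the same per-year split can be sketched without an explicit loop over rows, using `DataFrame.explode`. This is only a sketch and not part of the answer above: the function name `split_by_year` and the two sample rows are made up for illustration, and it assumes `start` and `end` are already parsed as datetimes with `start <= end`.

```python
import pandas as pd

def split_by_year(df):
    """Return one row per calendar year touched by each [start, end] range.

    'A' is repeated; 'B' is divided evenly by the number of years touched.
    (Hypothetical helper, assuming datetime columns 'start' and 'end'.)
    """
    out = df.copy()
    # List of calendar years each row touches, e.g. [2020, 2021, 2022, 2023].
    out["year"] = [list(range(s.year, e.year + 1))
                   for s, e in zip(out["start"], out["end"])]
    out["B"] = out["B"] / out["year"].map(len)
    # One row per (original row, year).
    out = out.explode("year").reset_index(drop=True)
    year_start = pd.to_datetime(out["year"].astype(str) + "-01-01")
    year_end = pd.to_datetime(out["year"].astype(str) + "-12-31")
    # Clip each piece to the boundaries of its calendar year.
    out["start"] = out["start"].where(out["start"] > year_start, year_start)
    out["end"] = out["end"].where(out["end"] < year_end, year_end)
    return out.drop(columns="year")

df = pd.DataFrame({
    "start": pd.to_datetime(["01.01.2020", "10.05.2020"], format="%d.%m.%Y"),
    "end": pd.to_datetime(["30.06.2020", "28.09.2023"], format="%d.%m.%Y"),
    "A": [2, 44],
    "B": [3, 44],
})
result = split_by_year(df)
print(result)
```

Rows that stay within one year pass through unchanged (apart from `B` becoming a float), so no separate grouping step is needed in this variant.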

This is a somewhat different approach: it adds new columns instead of new rows. But I think it accomplishes what you want to do.

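import datetime as dt

import pandas as pd

# Whole elapsed years between the two dates, approximated as 365-day blocks
df["years_apart"] = (
    (df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)

# range() excludes its stop value, so add 1 to include the largest span
for years in range(1, int(df["years_apart"].max()) + 1):
    df[f"{years}_end_date"] = pd.NaT
    df.loc[
        df["years_apart"] == years, f"{years}_end_date"
    ] = df.loc[
        df["years_apart"] == years, "start_date"
    ]  + dt.timedelta(days=365*years)

# Note: rows with years_apart == 0 divide by zero here and yield inf
df["B_bis"] = df["B"] / df["years_apart"]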

Output

start_date     end_date    years_apart     1_end_date   2_end_date   ... 
2018-01-01    2018-01-02      0            NaT          NaT
2018-01-02    2019-01-02      1            2019-01-02   NaT
2018-01-03    2020-01-03      2            NaT          2020-01-03
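
One thing to keep in mind with a day count divided by 365: it measures elapsed time, not the number of calendar years a range touches, and the two can differ. A small sketch with made-up dates (the names `elapsed_years` and `calendar_years` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-12-31", "2021-01-01"]),
    "end_date": pd.to_datetime(["2021-01-01", "2023-12-31"]),
})

# Elapsed time in 365-day blocks, truncated to an integer.
elapsed_years = ((df["end_date"] - df["start_date"]).dt.days / 365).astype(int)

# Number of distinct calendar years covered by each range.
calendar_years = df["end_date"].dt.year - df["start_date"].dt.year + 1

print(elapsed_years.tolist())   # [0, 2]
print(calendar_years.tolist())  # [2, 3]
```

The first range lasts a single day but crosses a year boundary, and the second covers three full calendar years while being one day short of 3 x 365 days.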

I solved it by computing the difference in years between start and end, plus a counter that adds years to the repeated rows:

from datetime import timedelta

import pandas as pd

#number of years per row: difference in 365-day blocks, plus one
#(approximate: leap years and partial years can shift the count)
table['diff'] = (table['end'] - table['start'])//timedelta(days=365) + 1

#replicate rows depending on number of years (assumes a unique index)
table = table.reindex(table.index.repeat(table['diff']))

#counter 0, 1, 2, ... within each replicated row, used to shift 'start' by whole years
table['count'] = table.groupby(level=0).cumcount()
table['start'] = [s + pd.DateOffset(years=int(n)) for s, n in zip(table['start'], table['count'])]
table['end'] = table['start']

#split B evenly among years (true division, so 2 over 4 years gives 0.5 rather than 0)
table['B'] = table['B']/table['diff']

Comments
  • can you post your output please?