Grouping time-series data by season start and end dates

I have time-series data for a yearly sports tournament, including the date on which each game was played. I want to group the games by the season in which they were played. Each season starts in August and ends in July of the following year.

How would I go about grouping the games by season, e.g. season(2016-2017), season(2017-2018), etc.?

An answer involving df.resample() may be related, but I'm not sure how to apply it here.

This is what the date column looks like:

DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
  ...   
29/09/18
07/10/18
28/10/18
03/11/18

I want to group by seasons so that I can perform visualization operations over the aggregated data.

UPDATE: For the time being, my solution is to split the dataframe into groups of 32, since I know each season has 32 games. This is the code I've used:

split_df = np.array_split(df, np.arange(32, len(df), 32))

But I'd prefer something more elegant that actually uses the dates, so I'll keep the question open.

The key to success is proper grouping; in your case, pd.Grouper(key='DATA', freq='AS-AUG'), where DATA is the name of the date column in the script below.

Note that freq='AS-AUG' means that each group starts at the beginning of August each year.

Look at the following script:

import numpy as np
import pandas as pd

# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09', 
    '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
    '25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))

If you print the results:

for name, group in gr:
    print()
    print(name)
    print(group)

you will get:

2008-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
0 2009-04-01       16       11
1 2009-07-31       10        7

2009-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
2 2009-08-01       19        6
3 2009-09-26       14        5
4 2009-10-04        8       11
5 2009-12-17       12       19
6 2010-01-25        0        0
7 2010-04-20       17        6
8 2010-07-31       18        2

2010-08-01 00:00:00
         DATA  SCORE_X  SCORE_Y
9  2010-08-01       15       18
10 2010-10-28        2        4
11 2010-11-03        8       16
12 2010-12-25       13        1
13 2011-04-20       19        7
14 2011-07-31        8        3

As you can see, each group starts on 1 August and ends on 31 July.

Then you can do whatever you want with your groups.
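
As one possible follow-up, the sketch below aggregates each season and relabels the groups as "2009-2010", "2010-2011", and so on. The sample dates and the GOALS column are invented for illustration. It passes the offset object pd.offsets.YearBegin(month=8) instead of the 'AS-AUG' string, on the assumption that this avoids differences in alias spelling between pandas versions (newer releases spell the alias 'YS-AUG').

```python
import pandas as pd

# Hypothetical sample data; GOALS is an invented column for illustration.
df = pd.DataFrame({
    "DATA": ["26/09/09", "04/10/09", "20/04/10", "31/07/10", "01/08/10", "03/11/10"],
    "GOALS": [2, 1, 3, 0, 4, 2],
})
df["DATA"] = pd.to_datetime(df["DATA"], format="%d/%m/%y")

# One bin per season, each starting on 1 August.
season_start = pd.offsets.YearBegin(month=8)
summary = df.groupby(pd.Grouper(key="DATA", freq=season_start))["GOALS"].agg(["count", "sum"])

# Replace the bin-start timestamps with labels such as "2009-2010".
summary.index = [f"{start.year}-{start.year + 1}" for start in summary.index]
print(summary)
```

The relabelled summary can be passed straight to plotting code, since each row now carries a readable season name.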

To group by calendar year, use:

df.groupby(df['DATE'].dt.year).count()

Output

    DATE
DATE    
2009    5
2018    4

Custom Season Grouping

min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()

Output

    DATE
DATE    
(2009-07-31, 2010-07-31]    3
(2010-07-31, 2011-07-31]    0
(2011-07-31, 2012-07-31]    0
(2012-07-31, 2013-07-31]    0
(2013-07-31, 2014-07-31]    0
(2014-07-31, 2015-07-31]    0
(2015-07-31, 2016-07-31]    0
(2016-07-31, 2017-07-31]    0
(2017-07-31, 2018-07-31]    1
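
A variant of the pd.cut idea, sketched under some assumptions: the bin edges are placed explicitly on 1 August of each year, one extra edge is added so the last season is covered, and right=False makes each bin run from 1 August up to (but not including) the next 1 August. The sample dates and the label format are illustrative only.

```python
import pandas as pd

# Hypothetical sample: a few games from the first and last seasons in the question.
df = pd.DataFrame({"DATE": ["26/09/09", "04/10/09", "17/10/09", "29/09/18", "03/11/18"]})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%y")

# A game played before August belongs to the season that started the previous year.
start_year = df["DATE"].dt.year - (df["DATE"].dt.month < 8).astype(int)
first, last = int(start_year.min()), int(start_year.max())

# Bin edges on 1 August of every year, including one edge past the last season.
edges = pd.to_datetime([f"{year}-08-01" for year in range(first, last + 2)])
labels = [f"{year}-{year + 1}" for year in range(first, last + 1)]

# right=False gives bins of the form [1 Aug, next 1 Aug).
season = pd.cut(df["DATE"], bins=edges, right=False, labels=labels)

# observed=False keeps seasons without any games in the result (count 0).
counts = df.groupby(season, observed=False).size()
print(counts)
```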

Resampling using 'A-JUL' as an anchored offset alias should do the trick:

>>> df
            SAMPLE
DATE              
2009-01-30       1
2009-07-10       4
2009-11-20       3
2010-01-01       5
2010-05-13       1
2010-08-01       1
>>> df.resample('A-JUL').sum()
            SAMPLE
DATE              
2009-07-31       5
2010-07-31       9
2011-07-31       1

A indicates a yearly frequency, and -JUL indicates that each year ends in July.
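
A minimal sketch of the same resampling with readable season labels, using the sample values from the session above. It passes pd.offsets.YearEnd(month=7), the offset object that 'A-JUL' stands for, on the assumption that this sidesteps alias renames in recent pandas releases. The label format is an illustrative choice.

```python
import pandas as pd

# Same sample values as in the session above.
df = pd.DataFrame(
    {"SAMPLE": [1, 4, 3, 5, 1, 1]},
    index=pd.DatetimeIndex(
        ["2009-01-30", "2009-07-10", "2009-11-20", "2010-01-01", "2010-05-13", "2010-08-01"],
        name="DATE",
    ),
)

# Yearly bins that close on 31 July.
totals = df.resample(pd.offsets.YearEnd(month=7)).sum()

# Each bin is labelled by its closing date, so the season began the year before.
totals.index = [f"{end.year - 1}-{end.year}" for end in totals.index]
print(totals)
```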

You could build a season column and group by that. In the code below, I use pandas.DateOffset() to move all dates 7 months back, so a game played in August looks like it was played in January; this aligns the season year with the calendar year. Building the season string is straightforward after that.

import pandas as pd
from datetime import date

dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])

# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)

# copy season column to the main dataframe
df['season'] = df_tmp['season']

df.groupby('season').count()
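
To see why a 7-month shift lines the seasons up with calendar years, here is a small check on dates around the July/August boundary. The three dates are chosen purely for illustration.

```python
import pandas as pd

# Dates on either side of the July/August season boundary.
dates = pd.Series(pd.to_datetime(["2010-07-31", "2010-08-01", "2011-01-15"]))

# Shifting back 7 months maps 1 August onto 1 January of the same year.
shifted = dates - pd.DateOffset(months=7)
start_year = shifted.dt.year
season = start_year.astype(str) + "-" + (start_year + 1).astype(str)
print(season.tolist())
```

The game on 31 July 2010 lands in season 2009-2010, while the games on 1 August 2010 and 15 January 2011 both land in season 2010-2011.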

Comments
  • Can you give us some (sample) data to work with?
  • The question mentions that the season starts in August and ends in July
  • So one season will have half of its games in one year and the other half in the next, so for that season the games will be in 2009 AND 2010. The next season it'll be 2010 AND 2011. I'm guessing this won't work then.
  • @AakashDusane updated ans pls check