Pandas Dataframe - for each row, return count of other rows with overlapping dates

I've got a dataframe with projects, start dates, and end dates. For each row I would like to return the number of other projects that were in progress when that project started. How do you nest loops when using df.apply()? I've tried using a for loop, but my dataframe is large and it takes far too long.

import datetime as dt
import pandas as pd

data = {'project' :['A', 'B', 'C'],
        'pr_start_date':[dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)]}

df = pd.DataFrame(data)

def cons_overlap(start):
    overlaps = 0
    for i in df.index:
        other_start = df.loc[i, 'pr_start_date']
        other_end = df.loc[i, 'pr_end_date']
        if (start > other_start) & (start < other_end):
            overlaps += 1

    return overlaps

df['overlap'] = df.apply(lambda row: cons_overlap(row['pr_start_date']), axis=1)

This is the output I'm looking for:

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

I suggest you take advantage of numpy broadcasting:

ends = df.pr_start_date.values < df.pr_end_date.values[:, None]
starts = df.pr_start_date.values > df.pr_start_date.values[:, None]
df['overlap'] = (ends & starts).sum(0)
print(df)

Output

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

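Both ends and starts are 3x3 boolean matrices that are True wherever the condition is met. Element [i, j] of ends says whether project j started before project i ended, and element [i, j] of starts says whether project j started after project i started: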

# ends   
[[ True  True  True]  
 [ True  True  True]
 [ True  True  True]]

# starts
[[False  True  True]
 [False False  True]
 [False False False]]

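Then take the intersection with the logical & and sum down each column (sum(0), i.e. along axis 0), so that entry j is the number of projects that were running when project j started.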
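To make the broadcasting step easier to follow, here is a minimal sketch that uses plain integers in place of the dates. The arrays `start` and `end` are made-up values chosen to keep the same ordering as the example frame; they are not part of the original answer.

```python
import numpy as np

# Hypothetical integer "dates" with the same ordering as projects A, B, C
start = np.array([1, 4, 6])
end = np.array([7, 12, 8])

# start has shape (3,) and end[:, None] has shape (3, 1), so the comparison
# broadcasts to shape (3, 3): element [i, j] compares project j's start
# with project i's end (or start).
ends = start < end[:, None]
starts = start > start[:, None]

# Column j counts the projects i with start[i] < start[j] < end[i]
overlap = (ends & starts).sum(0)
print(ends.shape)  # (3, 3)
print(overlap)     # [0 1 2]
```

A project never counts itself, because its start is not strictly greater than its own start.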

It should be faster than your for loop.
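One caveat for a large dataframe: the broadcast comparison builds n-by-n boolean matrices, so memory grows quadratically (for 100,000 rows, each matrix would be on the order of 10 GB). As a different technique, not the one used above, the same count can be sketched with np.searchsorted on sorted copies of the two date columns. This assumes every project ends strictly after it starts; the variable names are illustrative.

```python
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'project': ['A', 'B', 'C'],
    'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
    'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)],
})

starts = df['pr_start_date'].values
sorted_starts = np.sort(starts)
sorted_ends = np.sort(df['pr_end_date'].values)

# Number of projects that started strictly before each start date
started_before = np.searchsorted(sorted_starts, starts, side='left')
# Number of projects that had already ended on or before each start date
ended_by = np.searchsorted(sorted_ends, starts, side='right')

# Assumption: end > start for every project, so anything that has ended
# also started earlier, and the difference is the number still running.
df['overlap'] = started_before - ended_by
print(df)
```

This needs only two sorts and two binary searches per row, so it avoids the quadratic memory use, at the cost of relying on the end-after-start assumption.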

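I assume the rows are sorted by start date, so for each row it is enough to check the projects that started earlier and have not yet finished. df.index.get_loc(r.name) gives the position of the row being processed, and the trailing -1 removes the row itself from the count, since its own end date is later than its own start date.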

df["overlap"] = df.apply(lambda r: df["pr_end_date"].iloc[:df.index.get_loc(r.name) + 1].gt(r["pr_start_date"]).sum() - 1, axis=1)
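If the rows are not already sorted by start date, one option is to sort a copy first and write the counts back by index label. The sketch below does that with the question's projects deliberately shuffled; the names `ordered`, `ends` and `counts` are illustrative. Note that this still scans earlier rows for every row, so it remains quadratic in time, and rows sharing the same start date are handled differently from the strict comparison in the question.

```python
import datetime as dt
import pandas as pd

# Same projects as in the question, deliberately out of order
df = pd.DataFrame({
    'project': ['C', 'A', 'B'],
    'pr_start_date': [dt.datetime(2019, 6, 8), dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1)],
    'pr_end_date': [dt.datetime(2019, 8, 1), dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1)],
})

# Sort by start date; the original index labels travel with the rows
ordered = df.sort_values('pr_start_date')
ends = ordered['pr_end_date']

# For the row at position pos, count the rows up to and including it
# whose end date is after this start date, then subtract the row itself
counts = [
    int(ends.iloc[:pos + 1].gt(start).sum()) - 1
    for pos, start in enumerate(ordered['pr_start_date'])
]

# Assign back by index label so the original row order is preserved
df['overlap'] = pd.Series(counts, index=ordered.index)
print(df)
```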
