## R: calculate number of distinct categories in the specified time frame

number of distinct categories wikipedia
gage r&r formula
how to do a gage r&r
msa gauge r&r
part variation in msa
acceptable repeatability value
ndc
gage r&r wiki

here's some dummy data:

user_id       date category
27 2016-01-01    apple
27 2016-01-03    apple
27 2016-01-05     pear
27 2016-01-07     plum
27 2016-01-10    apple
27 2016-01-14     pear
27 2016-01-16     plum
11 2016-01-01    apple
11 2016-01-03     pear
11 2016-01-05     pear
11 2016-01-07     pear
11 2016-01-10    apple
11 2016-01-14    apple
11 2016-01-16    apple

I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order

The solution would look like this:

user_id       date category distinct_7 distinct_14
27 2016-01-01    apple          1           1
27 2016-01-03    apple          1           1
27 2016-01-05     pear          2           2
27 2016-01-07     plum          3           3
27 2016-01-10    apple          3           3
27 2016-01-14     pear          3           3
27 2016-01-16     plum          3           3
11 2016-01-01    apple          1           1
11 2016-01-03     pear          2           2
11 2016-01-05     pear          2           2
11 2016-01-07     pear          2           2
11 2016-01-10    apple          2           2
11 2016-01-14    apple          2           2
11 2016-01-16    apple          1           2

I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!

In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

library(tidyverse)

df %>% group_by(user_id) %>%
mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2

Gage This or Gage That? How the Number of Distinct Categories , This equation clearly shows the relationship between ndc and %SV and can be used to calculate the number of distinct categories for a given  Summarize time series data by a particular time unit (e.g. month to year, day to month, using pipes etc.). Use dplyr pipes to manipulate data in R. What You Need. You need R and RStudio to complete this tutorial. Also you should have an earth-analytics directory set up on your computer with a /data directory within it.

Here are two data.table solutions, one with two nested lapplyand the other using non-equi joins.

The first one is a rather clumsy data.table solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although @alistaire's concise tidyverse solution he had suggested in his comment could be modified as well).

It uses two nested lapply. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.

library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt\$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

Here is a variant following a suggestion by @Frank which uses data.table's non-equi joins instead of the second lapply:

tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]

Data:

27 2016-01-01    apple
27 2016-01-03    apple
27 2016-01-05     pear
27 2016-01-07     plum
27 2016-01-10    apple
27 2016-01-14     pear
27 2016-01-16     plum
11 2016-01-01    apple
11 2016-01-03     pear
11 2016-01-05     pear
11 2016-01-07     pear
11 2016-01-10    apple
11 2016-01-14    apple
11 2016-01-16    apple")
dt[, date := as.IDate(date)]

BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.

[PDF] Number of Distinct Categories and %Study Variation, The output for Gage R & R studies includes the number of distinct categories (ndc​) and the shows the relationship between ndc and %SV and can be used to calculate the number of distinct categories for a given percentage study variation. To figure out what data can be factored when working in R, let’s take a look at the dataset mtcars. This built-in dataset describes fuel consumption and ten different design points from 32 cars from the 1970s. It contains, in total, 11 variables, but all of them are numeric. Although you can work with the …

U recommend using runner package. You can use any R function on running windows with runner function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):

text = "  user_id       date category
27 2016-01-01    apple
27 2016-01-03    apple
27 2016-01-05     pear
27 2016-01-07     plum
27 2016-01-10    apple
27 2016-01-14     pear
27 2016-01-16     plum
11 2016-01-01    apple
11 2016-01-03     pear
11 2016-01-05     pear
11 2016-01-07     pear
11 2016-01-10    apple
11 2016-01-14    apple
11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))

library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7  = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))