## R: calculate number of distinct categories in the specified time frame

Here's some dummy data:

    user_id  date        category
    27       2016-01-01  apple
    27       2016-01-03  apple
    27       2016-01-05  pear
    27       2016-01-07  plum
    27       2016-01-10  apple
    27       2016-01-14  pear
    27       2016-01-16  plum
    11       2016-01-01  apple
    11       2016-01-03  pear
    11       2016-01-05  pear
    11       2016-01-07  pear
    11       2016-01-10  apple
    11       2016-01-14  apple
    11       2016-01-16  apple

I'd like to calculate **for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order**

The solution would look like this:

    user_id  date        category  distinct_7  distinct_14
    27       2016-01-01  apple     1           1
    27       2016-01-03  apple     1           1
    27       2016-01-05  pear      2           2
    27       2016-01-07  plum      3           3
    27       2016-01-10  apple     3           3
    27       2016-01-14  pear      3           3
    27       2016-01-16  plum      3           3
    11       2016-01-01  apple     1           1
    11       2016-01-03  pear      2           2
    11       2016-01-05  pear      2           2
    11       2016-01-07  pear      2           2
    11       2016-01-10  apple     2           2
    11       2016-01-14  apple     2           2
    11       2016-01-16  apple     1           2

I have posted similar questions before, but none of them covered counting cumulative unique values over a specified time period. Thanks a lot for your help!

In the tidyverse, you can use `map_int` to iterate over a set of values and simplify to an integer à la `sapply` or `vapply`. Count distinct occurrences with `n_distinct` (like `length(unique(...))`) of an object subset by comparisons or the helper `between`, with a minimum set by the appropriate amount subtracted from that day, and you're set.

    library(tidyverse)

    df %>%
        group_by(user_id) %>%
        mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
               distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

    ## Source: local data frame [14 x 5]
    ## Groups: user_id [2]
    ##
    ##    user_id       date category distinct_7 distinct_14
    ##      <int>     <date>   <fctr>      <int>       <int>
    ## 1       27 2016-01-01    apple          1           1
    ## 2       27 2016-01-03    apple          1           1
    ## 3       27 2016-01-05     pear          2           2
    ## 4       27 2016-01-07     plum          3           3
    ## 5       27 2016-01-10    apple          3           3
    ## 6       27 2016-01-14     pear          3           3
    ## 7       27 2016-01-16     plum          3           3
    ## 8       11 2016-01-01    apple          1           1
    ## 9       11 2016-01-03     pear          2           2
    ## 10      11 2016-01-05     pear          2           2
    ## 11      11 2016-01-07     pear          2           2
    ## 12      11 2016-01-10    apple          2           2
    ## 13      11 2016-01-14    apple          2           2
    ## 14      11 2016-01-16    apple          1           2
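For comparison, the same window logic can be sketched in base R, using `vapply` and `length(unique(...))` as mentioned above. This is only an illustrative sketch: the helper name `count_distinct_window` is made up for this example, and the data frame is rebuilt by hand from the question's dummy data.

```r
# Dummy data from the question, rebuilt by hand
df <- data.frame(
  user_id  = rep(c(27L, 11L), each = 7),
  date     = rep(as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-01-07",
                           "2016-01-10", "2016-01-14", "2016-01-16")), times = 2),
  category = c("apple", "apple", "pear", "plum", "apple", "pear", "plum",
               "apple", "pear", "pear", "pear", "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each date, count the distinct values whose date
# falls in the closed interval [date - days, date]
count_distinct_window <- function(dates, values, days) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - days & dates <= dates[i]
    length(unique(values[in_window]))
  }, integer(1))
}

# Apply the helper separately to each user's rows
df$distinct_7  <- NA_integer_
df$distinct_14 <- NA_integer_
for (id in unique(df$user_id)) {
  rows <- df$user_id == id
  df$distinct_7[rows]  <- count_distinct_window(df$date[rows], df$category[rows], 7)
  df$distinct_14[rows] <- count_distinct_window(df$date[rows], df$category[rows], 14)
}

df
```

Like the `map_int` version, this compares every date with every other date of the same user, so it is quadratic in the number of rows per user. That is fine for small groups but may be slow for long histories.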


Here are two `data.table` solutions, one with two nested `lapply` and the other using *non-equi joins*.

The first one is a rather clumsy `data.table` solution, but it reproduces the expected answer and would work for an arbitrary number of time frames. (The concise `tidyverse` solution that @alistaire suggested in his comment could be modified to do the same.)

It uses two nested `lapply` calls: the outer one loops over the time frames, the inner one over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each time frame.

    library(data.table)
    tmp <- rbindlist(
      lapply(c(7L, 14L),
             function(ldays) rbindlist(
               lapply(unique(dt$date),
                      function(ldate) {
                        dt[between(date, ldate - ldays, ldate),
                           .(distinct = sprintf("distinct_%02i", ldays),
                             date = ldate,
                             N = uniqueN(category)),
                           by = .(user_id)]
                      })
             )
      )
    )
    dcast(tmp[dt, on=c("user_id", "date")],
          ... ~ distinct, value.var = "N")[order(-user_id, date, category)]
    #          date user_id category distinct_07 distinct_14
    # 1: 2016-01-01      27    apple           1           1
    # 2: 2016-01-03      27    apple           1           1
    # 3: 2016-01-05      27     pear           2           2
    # 4: 2016-01-07      27     plum           3           3
    # 5: 2016-01-10      27    apple           3           3
    # 6: 2016-01-14      27     pear           3           3
    # 7: 2016-01-16      27     plum           3           3
    # 8: 2016-01-01      11    apple           1           1
    # 9: 2016-01-03      11     pear           2           2
    #10: 2016-01-05      11     pear           2           2
    #11: 2016-01-07      11     pear           2           2
    #12: 2016-01-10      11    apple           2           2
    #13: 2016-01-14      11    apple           2           2
    #14: 2016-01-16      11    apple           1           2

Here is a variant following a suggestion by @Frank which uses `data.table`'s *non-equi joins* instead of the second `lapply`:

    tmp <- rbindlist(
      lapply(c(7L, 14L),
             function(ldays) {
               dt[.(user_id = user_id, dago = date - ldays, d = date),
                  on=.(user_id, date >= dago, date <= d),
                  .(distinct = sprintf("distinct_%02i", ldays),
                    N = uniqueN(category)),
                  by = .EACHI]
             }
      )
    )[, date := NULL]
    # join with the original data and reshape from long to wide, as in the first variant
    dcast(tmp[dt, on=c("user_id", "date")],
          ... ~ distinct, value.var = "N")[order(-user_id, date, category)]

Data:

    dt <- fread("user_id  date  category
    27      2016-01-01  apple
    27      2016-01-03  apple
    27      2016-01-05  pear
    27      2016-01-07  plum
    27      2016-01-10  apple
    27      2016-01-14  pear
    27      2016-01-16  plum
    11      2016-01-01  apple
    11      2016-01-03  pear
    11      2016-01-05  pear
    11      2016-01-07  pear
    11      2016-01-10  apple
    11      2016-01-14  apple
    11      2016-01-16  apple")
    dt[, date := as.IDate(date)]

BTW: The wording *in the past 7, 14 days* is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.
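A quick way to check this is to count the days in the closed interval from `date - 7` to `date`. The snippet below is a small sketch in base R; the reference date is arbitrary and chosen only for illustration.

```r
# Arbitrary reference date, used only for illustration
current <- as.Date("2016-01-10")

# All calendar days in the closed windows [current - 7, current] and [current - 14, current]
window_7  <- seq(current - 7,  current, by = "day")
window_14 <- seq(current - 14, current, by = "day")

length(window_7)   # number of days covered by the "7-day" window
length(window_14)  # number of days covered by the "14-day" window
```

If a window of exactly 7 or 14 calendar days including the current day is wanted, subtracting 6 or 13 instead would be the corresponding adjustment.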


I recommend using the `runner` package. Its `runner` function lets you apply any R function on running windows. The code below obtains the desired output, which covers the past 7 days plus the current day and the past 14 days plus the current day (i.e. windows of 8 and 15 days):

    df <- read.table(
      text = "  user_id       date category
           27 2016-01-01    apple
           27 2016-01-03    apple
           27 2016-01-05     pear
           27 2016-01-07     plum
           27 2016-01-10    apple
           27 2016-01-14     pear
           27 2016-01-16     plum
           11 2016-01-01    apple
           11 2016-01-03     pear
           11 2016-01-05     pear
           11 2016-01-07     pear
           11 2016-01-10    apple
           11 2016-01-14    apple
           11 2016-01-16    apple",
      header = TRUE, colClasses = c("integer", "Date", "character"))

    library(dplyr)
    library(runner)
    df %>%
      group_by(user_id) %>%
      mutate(distinct_7  = runner(category, k = 7 + 1, idx = date,
                                  f = function(x) length(unique(x))),
             distinct_14 = runner(category, k = 14 + 1, idx = date,
                                  f = function(x) length(unique(x))))

More information is available in the package and function documentation.
