R: calculate number of distinct categories in the specified time frame

number of distinct categories wikipedia
gage r&r formula
how to do a gage r&r
msa gauge r&r
part variation in msa
acceptable repeatability value
ndc
gage r&r wiki

here's some dummy data:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order

The solution would look like this:

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!


In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2

Gage This or Gage That? How the Number of Distinct Categories , This equation clearly shows the relationship between ndc and %SV and can be used to calculate the number of distinct categories for a given  Summarize time series data by a particular time unit (e.g. month to year, day to month, using pipes etc.). Use dplyr pipes to manipulate data in R. What You Need. You need R and RStudio to complete this tutorial. Also you should have an earth-analytics directory set up on your computer with a /data directory within it.


Here are two data.table solutions, one with two nested lapplyand the other using non-equi joins.

The first one is a rather clumsy data.table solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although @alistaire's concise tidyverse solution he had suggested in his comment could be modified as well).

It uses two nested lapply. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.

library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

Here is a variant following a suggestion by @Frank which uses data.table's non-equi joins instead of the second lapply:

tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date), 
              on=.(user_id, date >= dago, date <= d), 
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)), 
              by = .EACHI]
         }
  )
)[, date := NULL]
# 
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 

Data:

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]

BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.

[PDF] Number of Distinct Categories and %Study Variation, The output for Gage R & R studies includes the number of distinct categories (ndc​) and the shows the relationship between ndc and %SV and can be used to calculate the number of distinct categories for a given percentage study variation. To figure out what data can be factored when working in R, let’s take a look at the dataset mtcars. This built-in dataset describes fuel consumption and ten different design points from 32 cars from the 1970s. It contains, in total, 11 variables, but all of them are numeric. Although you can work with the …


U recommend using runner package. You can use any R function on running windows with runner function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))

More informations in package and function documentation.

Number of Distinct Categories" or NDC Calculation in MSA Studies, Number of Distinct Categories" or NDC Calculation in MSA Studies the variation in the sample set to achieve a better NDC to satisfy their customers. by looking at the X-bar and R charts (bad measurement systems reveal themselves pretty quickly). Try this MSA spreadsheet I posted some time ago. Otherwise, distinct() first calls mutate() to create new columns. Groups are not modified. Data frame attributes are preserved. Methods. This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.


How to Calculate NDC, In statistical analysis, NDC stands for "number of distinct categories. Manufacturers will set that range by calculating the upper and lower control limits. 1 percent of the time, according to the properties of the statistical normal distribution. Arguments df. data frame to be processed. vars. variables to count unique values of. wt_var. optional variable to weight by - if this is non-NULL, count will sum up the value of this variable for each combination of id variables.


What is the Number of Distinct Categories, When I am doing a Gage R&R study, most of the time I get a result of less than 5. A small number of categories will limit your ability to estimate process parameters and/or limit *Only indicated whether the process is producing ok or not ok parts. Why it need to multiple 1.41 when we calculate NDC? You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name. > t = as.data.frame(w) > t Var1 Freq 1 3 15 2 4 12 3 5 5. You can correct this problem with the names() function. > names(t)[1] = 'gear' > t gear Freq 1 3 15 2 4 12 3 5 5


(PDF) Number of distinct data categories and gage repeatability and , PDF | Measurement systems analysis has been widely used in many industries, especially those related to the automotive sector as part of statistical | Find  count package:plyr R Documentation Count the number of occurences. Description: Equivalent to ‘as.data.frame(table(x))’, but does not include combinations with zero counts.