Count distinct by group: moving window

Let's say I have a dataset containing visits in a hospital. My goal is to generate a variable that counts the number of unique patients the visitor has seen up to and including the date of each visit. I often work with group_by from dplyr, but this seems a little tricky. I guess I would have to use group_by, n_distinct, and sum, or some kind of moving window command. The "goal" variable is what I need.

visitor visitdt patient goal
125469  1/12/2018   15200   1
125469  1/19/2018   15200   1
125469  2/16/2018   15200   1
125469  2/23/2018   52607   2
125469  3/9/2018    52607   2
125469  3/16/2018   52607   2
125469  3/23/2018   15200   2
125469  3/29/2018   15200   2
125469  3/30/2018   20589   3
125469  4/6/2018    20589   3

Thanks, Marvin

You can do:

with(df, ave(patient, visitor, FUN = function(x) cumsum(!duplicated(x))))

 [1] 1 1 1 2 2 2 2 2 3 3

Essentially, it is a cumulative sum of non-duplicated values per group.

And you can also do the same with dplyr:

df %>%
 group_by(visitor) %>%
 mutate(res = cumsum(!duplicated(patient)))
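
To see why the cumulative sum works, the steps can be spelled out one at a time. This is only a sketch on the patient column from the question for a single visitor; the intermediate names (seen_before, first_visit, running_distinct) are made up for illustration, and the rows are assumed to be already sorted by visit date.

```r
# Patient IDs for one visitor, in visit order (copied from the question).
patient <- c(15200, 15200, 15200, 52607, 52607, 52607, 15200, 15200, 20589, 20589)

# TRUE where this patient ID already appeared earlier in the vector.
seen_before <- duplicated(patient)

# Negating it flags the first appearance of each patient.
first_visit <- !seen_before

# A cumulative sum of those flags is the number of distinct patients so far.
running_distinct <- cumsum(first_visit)
print(running_distinct)
#  [1] 1 1 1 2 2 2 2 2 3 3
```

Wrapping this in ave() or group_by() simply restarts the count for each visitor.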

We can use dplyr

library(dplyr)
df1 %>%
  group_by(visitor) %>%
  mutate(goal = cummax(match(patient, unique(patient))))
  # or with factor
  # mutate(goal1 = cummax(as.integer(factor(patient, levels = unique(patient)))))

# A tibble: 10 x 4
# Groups:   visitor [1]
#   visitor visitdt   patient  goal
#     <int> <chr>       <int> <int>
# 1  125469 1/12/2018   15200     1
# 2  125469 1/19/2018   15200     1
# 3  125469 2/16/2018   15200     1
# 4  125469 2/23/2018   52607     2
# 5  125469 3/9/2018    52607     2
# 6  125469 3/16/2018   52607     2
# 7  125469 3/23/2018   15200     2
# 8  125469 3/29/2018   15200     2
# 9  125469 3/30/2018   20589     3
#10  125469 4/6/2018    20589     3
data:
df1 <- structure(list(visitor = c(125469L, 125469L, 125469L, 125469L, 
125469L, 125469L, 125469L, 125469L, 125469L, 125469L), visitdt = c("1/12/2018", 
"1/19/2018", "2/16/2018", "2/23/2018", "3/9/2018", "3/16/2018", 
"3/23/2018", "3/29/2018", "3/30/2018", "4/6/2018"), patient = c(15200L, 
15200L, 15200L, 52607L, 52607L, 52607L, 15200L, 15200L, 20589L, 
20589L), goal = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L)),
class = "data.frame", row.names = c(NA, 
-10L))
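
For readers unsure how the match()/cummax() combination produces the count, here is a small sketch of the two steps on the same patient values, outside of dplyr. The name first_seen_rank is only illustrative, and as before the rows are assumed to be in visit order within one visitor.

```r
# Patient IDs for one visitor, in visit order (copied from the question).
patient <- c(15200, 15200, 15200, 52607, 52607, 52607, 15200, 15200, 20589, 20589)

# unique() keeps patients in order of first appearance, so match() gives
# each row the rank at which its patient was first seen.
first_seen_rank <- match(patient, unique(patient))
print(first_seen_rank)
#  [1] 1 1 1 2 2 2 1 1 3 3

# The running maximum of that rank is the number of distinct patients so far:
# a returning patient (rank 1 after rank 2) does not lower the count.
goal <- cummax(first_seen_rank)
print(goal)
#  [1] 1 1 1 2 2 2 2 2 3 3
```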

Sounds important with what you are tracking. Another option is data.table, using a non-equi join and then updating by reference:

DT[, goal2 :=
    DT[.SD, on=.(visitor, visitdt<=visitdt), allow.cartesian=TRUE, 
        length(unique(patient)), by=.EACHI]$V1]

output:

    visitor    visitdt patient goal goal2
 1:  125469 2018-01-12   15200    1     1
 2:  125469 2018-01-19   15200    1     1
 3:  125469 2018-02-16   15200    1     1
 4:  125469 2018-02-23   52607    2     2
 5:  125469 2018-03-09   52607    2     2
 6:  125469 2018-03-16   52607    2     2
 7:  125469 2018-03-23   15200    2     2
 8:  125469 2018-03-29   15200    2     2
 9:  125469 2018-03-30   20589    3     3
10:  125469 2018-04-06   20589    3     3

data:

library(data.table)
DT <- fread("visitor visitdt patient goal
125469  1/12/2018   15200   1
125469  1/19/2018   15200   1
125469  2/16/2018   15200   1
125469  2/23/2018   52607   2
125469  3/9/2018    52607   2
125469  3/16/2018   52607   2
125469  3/23/2018   15200   2
125469  3/29/2018   15200   2
125469  3/30/2018   20589   3
125469  4/6/2018    20589   3")
DT[, visitdt := as.Date(visitdt, "%m/%d/%Y")]
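
The non-equi join compares dates rather than row positions, so it does not rely on the rows being sorted. A rough base R sketch of the same idea follows; count_seen_so_far is a hypothetical helper written for this illustration (not part of data.table), and it is quadratic in the number of rows, so it is only meant to show the logic.

```r
# Same data as in the question, with visitdt parsed as a Date.
visits <- data.frame(
  visitor = rep(125469L, 10),
  visitdt = as.Date(c("1/12/2018", "1/19/2018", "2/16/2018", "2/23/2018",
                      "3/9/2018", "3/16/2018", "3/23/2018", "3/29/2018",
                      "3/30/2018", "4/6/2018"), format = "%m/%d/%Y"),
  patient = c(15200L, 15200L, 15200L, 52607L, 52607L,
              52607L, 15200L, 15200L, 20589L, 20589L)
)

# Hypothetical helper: for each row, count the distinct patients seen by the
# same visitor on or before that row's date.
count_seen_so_far <- function(visitor, visitdt, patient) {
  vapply(seq_along(patient), function(i) {
    in_scope <- visitor == visitor[i] & visitdt <= visitdt[i]
    length(unique(patient[in_scope]))
  }, integer(1))
}

visits$goal2 <- with(visits, count_seen_so_far(visitor, visitdt, patient))
print(visits$goal2)
#  [1] 1 1 1 2 2 2 2 2 3 3

# Row order does not matter: reversing the rows gives the reversed result.
shuffled <- visits[10:1, ]
shuffled$goal2 <- with(shuffled, count_seen_so_far(visitor, visitdt, patient))
print(shuffled$goal2)
#  [1] 3 3 2 2 2 2 2 1 1 1
```

By contrast, the cumsum(!duplicated(...)) and cummax(match(...)) answers count in row order, so they need the data sorted by visitdt within each visitor first.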

Comments
  • Thank you so much! This is great!
  • Thank you so much! This is great! R virtually can do everything!