Identification of new values cumulatively by groups in data.table in R


How can I create a new column that flags the first appearance of each value in the Letter column, evaluated cumulatively across groups defined by unique Year + Month combinations?

Data sample.

require(data.table)
dt <- data.table(Letter = c(LETTERS[c(5, 1:2, 1:2, 1:4, 3:6)]),
                 Year = 2018,
                 Month = c(rep(5,5), rep(6,4), rep(7,4)))

Print.

    Letter Year Month
 1:      E 2018     5
 2:      A 2018     5
 3:      B 2018     5
 4:      A 2018     5
 5:      B 2018     5
 6:      A 2018     6
 7:      B 2018     6
 8:      C 2018     6
 9:      D 2018     6
10:      C 2018     7
11:      D 2018     7
12:      E 2018     7
13:      F 2018     7

Result I'm trying to get:

    Letter Year Month   New
 1:      E 2018     5  TRUE
 2:      A 2018     5  TRUE
 3:      B 2018     5  TRUE
 4:      A 2018     5  TRUE
 5:      B 2018     5  TRUE
 6:      A 2018     6 FALSE
 7:      B 2018     6 FALSE
 8:      C 2018     6  TRUE
 9:      D 2018     6  TRUE
10:      C 2018     7 FALSE
11:      D 2018     7 FALSE
12:      E 2018     7 FALSE
13:      F 2018     7  TRUE

Detailed Question:

  1. Group 1 ("E", "A", "B", "A", "B") is all TRUE by default, as there is nothing to compare it with.
  2. Which of the letters in group 2 ("A", "B", "C", "D") are not duplicated in group 1?
  3. Then, which of the letters in group 3 ("C", "D", "E", "F") are not duplicated in groups 1 and 2 combined ("E", "A", "B", "A", "B", "A", "B", "C", "D")? (A base R sketch of this logic follows below.)
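
A minimal base R sketch of the logic above, for illustration only (it assumes dt is already ordered by Year and then Month; new_base is a helper name introduced here, not part of the question):

key  <- paste(dt$Year, dt$Month)
grps <- split(dt$Letter, factor(key, levels = unique(key)))
seen <- character(0)
new_base <- unlist(lapply(grps, function(x) {
    res  <- !x %in% seen          # TRUE if the letter was not seen in any earlier group
    seen <<- union(seen, x)       # remember every letter seen so far
    res
}), use.names = FALSE)
new_base
# TRUE for rows 1-5, 8-9 and 13, FALSE otherwise -- matching the New column above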

Initialize to FALSE; then join on the first Year-Month in which each Letter appears and update those rows to TRUE:

dt[, v := FALSE]
dt[unique(dt, by="Letter"), on=.(Letter, Year, Month), v := TRUE][]

    Letter Year Month     v
 1:      E 2018     5  TRUE
 2:      A 2018     5  TRUE
 3:      B 2018     5  TRUE
 4:      A 2018     5  TRUE
 5:      B 2018     5  TRUE
 6:      A 2018     6 FALSE
 7:      B 2018     6 FALSE
 8:      C 2018     6  TRUE
 9:      D 2018     6  TRUE
10:      C 2018     7 FALSE
11:      D 2018     7 FALSE
12:      E 2018     7 FALSE
13:      F 2018     7  TRUE
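
To see why this works: unique(dt, by = "Letter") keeps only the first row for each Letter, i.e. the Year-Month in which it first appears, so the update join sets v to TRUE for exactly those rows (helper column v omitted below for clarity):

unique(dt, by = "Letter")[, .(Letter, Year, Month)]
#    Letter Year Month
# 1:      E 2018     5
# 2:      A 2018     5
# 3:      B 2018     5
# 4:      C 2018     6
# 5:      D 2018     6
# 6:      F 2018     7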


Simply:

dt[, new := ifelse(Letter %in% dt$Letter[dt$Month < Month], F, T), by = "Month"][]

 #   Letter Year Month   new
 #1:      E 2018     5  TRUE
 #2:      A 2018     5  TRUE
 #3:      B 2018     5  TRUE
 #4:      A 2018     5  TRUE
 #5:      B 2018     5  TRUE
 #6:      A 2018     6 FALSE
 #7:      B 2018     6 FALSE
 #8:      C 2018     6  TRUE
 #9:      D 2018     6  TRUE
#10:      C 2018     7 FALSE
#11:      D 2018     7 FALSE
#12:      E 2018     7 FALSE
#13:      F 2018     7  TRUE

Following the very valid comments of David A., a much faster and less verbose version (recommended):

dt[, new := !(Letter %in% dt$Letter[dt$Month<Month]), by=Month][]
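
As noted in the comments below, grouping on Month alone assumes a single Year. A hedged sketch of the same idea extended to several years, using a combined Year-Month key (ym is a helper column introduced here for illustration):

# Build a sortable Year-Month key and compare against all strictly earlier
# year-months; ym[1L] is the (constant) key of the current group.
dt[, ym := Year * 100L + Month]
dt[, new := !(Letter %in% dt$Letter[dt$ym < ym[1L]]), by = ym][]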


Another possible approach:

dupes <- c()
dt[, New := {
    x <- !Letter %chin% dupes
    dupes <- c(dupes, unique(Letter[x]))
    x
}, by=.(Year, Month)]
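
A possible variant along the lines of Frank's comment below: update dupes with <<- so that the object defined outside the call is actually modified as the groups are processed (a sketch only; the resulting New column should be the same):

# <<- writes back to the dupes vector defined outside the data.table call,
# so the accumulated set of seen letters is also available afterwards.
dupes <- character(0)
dt[, New := {
    x <- !Letter %chin% dupes
    dupes <<- c(dupes, unique(Letter[x]))
    x
}, by = .(Year, Month)]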

Some timings for reference below (dt0, dt1 and dt2 are copies of the benchmark data defined at the end of this answer):

if Letter is an integer:

library(microbenchmark)
microbenchmark(mtd0=dt0[, New := !(Letter %in% dt0$Letter[dt0$Month<Month]), by=Month],
    mtd1={
        dt1[, v := FALSE]
        dt1[unique(dt1, by="Letter"), on=.(Letter, Year, Month), v := TRUE]
    },
    mtd2={
        dupes <- c()
        dt2[, New := {
            x <- !Letter %in% dupes
            dupes <- c(dupes, unique(Letter[x]))
            x
        }, by=.(Year, Month)]        
    },
    times=3L)

integer timing output:

Unit: milliseconds
 expr       min       lq      mean    median        uq      max neval
 mtd0 1293.3100 1318.775 1331.7129 1344.2398 1350.9143 1357.589     3
 mtd1  377.1534  391.178  402.4423  405.2026  415.0868  424.971     3
 mtd2 2015.2115 2020.926 2023.7209 2026.6400 2027.9756 2029.311     3

if Letter is a character:

microbenchmark(mtd0=dt0[, New := !(Letter %chin% dt0$Letter[dt0$Month<Month]), by=Month],
    mtd1={
        dt1[, v := FALSE]
        dt1[unique(dt1, by="Letter"), on=.(Letter, Year, Month), v := TRUE]
    },
    mtd2={
        dupes <- c()
        dt2[, New := {
            x <- !Letter %chin% dupes
            dupes <- c(dupes, unique(Letter[x]))
            x
        }, by=.(Year, Month)]        
    },
    times=3L)

character timing output:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 mtd0 1658.5806 1689.8941 1765.9329 1721.2076 1819.6090 1918.0105     3
 mtd1  849.2361  851.1807  852.8632  853.1253  854.6768  856.2283     3
 mtd2  420.1013  426.0941  433.9202  432.0869  440.8296  449.5723     3

check:

> identical(dt2$New, dt1$v)
[1] TRUE
> identical(dt0$New, dt1$v)
[1] FALSE
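
The mismatch for mtd0 is expected here: the benchmark data spans 2014-2018, and mtd0 groups and compares on Month alone, so letters from other years' months leak into the comparison. A hedged check using a combined Year-Month key on a fresh copy (dt3 and ym are helper names introduced here):

dt3 <- copy(dt)
dt3[, ym := Year * 100L + Month]
dt3[, New := !(Letter %in% dt3$Letter[dt3$ym < ym[1L]]), by = ym]
identical(dt3$New, dt1$v)   # expected to be TRUE under this reasoning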

data:

set.seed(0L)
nr <- 1e7
dt <- unique(data.table(Letter=sample(nr/1e2, nr, replace=TRUE),
    Year=sample(2014:2018, nr, replace=TRUE),
    Month=sample(1:12, nr, replace=TRUE)))
setorder(dt, Year, Month)#[, Letter := as.character(Letter)]
dt0 <- copy(dt)
dt1 <- copy(dt)
dt2 <- copy(dt)

# for seed = 0L, dt has about 4.8 million rows
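
For the character-column timings, Letter is presumably converted before the dt0/dt1/dt2 copies are made, as hinted by the commented-out expression above:

# assumption based on the commented-out code: convert Letter to character
# before taking the copies used in the character benchmark
dt[, Letter := as.character(Letter)]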


Comments
  • @Andre That's a simple and great answer! Thank you!
  • @David Thank you for your addition.
  • If there are multiple years, grouping and testing inequality on months alone will not be sufficient, right? You could make a new YearMonth := paste(Year, Month) variable and use that, though.
  • I don't know what the logic is when you have different years. I would rather split by year, use my code, and rbind.
  • @Andre BTW replacing %in% with data.table's native %chin% makes it even more efficient.
  • Fyi, your dupes object outside of DT[...] doesn't get modified. That's not a problem, but some alternatives that might be of interest: have it exist only inside DT[...] like dt[, New := { if (.GRP == 1L) dupes <- c(); x <- !Letter %chin% dupes; dupes <- c(dupes, unique(Letter[x])); x }, by=.(Year, Month)] or use <<- to modify the object outside.
  • Thanks, Frank. Yeah, if a unique set is required, then <<- can be used to store dupes, so there's no need to call unique again.
  • @chinsoon12 Thank you! Great approach! Especially in terms of efficiency with character strings.