R/dplyr: Using a loop to create lags and calculate cumulative sums based on column names

I want to loop through a long list of columns in a large-ish data frame and calculate cumulative sums of the columns' lagged values. In other words, I'm calculating how much had been "done" prior to each observation.

Here is a toy data frame to make this clearer.

id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 = sample(seq(1, 20), 5)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)

I want it to look like

id   date         v1   v2   v1Cum   v2Cum
a    2015-12-01   1    13     0       0
a    2015-12-02   7    11     1       13
a    2015-12-03   12   2      8       24
b    2015-12-04   18   6      0       0
b    2015-12-05   4    9      18      6

So it's not a cumulative sum of v1 or v2 within the id groups, but rather a cumulative sum of each id's lagged values.

I can do this on individual columns no problem, but I can't seem to generalize it with a loop:

vars = c("v1", "v2")
for (var in vars) {
  lagname = paste(var, "Lag", sep="")
  cumname = paste(var, "Cum", sep="")
  df = arrange(df, id, date)
  df = df %>% 
    group_by(id) %>% 
    mutate(!!lagname := dplyr::lag(var, n = 1, default = NA))
  df[[lagname]] = ifelse(is.na(df[[lagname]]), 0, df[[lagname]])
  df = df %>% group_by(id) %>% arrange(date) %>% mutate(!!cumname := cumsum(!!lagname))
}

The problems, as I see them, are

  • the lag variable just evaluates to NA (or 0 after the ifelse()). I know I haven't quite nailed the mutate().
  • the cumulative summing is evaluating to NA

Any ideas? Thanks for the help! (I'm trying to get back into coding after a break of a couple years. My primary "language" was Stata, however, so I imagine that I'm approaching this a bit wonkily. Happy to revise this completely!)

If I understand you correctly, the following should work:

Reproducible sample data (with 3 variables for summing):

set.seed(123)
df = data.frame(
  id = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days"),
  v1 = sample(seq(1, 20), 5),
  v2 = sample(seq(1, 20), 5),
  v3 = sample(seq(1, 20), 5)
)

> df
  id       date v1 v2 v3
1  a 2015-12-01  6  1 20
2  a 2015-12-02 15 11  9
3  a 2015-12-03  8 17 13
4  b 2015-12-04 16 10 10
5  b 2015-12-05 17  8  2

Group by id, sort by date (in case the rows aren't in sequence), and mutate every column in the range between the two named ones (v1:v3 in this case):

library(dplyr)

df %>%
  group_by(id) %>%
  arrange(date) %>%
  # funs() is deprecated as of dplyr 0.8.0; a named list of formulas replaces it
  mutate_at(vars(v1:v3), list(Cum = ~ cumsum(lag(., default = 0)))) %>%
  ungroup()


# A tibble: 5 x 8
  id     date          v1    v2    v3 v1_Cum v2_Cum v3_Cum
  <fctr> <date>     <int> <int> <int>  <int>  <int>  <int>
1 a      2015-12-01     6     1    20      0      0      0
2 a      2015-12-02    15    11     9      6      1     20
3 a      2015-12-03     8    17    13     21     12     29
4 b      2015-12-04    16    10    10      0      0      0
5 b      2015-12-05    17     8     2     16     10     10
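
In recent dplyr releases (1.0.0 and later), across() covers the same ground as mutate_at(). The following is a minimal sketch rather than part of the original answer: the values are hard-coded to match the sample output above so that it does not depend on the random number generator, and the names cols and result are assumptions introduced for illustration.

```r
library(dplyr)

# Sample data with fixed values (same numbers as the printed output above)
df <- data.frame(
  id = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1 = c(6L, 15L, 8L, 16L, 17L),
  v2 = c(1L, 11L, 17L, 10L, 8L),
  v3 = c(20L, 9L, 13L, 10L, 2L)
)

# Hypothetical character vector of the columns to process
cols <- c("v1", "v2", "v3")

result <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  # Lag each column within its id group, fill the first row with 0, then cumulate
  mutate(across(all_of(cols),
                ~ cumsum(dplyr::lag(.x, default = 0L)),
                .names = "{.col}_Cum")) %>%
  ungroup()

print(result)
```

Because the columns are passed as a character vector through all_of(), the same call should work for a long list of column names without a loop.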

Here is a solution using data.table.

id <- c("a", "a", "a", "b", "b")
date <- seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 <- sample(seq(1, 20), 5)
v2 <- sample(seq(1, 20), 5)
df <- data.frame(id, date, v1, v2)
df

  id       date v1 v2
1  a 2015-12-01 19  9
2  a 2015-12-02  3 17
3  a 2015-12-03  7 14
4  b 2015-12-04 10 15
5  b 2015-12-05  8 11

library(data.table)
tab <- as.data.table(df)[, (c("v1Cum", "v2Cum")) := lapply(.SD, function(x) {
  # Shift v1 and v2.
  xs <- shift(x)

  # Cumulate those values, making an allowance for <NA> values created by the
  # shift function.
  cumsum(ifelse(is.na(xs), 0, xs))
}), by = id, .SDcols = c("v1", "v2")]
tab[]

   id       date v1 v2 v1Cum v2Cum
1:  a 2015-12-01 19  9     0     0
2:  a 2015-12-02  3 17    19     9
3:  a 2015-12-03  7 14    22    26
4:  b 2015-12-04 10 15     0     0
5:  b 2015-12-05  8 11    10    15
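
shift() also accepts a fill argument, which can replace the ifelse() step. A minimal sketch under that assumption, with values hard-coded to match the output above; the names dt and cols are introduced here for illustration only.

```r
library(data.table)

# Fixed values (same numbers as the printed output above)
dt <- data.table(
  id = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1 = c(19, 3, 7, 10, 8),
  v2 = c(9, 17, 14, 15, 11)
)

# Hypothetical character vector of the columns to process
cols <- c("v1", "v2")

# Make sure rows are in date order within each id before lagging
setorder(dt, id, date)

# shift(x, fill = 0) lags by one row and puts 0 in the first row of each group
dt[, (paste0(cols, "Cum")) := lapply(.SD, function(x) cumsum(shift(x, fill = 0))),
   by = id, .SDcols = cols]

print(dt)
```

The new column names are built from cols with paste0(), so extending the list of variables only requires changing that one vector.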

I used a similar approach to Z.Lin's.

One extra thing to know: because dplyr uses non-standard evaluation, a character string has to be converted into an expression that dplyr can evaluate, with syntax like UQ(rlang::sym(cumname)).

library(dplyr)
id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
set.seed(1)
v1 = sample(seq(1, 20), 5)
set.seed(2)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)
var_list <- c("v1","v2")
cumname <- "Cum"


df %>%
    group_by(id) %>%
    mutate_at(vars(one_of(var_list)),
              funs(UQ(rlang::sym(cumname)) := cumsum(lag(.,default = 0)))) %>%
    ungroup()

As andrew-reece mentioned in the comments, the !!cumname := ... syntax works the same way and is much more convenient:

df %>%
    group_by(id) %>%
    mutate_at(vars(one_of(var_list)),
              funs(!!cumname := cumsum(lag(.,default = 0)))) %>%
    ungroup()
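
The same string-versus-column distinction explains the loop in the question: dplyr::lag(var, ...) lags the string "v1" itself rather than the column named v1, and cumsum(!!lagname) receives the string "v1Lag", so both steps evaluate to NA. A minimal sketch of a repaired loop, assuming the .data pronoun from rlang (re-exported by dplyr) is acceptable; the values are hard-coded to match the desired output in the question.

```r
library(dplyr)

# Fixed values taken from the desired output in the question
df <- data.frame(
  id = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1 = c(1, 7, 12, 18, 4),
  v2 = c(13, 11, 2, 6, 9)
)

vars <- c("v1", "v2")

# Sort and group once, outside the loop
df <- df %>% arrange(id, date) %>% group_by(id)

for (var in vars) {
  cumname <- paste0(var, "Cum")
  # .data[[var]] looks up the column whose name is stored in var;
  # !!rlang::sym(var) would be an alternative way to do the same lookup
  df <- df %>%
    mutate(!!cumname := cumsum(dplyr::lag(.data[[var]], n = 1, default = 0)))
}

df <- ungroup(df)
print(df)
```

Passing default = 0 to lag() removes the need for the separate ifelse() step and the intermediate lag column.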

Consider a simple base R approach with ave:

set.seed(22)
id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 = sample(seq(1, 20), 5)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)

for (col in c("v1", "v2")) {
   # head(x, -1) drops the last value, so each row sums only the earlier rows in its group
   df[[paste0(col, "_cum")]] <- ave(df[[col]], df$id, FUN=function(x)
                                       cumsum(c(0, head(x, -1))))
}

print(df)
#  id       date  v1  v2 v1_cum v2_cum
#   a 2015-12-01   7  15      0      0
#   a 2015-12-02  10  12      7     15
#   a 2015-12-03  18  14     17     27
#   b 2015-12-04   9   8      0      0
#   b 2015-12-05  14   6      9      8
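
One edge case worth checking with this kind of approach is an id that has only a single row. Indexing with 1:(length(x) - 1) becomes 1:0 in that case and selects the first element instead of nothing, whereas head(x, -1) returns an empty vector. A minimal sketch with made-up values; the helper name lagged_cumsum is hypothetical.

```r
# Hypothetical helper: cumulative sum of the values before each position
lagged_cumsum <- function(x) cumsum(c(0, head(x, -1)))

# Made-up data in which id "b" has only one row
df <- data.frame(
  id = c("a", "a", "a", "b", "c", "c"),
  v1 = c(7, 10, 18, 9, 14, 3)
)

# ave() applies the helper within each id and returns a vector aligned with df
df$v1_cum <- ave(df$v1, df$id, FUN = lagged_cumsum)

print(df)
```

The single-row group gets 0, which is consistent with "nothing had been done before this observation".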

Comments
  • Ah -- this makes much more sense. Thanks for the help!
  • You can just use !!: !!cumname := ...
  • Oh, I didn't know that one before. That's much more convenient, thanks!