Mutate multiple / consecutive columns (with dplyr or base R)

I'm trying to create "waves" of variables that represent repeated measures. Specifically, I'm trying to create consecutive variables that represent the mean values for variables 1 - 10, 11 - 20 ... 91-100. Note that the "..." symbolizes the variables for waves 3 through 9, as avoiding typing these is my goal!

Here is an example data frame, df, with 10 rows and 100 columns:

mat <- matrix(runif(1000, 1, 10), ncol = 100)
df <- data.frame(mat)
dim(df)
# [1]  10 100

I've used the dplyr function mutate which works once all the variables are typed, but is time-intensive and prone to mistakes. I have not been able to find a way to do so without resorting to manually typing the names of the columns, as I started doing below (note that "..." symbolizes waves 3 through 9):

df <- df %>% 
      mutate(wave_1 = (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10) / 10,
             wave_2 = (X11 + X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20) / 10,
             ...
             wave_10 = (X91 + X92 + X93 + X94 + X95 + X96 + X97 + X98 + X99 + X100) / 10)

Can you mutate multiple / consecutive columns with 'dplyr'? Other approaches are also welcome.

Here is one way with the package zoo:

library(zoo)
t(rollapply(t(df), width = 10, by = 10, function(x) sum(x)/10))

Here is one way to do it with base R:

splits <- 1:100
dim(splits) <- c(10, 10)
splits <- split(splits, col(splits))
results <- do.call("cbind", lapply(splits, function(x) data.frame(rowSums(df[,x] / 10))))
names(results) <- paste0("wave_", 1:10)
results

Another very succinct way with base R (courtesy of G.Grothendieck):

t(apply(df, 1, tapply, gl(10, 10), mean))
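To see why this one-liner works: gl(10, 10) builds a factor with ten levels, each repeated ten times, so it labels columns 1-10 as group 1, columns 11-20 as group 2, and so on. tapply then averages each row's values within those labels. A small sketch on a toy data frame with 2 rows and 4 columns (the name toy is only for illustration and is not part of the original answer):

```r
# gl(n, k) returns a factor with n levels, each repeated k times,
# so gl(2, 2) labels four columns as groups 1, 1, 2, 2
grp <- gl(2, 2)

# Toy data: row 1 holds 1, 3, 5, 7 and row 2 holds 2, 4, 6, 8
toy <- data.frame(matrix(1:8, nrow = 2))

# For each row, average the values within each group of columns
res <- t(apply(toy, 1, tapply, grp, mean))
res
# row 1 -> 2 and 6; row 2 -> 3 and 7
```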

And here is a solution with dplyr and tidyr:

library(dplyr)
library(tidyr)
df$row <- 1:nrow(df)
df2 <- df %>% gather(column, value, -row)
df2$column <- cut(as.numeric(gsub("X", "", df2$column)),breaks = c(0:10*10))
df2 <- df2 %>% group_by(row, column) %>% summarise(value = sum(value)/10)
df2 %>% spread(column, value) %>% select(-row)
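gather() and spread() still work, but in tidyr 1.0.0 and later they are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the code above, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0 (for the .groups argument); the set.seed() call and the name waves are only there to make the sketch reproducible and are not part of the original answer:

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

waves <- df %>%
  mutate(row = row_number()) %>%
  # long format: one line per (row, column) pair
  pivot_longer(-row, names_to = "column", values_to = "value") %>%
  # X1..X10 -> wave 1, X11..X20 -> wave 2, and so on
  mutate(wave = (as.integer(sub("X", "", column)) - 1) %/% 10 + 1) %>%
  group_by(row, wave) %>%
  summarise(value = mean(value), .groups = "drop") %>%
  # back to wide format with columns wave_1 .. wave_10
  pivot_wider(names_from = wave, values_from = value, names_prefix = "wave_") %>%
  select(-row)
```

Computing the wave number arithmetically, rather than with cut(), gives plain integer labels that can be used directly as column suffixes.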

Here is another dplyr solution that is a bit closer to the syntax in the question and doesn't require reshaping the data frame.

The four wave calculations below do essentially the same thing in slightly different ways, all of them vectorized (via rowSums or rowMeans):

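library(dplyr)

df <- df %>% 
      mutate(wave_1 = rowSums(select(., num_range("X", 1:10)))/10,  # select by name prefix and number range
             wave_2 = rowSums(select(., c(11:20)))/10,               # select by column position
             wave_3 = rowMeans(select(., X21:X30)),                  # select by name range
             wave_4 = rowMeans(.[, 31:40]))                          # base-style indexing on .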

Edit: . can be used as a placeholder for the current data frame df (the code was changed accordingly). wave_4 was also added to demonstrate that . can be indexed like a data frame.
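The same rowMeans idea extends to all ten waves without typing each one. A minimal sketch in base R, assuming the columns are in order X1 to X100 and every wave spans ten consecutive columns (the loop and the names n_waves, wave_size and cols are illustrative, not part of the original answer):

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

n_waves <- 10
wave_size <- 10

for (i in seq_len(n_waves)) {
  # column positions for wave i: 1:10, 11:20, ..., 91:100
  cols <- ((i - 1) * wave_size + 1):(i * wave_size)
  # new columns are appended at the end, so positions 1..100 stay unchanged
  df[[paste0("wave_", i)]] <- rowMeans(df[, cols])
}

names(df)[101:110]
```

Because the wave columns are appended after the original 100, the positional indices used inside the loop keep pointing at X1 to X100.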

If the function you want to apply is not vectorized (that is, it can't be applied to the whole data frame at once the way rowSums can), you can instead combine rowwise and do with a non-vectorized function (e.g. myfun):

myfun <- function (x) {
  sum(x)/10
}

tmp=df %>%
  rowwise() %>%
  do(data.frame(., wave_1 = myfun(unlist(.)[1:10]))) %>%
  do(data.frame(., wave_2 = myfun(unlist(.)[11:20])))

Note: . seems to change its meaning, referring to the whole data frame inside mutate but only to the current row inside do.
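In dplyr 1.0.0 and later, do() is superseded, and a row-wise calculation like this can be written with rowwise() and c_across() instead. A sketch under that version assumption, reusing myfun from above (the set.seed() call is only there to make the sketch reproducible):

```r
library(dplyr)

# stand-in for any function that must be called one row at a time
myfun <- function(x) {
  sum(x) / 10
}

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

tmp <- df %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(wave_1 = myfun(c_across(X1:X10)),
         wave_2 = myfun(c_across(X11:X20))) %>%
  ungroup()
```

This avoids the . placeholder entirely, so there is no ambiguity about whether it refers to the whole data frame or to a single row.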

Another approach (and IMO the recommended approach) using dplyr would be to first reshape or melt your data into a tidy data format before summarizing the values from each wave.

In detail, this process would involve:

  1. Reshape your data to long format (tidyr::gather)
  2. Identify which variables belong to each "wave"
  3. Summarize values for each wave
  4. Reshape your data back to wide format (tidyr::spread)

In your example, this would look like the following:

library(tidyverse)

mat <- matrix(runif(1000, 1, 10), ncol = 100)
df <- data.frame(mat)
dim(df)

df %>%
  dplyr::mutate(id = dplyr::row_number()) %>%
  # reshape to "tidy data" or long format
  tidyr::gather(varname, value, -id) %>%
  # identify which variables belong to which "wave"
  dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                wave = floor((varnum-1)/10)+1) %>%
  # summarize your value for each wave
  dplyr::group_by(id, wave) %>%
  dplyr::summarise(avg = sum(value)/n()) %>%
  # reshape back to "wide" format
  tidyr::spread(wave, avg, sep='_') %>%
  dplyr::ungroup()

With the following output:

# A tibble: 10 x 11
      id wave_1 wave_2 wave_3 wave_4 wave_5 wave_6 wave_7 wave_8 wave_9 wave_10
   <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1     1   6.24   4.49   5.85   5.43   5.98   6.04   4.83   6.92   5.43    5.52
 2     2   5.16   6.82   5.76   6.66   6.21   5.41   4.58   5.06   5.81    6.93
 3     3   7.23   6.28   5.40   5.70   5.13   6.27   5.55   5.84   6.74    5.94
 4     4   5.27   4.79   4.39   6.85   5.31   6.01   6.15   3.31   5.73    5.63
 5     5   6.48   5.16   5.20   4.71   5.87   4.44   6.40   5.00   5.90    3.78
 6     6   4.18   4.64   5.49   5.47   5.75   6.35   4.34   5.66   5.34    6.57
 7     7   4.97   4.09   6.17   5.78   5.87   6.47   4.96   4.39   5.99    5.35
 8     8   5.50   7.21   5.43   5.15   4.56   5.00   4.86   5.72   6.41    5.65
 9     9   5.27   5.71   5.23   5.44   5.12   5.40   5.38   6.05   5.41    5.30
10    10   5.95   4.58   6.52   5.46   7.63   5.56   5.82   7.03   5.68    5.38

This could be joined back to your original data to match the example you gave (which used mutate) as follows:

df %>%
  dplyr::mutate(id = dplyr::row_number()) %>%
  tidyr::gather(varname, value, -id) %>%
  dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                wave = floor((varnum-1)/10)+1) %>%
  dplyr::group_by(id, wave) %>%
  dplyr::summarise(avg = sum(value)/n()) %>%
  tidyr::spread(wave, avg, sep='_') %>%
  dplyr::ungroup() %>%
  dplyr::right_join(df %>%    # <-- join back to original data
                     dplyr::mutate(id = dplyr::row_number()),
                   by = 'id')

One nice aspect to this approach is that you can inspect your data to confirm that you are correctly assigning variables to "wave"s.

df %>%
  dplyr::mutate(id = dplyr::row_number()) %>%
  tidyr::gather(varname, value, -id) %>%
  dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                wave = floor((varnum-1)/10)+1) %>%
  dplyr::distinct(varname, varnum, wave) %>%
  head()

which produces:

  varname varnum wave
1      X1      1    1
2      X2      2    1
3      X3      3    1
4      X4      4    1
5      X5      5    1
6      X6      6    1
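The wave assignment can also be checked directly at the boundaries. A small sketch of the arithmetic in floor((varnum-1)/10)+1, using a few hand-picked variable numbers (the vector below is illustrative):

```r
# variable numbers at the edges of waves 1, 2 and 10
varnum <- c(1, 10, 11, 20, 91, 100)

# subtracting 1 first keeps X10 in wave 1; floor(10/10)+1 would give 2
wave <- floor((varnum - 1) / 10) + 1
wave
# [1]  1  1  2  2 10 10
```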

Comments
  • Does it have to be with dplyr?
  • No, thank you - another solution would be great, too
  • Thanks, I'll wait to see if anyone has an answer using dplyr, if that's alright
  • @JoshuaRosenberg sure, no need to hurry
  • @G.Grothendieck pretty nice
  • To clarify on the latter code block: By calling do, you're operating on groups of the original data frame, so . in that situation refers to each group. rowwise is a shortcut for group_by in which every row is a separate group, hence why after rowwise, . refers to each row