Group by multiple columns in dplyr, using string vector input


I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

What am I missing to translate the plyr example into a dplyr-esque syntax?

Edit 2017: dplyr has been updated, so a simpler solution is available. See the currently selected answer.

Since this question was posted, dplyr added scoped versions of group_by, such as group_by_at. These let you pick the grouping columns with the same helper functions you would use with select, like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the remaining column (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
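
In dplyr 1.0.0 and later the scoped verbs such as group_by_at() are superseded, and the same selection can be written with across() and all_of(). The following is a minimal sketch, assuming dplyr >= 1.0.0; the short column names g1 and g2 are made up for illustration and are not from the question. It also shows the leftover grouping described above and one way to drop it.

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible
data <- data.frame(
  g1 = sample(LETTERS[1:3], 100, replace = TRUE),
  g2 = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)

# grouping columns held as a character vector
columns <- setdiff(names(data), "value")

# default behavior: summarise() peels off only the last grouping variable
res_grouped <- data %>%
  group_by(across(all_of(columns))) %>%
  summarise(Value = mean(value))
group_vars(res_grouped)  # the first grouping column is still active

# .groups = "drop" returns an ungrouped tibble, like adding ungroup()
res_flat <- data %>%
  group_by(across(all_of(columns))) %>%
  summarise(Value = mean(value), .groups = "drop")
group_vars(res_flat)     # no grouping left
```

all_of() raises an error if any name in columns is missing from the data, which is usually what you want when the names come from a variable.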


Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

output:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10


The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:

df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))

But that probably won't be there for a while (because I need to think through all the consequences).

In the meantime, you can use regroup(), which takes a list of symbols:

library(dplyr)

df <-  data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

df %.%
  regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
  summarise(n = n())

If you have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():

vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)

df %.% regroup(vars2) %.% summarise(n = n())
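
The same string-to-symbol idea can be sketched with the tidy evaluation tools that later dplyr versions provide. This assumes dplyr >= 1.0.0 (for the .groups argument) with rlang installed alongside it; the tiny data frame is made up so the counts are easy to check by hand.

```r
library(dplyr)

df <- data.frame(
  asihckhdoydk = c("A", "A", "B", "B", "B"),
  a30mvxigxkgh = c("A", "B", "A", "A", "B"),
  value = c(1, 2, 3, 4, 5)
)

# character vector of grouping column names
vars <- setdiff(names(df), "value")

# syms() turns each string into a symbol; !!! splices the list into group_by()
counts <- df %>%
  group_by(!!!rlang::syms(vars)) %>%
  summarise(n = n(), .groups = "drop")

counts
```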


String specification of columns in dplyr is now supported through variants of the dplyr functions whose names end in an underscore. For example, corresponding to the group_by function there is a group_by_ function that accepts string arguments. The dplyr vignette on non-standard evaluation, vignette("nse"), describes the syntax of these functions in detail.

The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):

# Given data and columns from the OP

data %>%
    group_by_(.dots = columns) %>%
    summarise(Value = mean(value))

(Note that dplyr now uses the %>% operator, and %.% is deprecated).


Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that take string arguments. You can mix them with the normal dplyr functions. For example:

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)
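
For comparison, a string-argument wrapper in the same spirit can be sketched with current dplyr alone. The helper name group_mean_by and its arguments are hypothetical (they are not part of the gist or of dplyr), and the sketch assumes dplyr >= 1.0.0.

```r
library(dplyr)

# hypothetical helper: group by the columns named in `cols`,
# then average the column named in `value_col`
group_mean_by <- function(data, cols, value_col) {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarise(mean_value = mean(.data[[value_col]]), .groups = "drop")
}

cols <- c("cyl", "gear")
res <- group_mean_by(mtcars, cols, "disp") %>%
  arrange(mean_value)

res
```

Because the helper returns an ordinary ungrouped tibble, it can be mixed with the normal dplyr verbs, as arrange() is here.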


Comments
  • Just got here as it was top google. You can use group_by_ now explained in vignette("nse")
  • @kungfujam: That appears to only group by the first column, not the pair of columns
  • You need to use .dots. Here's the solution adapted from @hadley 's answer below: df %>% group_by_(.dots=list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %>% summarise(n = n())
  • Have put full code in an answer below
  • As someone pointed out in an answer on the comment, the aim is to not require hardcoded column names.
  • does update to 0.7.0 make the quote-unquote system available with several columns, too?
  • You can also use the .dots arguments to group_by() as such: data %>% group_by(.dots = columns) %>% summarize(value = mean(value)).
  • Does the call to one_of() do anything here? I think it is redundant in this context, as the expression is wrapped in a call to vars().
  • @Khashir yes, this answer still works @knowah You're right, the call to one_of() is redundant in this context
  • This seems to still be hardcoding the column names, just in a formula instead. The point of the question is how to use strings so as to not have to type asihckhdoydk...