dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation) My data comes from villages, and I would like to summarize an outcome variable by a village variable.

> data
   village     A     Z      Y 
     <chr> <int> <int>   <dbl> 
 1       a     1     1   500     
 2       a     1     1   400     
 3       a     1     0   800  
 4       b     1     0   300  
 5       b     1     1   700  

For example, I would like to calculate the mean of Y by village, using only the rows where Z == z. With z = 1, I want (500 + 400)/2 = 450 for village "a" and 700 for village "b".

Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.

z <- 1 # z takes 0 or 1
data %>%
    group_by(village) %>% # grouping by village
    summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable

Y_hat_village <- function(data_village, z){
    # This function takes a part of tibble (`data_village`) and a variable `z`
    # Calculate the mean for a specific z in a village
    data_z <- data_village %>% filter(Z==get("z"))
    return(mean(data_z$Y))
}

However, I found that . passes the entire tibble rather than the current group, so the code above returns the same value for every group.

There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.

library(tidyverse)

z_val <- 1

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  return(mean(data_z$Y))
}

You can call the function on each group using do, which gives you a list-column, and then unnest that column. Again, note that I'm passing the variable z_val to the argument z.

df %>%
  group_by(village) %>%
  do(y_hat = Y_hat_village2(., z = z_val)) %>%
  unnest(y_hat)  # name the column; newer tidyr warns when cols is missing
#> # A tibble: 2 x 2
#>   village y_hat
#>   <chr>   <dbl>
#> 1 a         450
#> 2 b         700

However, do is being phased out in favor of purrr::map (current dplyr marks do as superseded), which I am still getting the hang of. In this case, you can group and nest, which gives a list-column of data frames called data, then map over that column, again supplying z = z_val. After you unnest the y_hat column, the original data is still there as a nested column, since you wanted to keep access to the rest of the columns.

df %>%
  group_by(village) %>%
  nest() %>%
  mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
  unnest(y_hat)
#> # A tibble: 2 x 3
#>   village data             y_hat
#>   <chr>   <list>           <dbl>
#> 1 a       <tibble [3 × 3]>   450
#> 2 b       <tibble [2 × 3]>   700

Just to check that everything works, I also passed in z = 0 to rule out (1) scoping issues and (2) problems with other values of z.

df %>%
  group_by(village) %>%
  nest() %>%
  mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
  unnest(y_hat)
#> # A tibble: 2 x 3
#>   village data             y_hat
#>   <chr>   <list>           <dbl>
#> 1 a       <tibble [3 × 3]>   800
#> 2 b       <tibble [2 × 3]>   300
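Since Y_hat_village2 returns a single number per village, map_dbl can stand in for map so that no unnest step is needed. This is only a sketch: it rebuilds the sample data from the question with tibble(), and the name res_map_dbl is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data copied from the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

# Same function as in the answer above
Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1

res_map_dbl <- df %>%
  group_by(village) %>%
  nest() %>%      # one row per village, with a list-column called data
  ungroup() %>%
  # map_dbl() returns a plain numeric column instead of a list-column
  mutate(y_hat = map_dbl(data, ~ Y_hat_village2(.x, z = z_val))) %>%
  select(village, y_hat)

res_map_dbl
# village a gets 450, village b gets 700
```

Drop the select() call if you still want the nested data column alongside y_hat.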

group_map: apply a function to each group.

.tbl: A grouped tibble.
.f: A function or formula to apply to each group. It must return a data frame. If a function, it is used as is. It should have at least 2 formal arguments. If a formula, e.g. ~ head(.x), it is converted to a function. In the formula, you can use . or .x to refer to the subset of rows of .tbl for the given group, and .y to refer to the key, a one-row tibble with one column per grouping variable that identifies the group.
...: Additional arguments passed on to .f.
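As a rough illustration of the group_map family applied to this question, the sketch below passes each village's rows to the custom function. It assumes a dplyr version that ships group_modify and group_map (0.8.1 or later); the names res_modify and res_list are made up for illustration.

```r
library(dplyr)

# Sample data copied from the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1

# group_modify(): .x is the current group's rows (all non-grouping columns),
# and the formula has to return a data frame, hence the tibble() wrapper
res_modify <- df %>%
  group_by(village) %>%
  group_modify(~ tibble(y_hat = Y_hat_village2(.x, z = z_val))) %>%
  ungroup()

# group_map(): same per-group calls, but the results come back as a list
res_list <- df %>%
  group_by(village) %>%
  group_map(~ Y_hat_village2(.x, z = z_val))

res_modify
# village a gets 450, village b gets 700
```

group_modify keeps the village column in the result, which is usually what you want for a summary table; group_map is handier when the per-group result is not a data frame.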

As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:

z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
#  village Y.mean
#  <fct>    <dbl>
#1 a         800.
#2 b         300.
#
#[[2]]
## A tibble: 2 x 2
#  village Y.mean
#  <fct>    <dbl>
#1 a         450.
#2 b         700.

Sample data
df <- read.table(text =
    "  village     A     Z      Y
 1       a     1     1   500
 2       a     1     1   400
 3       a     1     0   800
 4       b     1     0   300
 5       b     1     1   700  ", header = T)
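If a single table is easier to work with than a list of two tibbles, the list can be named by z value and stacked with bind_rows. This is a sketch rather than part of the original answer; the name res_by_z is made up, and the sample data is rebuilt with tibble(), so village is a character column here.

```r
library(dplyr)
library(purrr)

# Sample data copied from the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

z <- c(0, 1)

res_by_z <- z %>%
  set_names() %>%  # name each element after its own value: "0" and "1"
  map(~ df %>%
        filter(Z == .x) %>%
        group_by(village) %>%
        summarise(Y.mean = mean(Y))) %>%
  bind_rows(.id = "z")  # the list names become a character column called z

res_by_z
# four rows: z = "0" gives 800 (a) and 300 (b); z = "1" gives 450 (a) and 700 (b)
```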

You can use dplyr to accomplish it:

library(dplyr)

df %>% 
  group_by(village) %>% 
  filter(Z == 1) %>% 
  summarise(Y_village = mean(Y))

## A tibble: 2 x 2
#  village Y_village
#  <chr>       <dbl>
#1 a             450
#2 b             700

To get all columns:

df %>% 
  group_by(village) %>% 
  filter(Z == 1) %>% 
  mutate(Y_village = mean(Y)) %>% 
  distinct(village, A, Z, Y_village)

## A tibble: 2 x 4
## Groups:   village [2]
#  village     A     Z Y_village
#  <chr>   <dbl> <dbl>     <dbl>
#1 a           1     1       450
#2 b           1     1       700

Data

# tibble() replaces the deprecated data_frame(); Y matches the question's data
df <- tibble(village = c("a", "a", "a", "b", "b"),
             A = rep(1, 5),
             Z = c(1, 1, 0, 0, 1),
             Y = c(500, 400, 800, 300, 700))
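If the custom function needs the whole per-group data frame rather than a single column, one option is pick(everything()) inside summarise. In a magrittr pipe, . stands for the left-hand side of the pipe (the full grouped tibble), whereas pick() is evaluated per group. This sketch assumes dplyr 1.1.0 or later, where pick() is available; Y_hat_village2 is the function from the other answer and res_pick is a made-up name.

```r
library(dplyr)

# Sample data copied from the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

# Takes a data frame holding one village's rows plus the z value to keep
Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1

res_pick <- df %>%
  group_by(village) %>%
  # pick(everything()) is the current group's rows (non-grouping columns only)
  summarise(Y_village = Y_hat_village2(pick(everything()), z = z_val))

res_pick
# village a gets 450, village b gets 700
```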

Comments
  • I think you're looking for do, you can also consider using split then map
  • Thanks! I think your answer passes a column rather than a part of the tibble. For this toy example, I know it works, but I would like to have all columns in my function.
  • You want to keep all columns including Y or all columns with Y_village?
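
The split-then-map route mentioned in the first comment could look roughly like this. It is a sketch that reuses the sample data and the Y_hat_village2 function from the answers; res_split is a made-up name.

```r
library(dplyr)
library(purrr)

# Sample data copied from the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1

# split() returns a named list with one data frame per village (all columns kept);
# map_dbl() then calls the function on each piece and collects the numbers
res_split <- split(df, df$village) %>%
  map_dbl(~ Y_hat_village2(.x, z = z_val))

res_split
# named numeric vector: a = 450, b = 700
```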