Can dplyr package be used for conditional mutating?

dplyr ifelse multiple conditions
dplyr conditional mutate
dplyr mutate multiple conditions
dplyr::case_when
mutate dplyr
create new variable in r based on condition
dplyr::mutate if na
create new variable in r dplyr

Can the mutate be used when the mutation is conditional (depending on the values of certain column values)?

This example helps showing what I mean.

structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA, 
8L), class = "data.frame")

  a b c d e f
1 1 1 6 6 1 2
2 3 3 3 2 2 3
3 4 4 6 4 4 4
4 6 2 5 5 5 2
5 3 6 3 3 6 2
6 2 7 6 7 7 7
7 5 2 5 2 6 5
8 1 6 3 6 3 2

I was hoping to find a solution to my problem using the dplyr package (and yes I know this not code that should work, but I guess it makes the purpose clear) for creating a new column g:

 library(dplyr)
 df <- mutate(df,
         if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
         if (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4) {g = 3})

The result of the code I am looking for should have this result in this particular example:

  a b c d e f  g
1 1 1 6 6 1 2  3
2 3 3 3 2 2 3  3
3 4 4 6 4 4 4  3
4 6 2 5 5 5 2 NA
5 3 6 3 3 6 2 NA
6 2 7 6 7 7 7  2
7 5 2 5 2 6 5  2
8 1 6 3 6 3 2  3

Does anyone have an idea about how to do this in dplyr? This data frame is just an example, the data frames I am dealing with are much larger. Because of its speed I tried to use dplyr, but perhaps there are other, better ways to handle this problem?

Use ifelse

df %>%
  mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               ifelse(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA)))

Added - if_else: Note that in dplyr 0.5 there is an if_else function defined so an alternative would be to replace ifelse with if_else; however, note that since if_else is stricter than ifelse (both legs of the condition must have the same type) so the NA in that case would have to be replaced with NA_real_ .

df %>%
  mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               if_else(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA_real_)))

Added - case_when Since this question was posted dplyr has added case_when so another alternative would be:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))

Added - arithmetic/na_if If the values are numeric and the conditions (except for the default value of NA at the end) are mutually exclusive, as is the case in the question, then we can use an arithmetic expression such that each term is multiplied by the desired result using na_if at the end to replace 0 with NA.

df %>%
  mutate(g = 2 * (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)) +
             3 * (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
         g = na_if(g, 0))

Can dplyr package be used for conditional mutating?, Use ifelse df %>% mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2, ifelse(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA))). Added - if_else: Note that� Can dplyr package be used for conditional Can dplyr package be used for conditional mutating? 0 votes . 1 view. asked Jul 5, 2019 in R Programming by leealex956 (5

Since you ask for other better ways to handle the problem, here's another way using data.table:

require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note the order of conditional statements is reversed to get g correctly. There's no copy of g made, even during the second assignment - it's replaced in-place.

On larger data this would have better performance than using nested if-else, as it can evaluate both 'yes' and 'no' cases, and nesting can get harder to read/maintain IMHO.


Here's a benchmark on relatively bigger data:

# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT) 
# [1] 10000000        6
DF <- as.data.frame(DT)

DT_fun <- function(DT) {
    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

BASE_fun <- function(DF) { # R v3.1.0
    transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

system.time(ans1 <- DT_fun(DT))
#   user  system elapsed 
#  2.659   0.420   3.107 

system.time(ans2 <- DPLYR_fun(DF))
#   user  system elapsed 
# 11.822   1.075  12.976 

system.time(ans3 <- BASE_fun(DF))
#   user  system elapsed 
# 11.676   1.530  13.319 

identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE

identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE

Not sure if this is an alternative you'd asked for, but I hope it helps.

Creating New Variables in R with mutate() and ifelse(), mutate(). mutate() is a basic verb from the dplyr package, and numerous introductions to the package and its functions already exist. can dplyr package be used for conditional mutating? (4) Can the mutate be used when the mutation is conditional (depending on the values of certain column values)? This example helps showing what I mean.

dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic:::derivedFactor as you cannot access variables in the standard dplyr way, and need to declare the mode of NA, but it is considerably faster than mosaic:::derivedFactor.

df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                     a %in% c(0,1,3,4) | c == 4 ~ 3L, 
                     TRUE~as.integer(NA)))

EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).

Benchmark: For the benchmark (reusing functions from Arun 's post) and reducing sample size:

require(data.table) 
require(mosaic) 
require(dplyr)
require(microbenchmark)

set.seed(42) # To recreate the dataframe
DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)

DPLYR_case_when <- function(DF) {
  DF %>%
  mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                       a %in% c(0,1,3,4) | c==4 ~ 3L, 
                       TRUE~as.integer(NA)))
}

DT_fun <- function(DT) {
  DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
  DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
  mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
                    ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

mosa_fun <- function(DF) {
  mutate(DF, g = derivedFactor(
    "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
    "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
    .method = "first",
    .default = NA
  ))
}

perf_results <- microbenchmark(
  dt_fun <- DT_fun(copy(DT)),
  dplyr_ifelse <- DPLYR_fun(copy(DF)),
  dplyr_case_when <- DPLYR_case_when(copy(DF)),
  mosa <- mosa_fun(copy(DF)),
  times = 100L
)

This gives:

print(perf_results)
Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
         dt_fun   1.391402    1.560751   1.658337   1.651201   1.716851   2.383801   100
   dplyr_ifelse   1.172601    1.230351   1.331538   1.294851   1.390351   1.995701   100
dplyr_case_when   1.648201    1.768002   1.860968   1.844101   1.958801   2.207001   100
           mosa 255.591301  281.158350 291.391586 286.549802 292.101601 545.880702   100

Conditional mutate � Issue #1665 � tidyverse/dplyr � GitHub, In the more difficult cases I could fall back on the ifelse() function. /questions/ 24459752/can-dplyr-package-be-used-for-conditional-mutating Use summarize with the package, something like this ; dplyr::summarize(count = n()) Can dplyr package be used for conditional mutating? asked Jul 5,

The derivedFactor function from mosaic package seems to be designed to handle this. Using this example, it would look like:

library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
     "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
     "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
     .method = "first",
     .default = NA
     ))

(If you want the result to be numeric instead of a factor, you can wrap derivedFactor in an as.numeric call.)

derivedFactor can be used for an arbitrary number of conditionals, too.

Better syntax for mutate() than nested "ifelse" functions � Issue #1518 , All of the dplyr functions make R code more readable except for one / 24459752/can-dplyr-package-be-used-for-conditional-mutating?rq=1 When using the pipe operator %>% with packages such as dplyr, ggvis, dycharts, etc, how do I do a step conditionally? For example; Can dplyr package be used for

case_when is now a pretty clean implementation of the SQL-style case when:

structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA, 
8L), class = "data.frame") -> df


df %>% 
    mutate( g = case_when(
                a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 )     ~   2,
                a == 0 | a == 1 | a == 4 |  a == 3 | c == 4       ~   3
))

Using dplyr 0.7.4

The manual: http://dplyr.tidyverse.org/reference/case_when.html

mutate + if else = new conditional variable, I keep googling these slides by David Ranzolin each time I try to combine mutate with ifelse to create a new variable that is conditional on values in other variables . You can use the rename() function from the dplyr package as follows: df <- rename(df, new_name = old_name) Can dplyr package be used for conditional mutating?

Manipulating data tables with dplyr, The data file FAO_grains_NA.csv will be used in this exercise. The package dplyr offers some nifty and simple querying functions as shown in the next You can add columns (and compute their values) using the mutate function. Here, we make use of an embedded function, ifelse , which performs a conditional� data.table vs dplyr: can one do something well the other can't or does poorly? replace NA in a dplyr chain ; can dplyr package be used for conditional mutating? dplyr rename-Error: `new_name`=old_name must be a symbol or a string, not formula

Create, modify, and delete columns — mutate • dplyr, These function are generics, which means that packages can provide As well as adding new variables, you can use mutate() to # remove variables and modify � dplyr-package: dplyr: a grammar of data manipulation: summarise_all: Summarise and mutate multiple columns. summarise_each: Summarise and mutate multiple columns. same_src: Figure out if two sources are the same (or two tbl have the same source) dr_dplyr: Dr Dplyr checks your installation for common problems. top_n: Select top (or bottom) n

New variable with mutate and ifelse - tidyverse, I was about to say that what you want can't be done, as a particular column of a 'dplyr' #> The following objects are masked from 'package:stats': 1:41 ## uses 0 for rest ## warning occurs because you're comparing one� data.table vs dplyr: can one do something well the other can't or does poorly? can dplyr package be used for conditional mutating? How to extract/subset an element from a list with the magrittr %>% pipe?

Comments
  • Yes but dplyr::case_when() is much clearer than an ifelse,
  • What is the logic if instead of NA, I want the rows that don't meet the conditions to just stay the same?
  • mutate(g = ifelse(condition1, 2, ifelse(condition2, 3, g))
  • case_when is sooooo beautiful, and it took me soooo long to figure out that it was actually there. I think this should be in the simplest dplyr tutorials, it is very common to have the need for calculating stuff for subsets of the data, but still wanting to keep the data complete.
  • Nice piece of code! The answer of G. Grotendieck works and is short so i picked that one as the answer to my question, but I thank you for your solution. I sure will try it this way as well.
  • Since DT_fun is modifying its input inplace, the benchmark might not be quite fair - in addition to not receiving the same input from the 2nd iteration forward (which might affect timing since DT$g is already allocated?), the result also propagates back to ans1 and therefore might (if R's optimizer deems it necessary? Not sure on this...) avoid another copy that DPLYR_fun and BASE_fun need to make?
  • Just to be clear though, I think this data.table solution is great, and I use data.table wherever I really need speed for operations on tables & I don't want to go all the way to C++. It does require being really careful about modifications in place, though!
  • I'm trying to get used to more tidyverse stuff from data.table, and this is one of those examples of a pretty common use-case that data.table is both easier to read and more efficient. My main reason for wanting to develop more tidyverse in my vocabulary is readability for myself and others, but in this case it seems like data.table wins.
  • case_when could also be written as: df %>% mutate(g = with(., case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, a %in% c(0,1,3,4) | c==4 ~ 3L, TRUE ~ NA_integer_)))
  • Is this benchmark in microseconds/milliseconds/days, what? This benchmark is meaningless without the measurement unit provided. Also, bench-marking on a data set smaller than 1e6 is meaningless too as it doesn't scale.
  • Pls modify your answer, you don't need the .$ anymore in the new version of dplyr
  • @hadley should make this the default syntax for dplyr. Needing nested "ifelse" statements is the single worst part of the package, which is mainly the case because the other functions are so good
  • You can also prevent the result from being a factor using the .asFactor = F option or by using the (similar) derivedVariable function in the same package.
  • It looks like recode from dplyr 0.5 will do this. I haven't investigated it yet though. See blog.rstudio.org/2016/06/27/dplyr-0-5-0