Dynamic variable names in R regressions
Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models in which different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
    y <- runif(1000)
    x1 <- runif(1000)
    x2 <- runif(1000)
    df2 <- data.frame(y, x1, x2)
    summary(lm(y ~ x1 + x2, data = df2))  ## works

    var <- "x1"
    summary(lm(y ~ !!rlang::sym(var) + x2, data = df2))  # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely x1) and puts it into the code in such a way that R treats it as a variable (not a character string). But I seem to be wrong. Can anyone enlighten me?
Personally, I like to do this with some computing on the language. For me, a combination of bquote and eval is easiest (to remember).
    var <- as.symbol(var)
    eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
    #Call:
    #lm(formula = y ~ x1 + x2, data = df2)
    #
    #Residuals:
    #     Min       1Q   Median       3Q      Max
    #-0.49298 -0.26248 -0.00046  0.24111  0.51988
    #
    #Coefficients:
    #            Estimate Std. Error t value Pr(>|t|)
    #(Intercept)  0.50244    0.02480  20.258   <2e-16 ***
    #x1          -0.01468    0.03161  -0.464    0.643
    #x2          -0.01635    0.03227  -0.507    0.612
    #---
    #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    #Residual standard error: 0.2878 on 997 degrees of freedom
    #Multiple R-squared:  0.0004708, Adjusted R-squared:  -0.001534
    #F-statistic: 0.2348 on 2 and 997 DF,  p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
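Since the question is about looping, here is a minimal sketch of how the bquote/eval idea might be applied over several candidate variables. The extra column x3, the seed, and the names candidates and fits are assumptions added for illustration; they are not part of the original example.

```r
# Simulated data in the spirit of the question; x3 and the seed are
# assumptions added so that there is more than one variable to swap in.
set.seed(1)
df2 <- data.frame(y = runif(1000), x1 = runif(1000),
                  x2 = runif(1000), x3 = runif(1000))

# Hypothetical set of variables to try in the first slot of the formula
candidates <- c("x1", "x3")

fits <- lapply(candidates, function(v) {
  v_sym <- as.symbol(v)
  # bquote() substitutes .(v_sym) into the unevaluated call,
  # eval() then runs the completed call
  eval(bquote(lm(y ~ .(v_sym) + x2, data = df2)))
})
names(fits) <- candidates

# The stored call contains the substituted variable name
fits$x1$call
fits$x3$call
```

Because the substitution happens before lm() is called, each fitted model records the formula exactly as if it had been typed by hand.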
The bang-bang operator !! only works with "tidy" functions. It's not a part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap those in functions that can do the expansion. rlang::expr is one such example:
    rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
    # summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it:
    rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
    # Call:
    # lm(formula = y ~ x1 + x2, data = df2)
    #
    # Residuals:
    #      Min       1Q   Median       3Q      Max
    # -0.49178 -0.25482  0.00027  0.24566  0.50730
    #
    # Coefficients:
    #               Estimate Std. Error t value Pr(>|t|)
    # (Intercept)  0.4953683  0.0242949  20.390   <2e-16 ***
    # x1          -0.0006298  0.0314389  -0.020    0.984
    # x2          -0.0052848  0.0318073  -0.166    0.868
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # Residual standard error: 0.2882 on 997 degrees of freedom
    # Multiple R-squared:  2.796e-05, Adjusted R-squared:  -0.001978
    # F-statistic: 0.01394 on 2 and 997 DF,  p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
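The claim that !! is not part of core R can be checked directly. The sketch below assumes the rlang package is installed; the seed and the object names e, unquoted and fit are introduced here for illustration only. It first shows that base R parses !!x as two nested negations, then builds and evaluates the call with rlang.

```r
# Base R sees `!!x` as two nested calls to the negation operator `!`
e <- quote(!!x)
identical(e[[1]], as.name("!"))       # outer call is `!`
identical(e[[2]][[1]], as.name("!"))  # inner call is also `!`

# Data as in the question (seed added for reproducibility)
set.seed(1)
df2 <- data.frame(y = runif(1000), x1 = runif(1000), x2 = runif(1000))
var <- "x1"

# rlang::expr() understands `!!` and splices the symbol into the call
unquoted <- rlang::expr(lm(y ~ !!rlang::sym(var) + x2, data = df2))
unquoted

# eval_tidy() evaluates the completed call
fit <- rlang::eval_tidy(unquoted)
fit$call
```

Keeping the expression-building step separate from the evaluation step makes it easy to inspect the call before any model is fitted.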
1) Just use lm(df2) or, if df2 has additional columns beyond what is shown in the question but we just want to regress on var and x2, then:
    df3 <- df2[c("y", var, "x2")]
    lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
    fo <- formula(model.frame(df3))
    fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the formula into it as in the second line:
    fm <- lm(df3)
    fm$call <- formula(model.frame(df3))
Either one gives this:
    > fm

    Call:
    lm(formula = y ~ x1 + x2, data = df3)

    Coefficients:
    (Intercept)           x1           x2
        0.44752      0.04278      0.05011
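As a rough sketch of how approach 1 could be reused in a loop, the subsetting step can be wrapped in a small helper. The function name fit_subset, the extra column z, and the seed are hypothetical and only serve the illustration. The helper relies on lm() treating a data frame as "first column regressed on all remaining columns".

```r
# Data as in the question plus one unrelated column z (an assumption,
# added to show that unwanted columns are dropped before fitting)
set.seed(1)
df2 <- data.frame(y = runif(1000), x1 = runif(1000),
                  x2 = runif(1000), z = runif(1000))

# Hypothetical helper: keep the response first, then the chosen predictors
fit_subset <- function(data, response, predictors) {
  sub <- data[c(response, predictors)]
  lm(sub)  # the first column is taken as the response
}

var <- "x1"
fm <- fit_subset(df2, "y", c(var, "x2"))

formula(fm)      # formula recovered from the fitted model
names(coef(fm))  # z does not appear among the coefficients
```

Note that the call stored in fm refers to the helper's internal data frame, so this variant is only suitable when the printed call does not matter.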
2) character string

lm accepts a character string for the formula so this also works. The fn$ prefix causes substitution to occur in the character arguments.
    library(gsubfn)
    fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)
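A closely related base R option, not used in the answer above and offered here only as an alternative sketch, is reformulate() from the stats package, which builds the formula object from character vectors instead of pasting a string. The seed and the names fo and fm are assumptions for this example.

```r
# Data as in the question (seed added for reproducibility)
set.seed(1)
df2 <- data.frame(y = runif(1000), x1 = runif(1000), x2 = runif(1000))
var <- "x1"

# reformulate() assembles a formula from term labels and a response name
fo <- reformulate(c(var, "x2"), response = "y")
fo

# do.call() passes the already-built formula, so the stored call shows it
# in expanded form; quote(df2) keeps the data argument as a name
fm <- do.call("lm", list(fo, quote(df2)))
fm$call
```

Passing the formula through do.call() is what keeps the expanded formula in the printed call; calling lm(fo, df2) directly would fit the same model but display formula = fo.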