Correlations between numerous variables grouped in dplyr

Say I have a data frame, like this:

# Set RNG seed
set.seed(33550336)

# Create dummy data frame
df <- data.frame(PC1 = runif(20),
                 PC2 = runif(20),
                 PC3 = runif(20),
                 A = runif(20),
                 B = runif(20),
                 loc = sample(LETTERS[1:2], 20, replace = TRUE),
                 seas = sample(c("W", "S"), 20, replace = TRUE))

# > head(df)
#         PC1        PC2       PC3         A         B loc seas
# 1 0.8636470 0.02220823 0.7553348 0.4679607 0.0787467   A    S
# 2 0.3522257 0.42733152 0.2412971 0.6691419 0.1194121   A    W
# 3 0.5257408 0.44293320 0.3225228 0.0934192 0.2966507   B    S
# 4 0.0667227 0.90273594 0.6297959 0.1962124 0.4894373   A    W
# 5 0.3751383 0.50477920 0.6567203 0.4510632 0.4742191   B    S
# 6 0.9197086 0.32024904 0.8382138 0.9907894 0.9335657   A    S

I'm interested in calculating correlations between PC1, PC2, and PC3 and each of the variables A and B grouped by loc and seas. So, for example, based on this answer, I could do the following:

# Correlation of variable A and PC1 per loc & seas combination
library(dplyr)

df %>% 
  group_by(loc, seas) %>% 
  summarise(cor = cor(PC1, A)) %>% 
  ungroup()

# # A tibble: 4 x 3
#   loc   seas      cor
#   <fct> <fct>   <dbl>
# 1 A     S      0.458 
# 2 A     W      0.748 
# 3 B     S     -0.0178
# 4 B     W     -0.450 

This gives me what I want: the correlation between PC1 and A for each combination of loc and seas. Awesome.

What I'm struggling with is extending this to perform the calculation for each combination of the PC* variables and the other variables (i.e., A and B in the example). My expected outcome is the tibble immediately above, but with a column for each combination of PC* and the other variables. I could do this longhand (cor(PC2, A), cor(PC3, A), cor(PC1, B), etc.), but presumably there is a succinct way of coding the calculation. I suspect it involves do, but I can't quite get my head around it. Can someone enlighten me?


Solution

I went with G. Grothendieck's solution below, but this required some restructuring to get it into the required format. I have posted the code I used here in case it is useful for others.

# Load dplyr, tidyr and tibble (used by restructure() below)
library(tidyverse)

# Perform calculation
res <- by(df[1:5], df[-(1:5)], cor)

# Combinations of loc & seas 
comb <- expand.grid(dimnames(res))

#   loc seas
# 1   A    S
# 2   B    S
# 3   A    W
# 4   B    W

# A matrix corresponding to a loc & seas
# Plus the loc & seas themselves
restructure <- function(m, n){
  # Convert to data frame
  # Add rownames as column
  # Retains PCs as rows, but not columns
  # Gather variables to long format
  # Unite PC & variable names
  # Spread to a single row
  # Add combination of loc & seas
  m %>% 
    data.frame %>% 
    rownames_to_column() %>% 
    filter(grepl("PC", rownames(m))) %>% 
    select(-contains("PC")) %>% 
    gather(variable, value, -rowname) %>% 
    unite(comb, rowname, variable) %>% 
    spread(comb, value) %>% 
    bind_cols(n)
}

# Restructure each list element & combine into data frame
do.call(rbind, lapply(seq_along(res), function(x) restructure(res[[x]], comb[x, ])))

which gives,

#         PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B loc seas
# 1  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403   A    S
# 2 -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262   B    S
# 3  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690   A    W
# 4 -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706   B    W

Use by like this:

By <- by(df[1:5], df[-(1:5)], cor)

giving:

> By
loc: A
seas: S
            PC1        PC2        PC3          A           B
PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: S
            PC1         PC2         PC3           A          B
PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: A
seas: W
           PC1         PC2        PC3          A           B
PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: W
           PC1        PC2        PC3          A          B
PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000
ADDED

Based on further discussion with the poster about what is wanted, define the onerow function, which accepts either a correlation matrix or a data frame (in the latter case it first converts the first 5 columns to a correlation matrix) and produces one row of the output. The if statement in onerow is not needed for the adply line of code, but it does no harm; it is included so that onerow also works in the subsequent examples below.

library(plyr)

onerow <- function(x) {
  if (is.data.frame(x)) x <- cor(x[1:5])
  dtab <- as.data.frame.table(x[4:5, 1:3])
  with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
}

adply(By, 1:2, onerow)

giving:

  loc seas       PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B
1   A    S  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403
2   B    S -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262
3   A    W  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690
4   B    W -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706

Alternatively, drop by altogether and use ddply, which gives the same output:

library(plyr)
ddply(df, -(1:5), onerow)

or using dplyr:

library(dplyr)
df %>%
  group_by_at(-(1:5)) %>%
  do( onerow(.) %>% t %>% as.data.frame ) %>%
  ungroup
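
To see why onerow produces names such as PC1_A, it may help to look at what as.data.frame.table does to a small matrix. The sketch below uses a made-up 2 x 3 matrix m (placeholder values, not real correlations) laid out like x[4:5, 1:3] in onerow; the names m and flat are only for illustration.

```r
# Toy matrix shaped like x[4:5, 1:3]: rows are A and B, columns are the PCs.
# The values 1..6 are placeholders, not correlations.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("A", "B"), c("PC1", "PC2", "PC3")))

# as.data.frame.table() unrolls the matrix column by column into
# three columns: Var1 (row name), Var2 (column name) and Freq (value).
dtab <- as.data.frame.table(m)

# Same naming step as in onerow(): "<column>_<row>" for each value.
flat <- with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
flat
```

Because the matrix is unrolled column by column, the resulting vector is ordered PC1_A, PC1_B, PC2_A, and so on, which matches the column order of the adply output above.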

We can do a split and cor in base R

lapply(split(df[1:5], df[-(1:5)]), cor)
#$A.S
#            PC1        PC2        PC3          A           B
#PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
#PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
#PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
#A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
#B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000

#$B.S
#            PC1         PC2         PC3           A          B
#PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
#PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
#PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
#A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
#B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000

#$A.W
#           PC1         PC2        PC3          A           B
#PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
#PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
#PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
#A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
#B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000

#$B.W
#           PC1        PC2        PC3          A          B
#PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
#PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
#PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
#A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
#B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000
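
If only the PC-versus-A/B block of each matrix is of interest, one option is the two-argument form cor(x, y), which returns just the cross-correlations. The following is a minimal sketch rather than part of the original answer. It uses a stand-in data frame with balanced groups (an assumption, so the numbers will differ from those shown above), and the vectors pcs and others are hypothetical names introduced here.

```r
# Stand-in for the question's data; loc and seas are balanced here
# (assumption) so that every group has 5 rows.
set.seed(33550336)
df <- data.frame(PC1 = runif(20), PC2 = runif(20), PC3 = runif(20),
                 A = runif(20), B = runif(20),
                 loc  = rep(c("A", "B"), each = 10),
                 seas = rep(c("W", "S"), times = 10))

pcs    <- c("PC1", "PC2", "PC3")  # columns treated as PCs (assumed names)
others <- c("A", "B")             # columns to correlate them with

# cor(x, y) gives a matrix with one row per column of x and
# one column per column of y, so each element is a 3 x 2 block.
cross <- lapply(split(df, df[c("loc", "seas")]),
                function(d) cor(d[pcs], d[others]))

cross[["A.S"]]
```

This avoids computing and then discarding the PC-PC and A-B correlations, at the cost of naming the two sets of columns explicitly.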

Or using tidyverse

library(tidyverse)
df %>% 
    group_by_at(6:7) %>% 
    nest %>% 
    mutate(data = map(data, cor)) 

Here is a solution via tidyverse where we use summarise_at to specify all PC[0-9] and correlate them with A. Same procedure for B, and then simply merge, i.e.

library(tidyverse)

# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
df %>% 
 group_by(loc, seas) %>% 
 summarise_at(vars(starts_with('PC')), ~ cor(., A)) %>% 
 left_join(., df %>% 
                 group_by(loc, seas) %>% 
                 summarise_at(vars(starts_with('PC')), ~ cor(., B)), 
          by = c('loc', 'seas'), suffix = c('.A', '.B'))

which gives,

# A tibble: 4 x 8
# Groups:   loc [?]
  loc   seas    PC1.A  PC2.A  PC3.A    PC1.B   PC2.B PC3.B
  <fct> <fct>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <dbl>
1 A     S      0.458   0.352 -0.200 -0.00925  0.209  0.374
2 A     W      0.748  -0.399 -0.590  0.496   -0.0523 0.361
3 B     S     -0.0178 -0.350  0.803 -0.743    0.463  0.458
4 B     W     -0.450  -0.990  0.761 -0.667   -0.810  0.374
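
One possible way to generalise this to any number of variables without a left_join per variable is to reshape to long format first, so that every PC/variable pair becomes its own group. This is a sketch under assumptions, not the answerer's method: it needs tidyr 1.0 or later for pivot_longer and pivot_wider, uses a balanced stand-in data frame (so the values differ from the output above), and the names pcs, others and res are hypothetical.

```r
library(dplyr)
library(tidyr)

# Stand-in for the question's data with balanced groups (assumption).
set.seed(33550336)
df <- data.frame(PC1 = runif(20), PC2 = runif(20), PC3 = runif(20),
                 A = runif(20), B = runif(20),
                 loc  = rep(c("A", "B"), each = 10),
                 seas = rep(c("W", "S"), times = 10))

pcs    <- c("PC1", "PC2", "PC3")  # extend as needed
others <- c("A", "B")             # extend as needed

res <- df %>%
  # One row per original row and PC
  pivot_longer(all_of(pcs), names_to = "pc", values_to = "pc_value") %>%
  # ... and per other variable, pairing values from the same original row
  pivot_longer(all_of(others), names_to = "var", values_to = "var_value") %>%
  group_by(loc, seas, pc, var) %>%
  summarise(cor = cor(pc_value, var_value), .groups = "drop") %>%
  # Back to one row per loc/seas, with columns such as PC1_A
  pivot_wider(names_from = c(pc, var), values_from = cor)

res
```

Adding another variable then only means extending pcs or others; the pipeline itself stays the same.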

Comments
  • I'm not sure, but those seem to be eigenvectors :)
  • Thanks! I went with this solution, but have added the code used to restructure these results into the required format to the question.
  • I like this solution because it gives the result in the required format, but can it be generalised to an arbitrary number of variables without having to add another left_join... line for each?
  • It can. We will have to loop over the columns A, B, ... I can give it a shot tomorrow.