## Compute mean and standard deviation by group for multiple variables in a data.frame

**Edit** -- This question was originally titled << Long to wide data reshaping in R >>

I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:

    ID  Obs 1  Obs 2  Obs 3
     1     43     48     37
     1     27     29     22
     1     36     32     40
     2     33     38     36
     2     29     32     27
     2     32     31     35
     2     25     28     24
     3     45     47     42
     3     38     40     36

And what I want to end up with will look like this:

    ID  Obs 1 mean  Obs 1 std dev  Obs 2 mean  Obs 2 std dev
     1           x              x           x              x
     2           x              x           x              x
     3           x              x           x              x

And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.

Thanks very much for any help.

This is an aggregation problem, not a reshaping problem as the question originally suggested: we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using `aggregate` like this (assuming `DF` is the input data frame):

    ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

**Note 1:** A commenter pointed out that `ag` is a data frame for which some columns are matrices. Although initially that may seem strange, in fact it simplifies access. `ag` has the same number of columns as the input `DF`. Its first column `ag[[1]]` is `ID` and the ith column of the remainder `ag[[i+1]]` (or equivalently `ag[-1][[i]]`) is the matrix of statistics for the ith input observation column. If one wishes to access the jth statistic of the ith observation it is therefore `ag[[i+1]][, j]`, which can also be written as `ag[-1][[i]][, j]`.

On the other hand, suppose there are `k` statistic columns for each observation in the input (where k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex `ag[[k*(i-1)+j+1]]` or equivalently `ag[-1][[k*(i-1)+j]]`.

For example, compare the simplicity of the first expression vs. the second:

    ag[-1][[2]]
    ##        mean      sd
    ## [1,] 36.333 10.2144
    ## [2,] 32.250  4.1932
    ## [3,] 43.500  4.9497

    ag_flat <- do.call("data.frame", ag) # flatten
    ag_flat[-1][, 2 * (2-1) + 1:2]
    ##   Obs_2.mean Obs_2.sd
    ## 1     36.333  10.2144
    ## 2     32.250   4.1932
    ## 3     43.500   4.9497

**Note 2:** The input in reproducible form is:

    Lines <- "ID Obs_1 Obs_2 Obs_3
    1 43 48 37
    1 27 29 22
    1 36 32 40
    2 33 38 36
    2 29 32 27
    2 32 31 35
    2 25 28 24
    3 45 47 42
    3 38 40 36"
    DF <- read.table(text = Lines, header = TRUE)
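To tie the two notes together, here is a minimal end-to-end sketch in base R. It reads the question's data, aggregates it, and compares the nested result with the flattened one. The names `DF`, `ag` and `ag_flat` follow the answer above; the `stopifnot()` checks are only an illustration added here, not part of the original answer.

```r
# Input data from the question, in reproducible form
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# One row per ID; each Obs_* column becomes a 2-column matrix (mean, sd)
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

# Flatten the matrix columns into ordinary columns
ag_flat <- do.call("data.frame", ag)

# Nested form: ID plus one matrix per observation column
stopifnot(ncol(ag) == 4, is.matrix(ag[[3]]))

# Flattened form: ID plus mean and sd for each of the 3 observation columns
stopifnot(ncol(ag_flat) == 7)

# The same statistic reached both ways (mean of Obs_2)
stopifnot(all.equal(ag[-1][[2]][, "mean"], ag_flat$Obs_2.mean,
                    check.attributes = FALSE))
```

The flattened form is usually easier to export or print, while the nested form keeps each variable's statistics together.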


There are a few different ways to go about it. `reshape2` is a helpful package. Personally, I like using `data.table`.

Below is a step-by-step walkthrough.

If `myDF` is your `data.frame`:

    library(data.table)
    DT <- data.table(myDF)

    DT

    # this will get you your mean and SD's for each column
    DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]

    # adding a `by` argument will give you the groupings
    DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]

    # If you would like to round the values:
    DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]

    # If we want to add names to the columns
    wide <- setnames(DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
                     c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD"))))

    wide

       ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
    1:  1     35.333    8.021     36.333   10.214       33.0    9.644
    2:  2     29.750    3.594     32.250    4.193       30.5    5.916
    3:  3     41.500    4.950     43.500    4.950       39.0    4.243

Also, this may or may not be helpful:

    > DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
            Obs.1 Obs.2 Obs.3
    Min.    25.00 28.00 22.00
    1st Qu. 29.00 31.00 27.00
    Median  33.00 32.00 36.00
    Mean    34.22 36.11 33.22
    3rd Qu. 38.00 40.00 37.00
    Max.    45.00 48.00 42.00


Here is probably the simplest way to go about it (with a reproducible example):

    library(plyr)
    df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
    ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
          Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))

      ID  Obs_1_mean Obs_1_std_dev  Obs_2_mean Obs_2_std_dev
    1  1 -0.13994642     0.8258445 -0.15186380     0.4251405
    2  2  1.49982393     0.2282299  0.50816036     0.5812907
    3  3 -0.09269806     0.6115075 -0.01943867     1.3348792

EDIT: The following approach saves you a lot of typing when dealing with many columns.

    ddply(df, .(ID), colwise(mean))

      ID      Obs_1      Obs_2      Obs_3
    1  1 -0.3748831  0.1787371  1.0749142
    2  2 -1.0363973  0.0157575 -0.8826969
    3  3  1.0721708 -1.1339571 -0.5983944

    ddply(df, .(ID), colwise(sd))

      ID     Obs_1     Obs_2     Obs_3
    1  1 0.8732498 0.4853133 0.5945867
    2  2 0.2978193 1.0451626 0.5235572
    3  3 0.4796820 0.7563216 1.4404602


Here is a `dplyr` solution.

    set.seed(1)
    df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

    library(dplyr)
    df %>% group_by(ID) %>% summarise_each(funs(mean, sd))

    #      ID Obs_1_mean Obs_2_mean Obs_3_mean  Obs_1_sd  Obs_2_sd  Obs_3_sd
    #   (int)      (dbl)      (dbl)      (dbl)     (dbl)     (dbl)     (dbl)
    # 1     1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
    # 2     2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
    # 3     3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
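`summarise_each()` and `funs()` have since been deprecated in `dplyr`. A minimal sketch of the same idea using `across()` follows, assuming `dplyr` 1.0.0 or later is installed; the variable name `res` is introduced here only for illustration. Note that the columns come out grouped by variable (`Obs_1_mean`, `Obs_1_sd`, ...) rather than by statistic as in the output above.

```r
library(dplyr)

# Same simulated data as in the answer above
set.seed(1)
df <- data.frame(ID = rep(1:3, 3),
                 Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

# across() applies every function in the named list to every non-grouping
# column; result columns are named "<column>_<function>" by default
res <- df %>%
  group_by(ID) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

res
```

Passing a named list is what produces the `_mean` and `_sd` suffixes, so adding another statistic is a matter of adding one more entry to that list.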


Here's another take on the `data.table` answers, using @Carson's data, that's a bit more readable (and also a little faster, because of using `lapply` instead of `sapply`):

    library(data.table)
    set.seed(1)
    dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

    dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
    #   ID mean.Obs_1 mean.Obs_2 mean.Obs_3  sd.Obs_1  sd.Obs_2  sd.Obs_3
    #1:  1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
    #2:  2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
    #3:  3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692


##### Comments

- Just a comment: I don't think that's what folks usually mean by moving from long to wide format.
- Plenty have commented, but I am surprised no one cared to fix such a misleading title (now done.)
- Perhaps important to note: While the output of this will appear to be a `data.frame` with two columns for each column being aggregated (resulting in 7 columns with your example data), if you view the structure, you'll see that it is actually just four columns, with the aggregated columns being *matrices*. You can fix that with `do.call(data.frame, aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x))))`.
- @Ananda Mahto, Good point. I have added some comments elaborating on this.
- I tried this and got `Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.` Traceback showed that the problem was with the form of the call to `sapply`.
- There's one more observation you missed out. While this is the way to go with fewer columns, I think it gets ugly very quickly.
- can we calculate mean of rows using this method ?
- the second one should use `sd` and you use `.SD` twice.. is there a performance issue due to that? any idea?
- @Arun, thanks, fixed the `sd` bit. I don't know if there is a performance hit because of that, let me check
- @Arun looks like there is an ~10% performance hit, but the good news is that it doesn't increase with more categories
- Also you'll see an optimisation message about creating names (mean, sd) for every `by` (which will be inefficient for huge data). I'm benchmarking on a 1e6 data.table. Will post the results shortly.
- This works for me, however the resulting columns all have the same name, i.e. `Obs_1`, `Obs_2`, `Obs_3`, `Obs_1`, `Obs_2`, `Obs_3`, not `mean.Obs_1` ... any ideas why that is the case?