How to sum a variable by group

Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers representing how many times I observed the category in that row.

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to group the data by Category and sum the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
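
For instance, assuming x also had numeric columns Metric2 and Metric3 (hypothetical names, not part of the example data below), the complete call might look like:

# Metric2 and Metric3 are hypothetical extra columns used only for illustration
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3),
          by=list(Category=x$Category), FUN=sum)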

As @thelatemail notes in the comments, aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10,15,5,2,14,20,3))

You can also use the dplyr package for that purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise_all(funs(sum))
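
Note that funs() has been superseded in newer dplyr releases; with dplyr 1.0 or later, an equivalent using across() would be roughly:

x %>% 
  group_by(Category) %>% 
  summarise(across(everything(), sum))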

Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:

# several summary columns with arbitrary names
mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum" 
mtcars %>% 
  group_by(cyl) %>% 
  summarise_all(sum)

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>% 
  group_by(cyl) %>% 
  summarise_all(funs(sum, mean))

# multiple grouping columns
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise_all(funs(sum, mean))

# summarise specific variables, not all
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise_at(vars(qsec, mpg, wt), funs(sum, mean))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>% 
  group_by(gear) %>% 
  summarise_if(is.numeric, funs(mean))
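
As a side note, summarise_at() and summarise_if() have likewise been superseded; with dplyr 1.0 or later, the last two examples could be written with across() roughly as:

# summarise specific variables with across()
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(c(qsec, mpg, wt), list(sum = sum, mean = mean)))

# summarise all numeric non-grouping columns with across()
mtcars %>% 
  group_by(gear) %>% 
  summarise(across(where(is.numeric), mean))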

For more information, including the %>% operator, see the introduction to dplyr.

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that to the same operation using a data.frame and aggregate as above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column name, this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
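
data.table also accepts .() as a shorthand for list(), so the same call can be written more compactly (purely a syntactic alias, same result):

data[, .(Frequency = sum(Frequency)), by = Category]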

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply() and .SD as follows:

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
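
If only some columns should be aggregated, .SDcols restricts which columns .SD contains. A minimal sketch, assuming a hypothetical extra numeric column Metric2:

# Metric2 is a hypothetical column, not part of the example data
data[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Metric2")]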

This is somewhat related to this question.

You can also just use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
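
For reference, by() returns an object of class "by", and the do.call(rbind, ...) step collapses it into a one-column matrix; wrap it in as.data.frame() if you need a data.frame. With the example data x, the result should look roughly like:

x2 <- by(x$Frequency, x$Category, sum)
res <- do.call(rbind, as.list(x2))   # one-column matrix, Category labels as rownames
res
#        [,1]
# First    30
# Second    5
# Third    34
as.data.frame(res)                   # convert to a data.frame if preferred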

library(plyr)
ddply(x, .(Category), summarise, sum = sum(Frequency))
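
With the example data x from above, this should return a data.frame along the lines of:

#   Category sum
# 1    First  30
# 2   Second   5
# 3    Third  34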

Comments
  • The fastest way in base R is rowsum (see the sketch after these comments).
  • @AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives.
  • Being new to R (and asking the same sorts of questions as the OP), I would benefit from some more detail of the syntax behind each alternative. For instance, if I have a larger source table and want to subselect just two dimensions plus summed metrics, can I adapt any of these methods? Hard to tell.
  • How fast is it when compared to the data.table and aggregate alternatives presented in other answers?
  • @asieira, which one is fastest and how big the difference is (or whether the difference is noticeable at all) will always depend on your data size. Typically, for large data sets, for example a few GB, data.table will most likely be fastest. On smaller data sizes, data.table and dplyr are often close, also depending on the number of groups. Both data.table and dplyr will be quite a lot faster than base functions, however (it can well be 100-1000 times faster for some operations). Also see here
  • What does the "funs" refer to in the second example?
  • @lauren.marietta you can specify the function(s) you want to apply as summary inside the funs() argument of summarise_all and its related functions (summarise_at, summarise_if)
  • +1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.
  • True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.
  • I had 7 mil observations dplyr took .3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it!
  • There is an even shorter way to write data[, sum(Frequency), by = Category]: you could use .N, which substitutes for the sum() function, i.e. data[, .N, by = Category]. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/…
  • Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here.
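
Following up on the rowsum comment above, a minimal base R sketch using the example data x (rowsum returns a one-column matrix with the group labels as rownames):

rowsum(x$Frequency, x$Category)
#        [,1]
# First    30
# Second    5
# Third    34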