How to sum a variable by group
Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw the corresponding category.
Category  Frequency
First     10
First     15
First     5
Second    2
Third     14
Third     20
Second    3
I want to sort the data by Category and sum the Frequencies:
Category  Frequency
First     30
Second    5
Third     34
How would I do this in R?
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
(embedding @thelatemail comment),
aggregate has a formula interface too
aggregate(Frequency ~ Category, x, sum)
Or if you want to aggregate multiple columns, you could use the . notation (works for one column too):
aggregate(. ~ Category, x, sum)
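The cbind and formula forms above can be combined, and more than one grouping variable can go on the right-hand side of the formula. A minimal sketch, assuming a hypothetical extended data frame x_ext in which the Region and Metric2 columns are invented for illustration and are not part of the original question:

```r
# Hypothetical data: the question's columns plus an invented Region
# grouping column and an invented second metric (Metric2).
x_ext <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Region    = c("A", "B", "A", "A", "B", "B", "A"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Sum both metrics within each Category/Region combination.
res <- aggregate(cbind(Frequency, Metric2) ~ Category + Region,
                 data = x_ext, FUN = sum)
print(res)
```

Only combinations that actually occur in the data appear in the result, so this returns four rows rather than the full Category by Region grid.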
tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third
    30      5     34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10,15,5,2,14,20,3))
You can also use the dplyr package for that purpose:
library(dplyr)
x %>%
  group_by(Category) %>%
  summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34
Or, for multiple summary columns (works with one column too):
x %>% group_by(Category) %>% summarise_all(funs(sum))
Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:
# several summary columns with arbitrary names
mtcars %>%
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum"
mtcars %>%
  group_by(cyl) %>%
  summarise_all(sum)

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
  group_by(cyl) %>%
  summarise_all(funs(sum, mean))

# multiple grouping columns
mtcars %>%
  group_by(cyl, gear) %>%
  summarise_all(funs(sum, mean))

# summarise specific variables, not all
mtcars %>%
  group_by(cyl, gear) %>%
  summarise_at(vars(qsec, mpg, wt), funs(sum, mean))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
  group_by(gear) %>%
  summarise_if(is.numeric, funs(mean))
For more information, including the %>% operator, see the introduction to dplyr.
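Note that funs() was deprecated in dplyr 0.8.0, and the summarise_all/summarise_at/summarise_if family has been superseded by across() since dplyr 1.0.0. A sketch of roughly equivalent calls, assuming dplyr 1.0.0 or later is installed (the result names by_cat, selected and numeric_means are only illustrative):

```r
library(dplyr)

x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Counterpart of summarise_all(funs(sum)): sum every non-grouping column.
by_cat <- x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))

# Counterpart of summarise_at(vars(qsec, mpg, wt), funs(sum, mean)):
# a named list of functions yields columns such as qsec_sum and qsec_mean.
selected <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(across(c(qsec, mpg, wt), list(sum = sum, mean = mean)),
            .groups = "drop")

# Counterpart of summarise_if(is.numeric, funs(mean)).
numeric_means <- mtcars %>%
  group_by(gear) %>%
  summarise(across(where(is.numeric), mean))
```

The older funs()-based calls shown above may still run with a deprecation warning, depending on the installed dplyr version.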
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
#   user  system elapsed
#  0.008   0.001   0.009
Let's compare that to the same thing using data.frame and the aggregate approach above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
#   user  system elapsed
#  0.008   0.000   0.015
And if you want to keep the column name (Frequency instead of V1), this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
# 300,000 rows: one Frequency value per row, so no recycling is needed
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))
system.time( data[,sum(Frequency),by=Category] )
#   user  system elapsed
#  0.055   0.004   0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
#   user  system elapsed
#  0.287   0.010   0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
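By default .SD holds every column except the grouping ones, so the call above fails if the table also contains non-numeric columns. A minimal sketch of restricting it with .SDcols, assuming a hypothetical table dt in which the Metric2 and Note columns are invented for illustration:

```r
library(data.table)

# Hypothetical table: Metric2 and Note are invented columns, not part of
# the original question.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7),
  Note      = letters[1:7]
)

# .SDcols limits .SD to the named columns, so the character column Note
# is never passed to sum().
sums <- dt[, lapply(.SD, sum), by = Category,
           .SDcols = c("Frequency", "Metric2")]
print(sums)
```

The result keeps the original column names, one row per Category, in order of first appearance.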
This is somewhat related to this question.
You can also just use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
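If a data.frame is wanted from by() without loading another package, the result can be converted by hand. A small sketch under the assumption that the summary function returns a single number per group (the names totals and res are only illustrative):

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# by() returns a one-dimensional array-like object named by group.
totals <- by(x$Frequency, x$Category, sum)

# Build a plain data.frame from the group names and the summed values.
res <- data.frame(
  Category  = names(totals),
  Frequency = as.numeric(totals)
)
print(res)
```

The groups come out in sorted order of Category because by() converts the grouping vector to a factor.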
library(plyr)
# tbl is the data frame holding the Category and Frequency columns
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
- The fastest way in base R is
- @AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives.
- Being new to R (and asking the same sorts of questions as the OP), I would benefit from some more detail of the syntax behind each alternative. For instance, if I have a larger source table and want to subselect just two dimensions plus summed metrics, can I adapt any of these methods? Hard to tell.
- How fast is it when compared to the data.table and aggregate alternatives presented in other answers?
- @asieira, Which is fastest and how big the difference is (or whether the difference is noticeable) will always depend on your data size. Typically, for large data sets, for example some GB, data.table will most likely be fastest. On smaller data sizes, data.table and dplyr are often close, also depending on the number of groups. Both data.table and dplyr will be quite a lot faster than base functions, however (can well be 100-1000 times faster for some operations). Also see here
- What does the "funs" refer to in the second example?
- @lauren.marietta you can specify the function(s) you want to apply as summary inside the
summarise_all and its related functions (
- +1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.
- True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.
- I had 7 million observations: dplyr took 0.3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it!
- There is an even shorter way to write this
data[, sum(Frequency), by = Category]. You could use
.N which substitutes the
data[, .N, by = Category]. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/…
- Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here.
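The distinction made in the last comment can be checked directly. A minimal sketch reusing the question's data, with .N and sum(Frequency) computed side by side (the column names rows and total are only illustrative):

```r
library(data.table)

data <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .N counts the rows in each group; sum(Frequency) adds up their values.
counts <- data[, .(rows = .N, total = sum(Frequency)), by = Category]
print(counts)
```

For "First" this gives 3 rows but a total of 30, so the two only coincide when every Frequency value is 1.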