Sum up two variables in a long-format dataframe with tidyverse
I have a simple data frame in a tidy format:
    group variable             value
    <fct> <chr>                <dbl>
    1     fishers_here           100
    1     money_per_fisher      2000
    1     unnecessary_variable    10
    2     fishers_here           140
    2     money_per_fisher      8000
    2     unnecessary_variable   304
    3     fishers_here            10
    3     money_per_fisher      9000
    ....
For each group I'd like to have the variable "total money in group", which is just fishers_here * money_per_fisher; basically I'd like it to look like this:
    group variable             value
    <fct> <chr>                <dbl>
    1     fishers_here           100
    1     money_per_fisher      2000
    1     unnecessary_variable    10
    1     TOTAL_MONEY         200000
    ....
Is there a simple way to get this done with tidyverse? By simple I mean without having to filter, summarise, add the variable column back in and then join the two now separate dataframes.
You can spread, do the multiplication and then gather back up. Note I'm assuming that there is a typo in the group number in row 6 as I commented, where it should be group 2 instead of group 1. If that's not the case, then some additional cleaning steps are needed. You can also sort the resulting rows however you want (e.g. to put the rows for each group back together).
    library(tidyverse)

    tbl <- read_table2(
    "group variable value
    1 fishers_here 100
    1 money_per_fisher 2000
    1 unnecessary_variable 10
    2 fishers_here 140
    2 money_per_fisher 8000
    2 unnecessary_variable 304
    3 fishers_here 10
    3 money_per_fisher 9000"
    )

    tbl %>%
      spread(variable, value) %>%
      mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
      gather(variable, value, -group)
    #> # A tibble: 12 x 3
    #>    group variable               value
    #>    <dbl> <chr>                  <dbl>
    #>  1     1 fishers_here             100
    #>  2     2 fishers_here             140
    #>  3     3 fishers_here              10
    #>  4     1 money_per_fisher        2000
    #>  5     2 money_per_fisher        8000
    #>  6     3 money_per_fisher        9000
    #>  7     1 unnecessary_variable      10
    #>  8     2 unnecessary_variable     304
    #>  9     3 unnecessary_variable      NA
    #> 10     1 total_money_in_group  200000
    #> 11     2 total_money_in_group 1120000
    #> 12     3 total_money_in_group   90000
Created on 2019-02-04 by the reprex package (v0.2.1)
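For readers on a current tidyr: spread() and gather() are superseded by pivot_wider() and pivot_longer(), and read_table2() is deprecated in recent readr releases. The following is a minimal sketch of the same widen, multiply, lengthen idea with the newer verbs, not the answerer's original code. It assumes the example data shown above and rebuilds it with tibble::tribble(); values_drop_na = TRUE and the final arrange() are my additions.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rebuilt by hand (assumption: same values as above).
tbl <- tibble::tribble(
  ~group, ~variable,              ~value,
  1,      "fishers_here",         100,
  1,      "money_per_fisher",     2000,
  1,      "unnecessary_variable", 10,
  2,      "fishers_here",         140,
  2,      "money_per_fisher",     8000,
  2,      "unnecessary_variable", 304,
  3,      "fishers_here",         10,
  3,      "money_per_fisher",     9000
)

result <- tbl %>%
  # one column per variable, one row per group
  pivot_wider(names_from = variable, values_from = value) %>%
  # the new quantity is a plain column-wise product in wide form
  mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
  # back to long form; drop the NA created for group 3's missing variable
  pivot_longer(-group, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  arrange(group)

result
```

Because values_drop_na = TRUE removes the filler row for group 3, the result has 11 rows rather than 12, with each group's rows kept together.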
An option would be to filter the 'money_per_fisher' and 'fishers_here' rows, grouped by 'group', summarise to get the prod of 'value', bind the rows with the original data and arrange by 'group'.
    library(tidyverse)

    df1 %>%
      # keep only the two variables that enter the product
      filter(variable %in% c('fishers_here', 'money_per_fisher')) %>%
      group_by(group) %>%
      # one new row per group holding the product
      summarise(variable = "total_money_in_group", value = prod(value)) %>%
      # append the new rows to the original data (df1, defined below)
      bind_rows(df1, .) %>%
      arrange(group)
    # A tibble: 11 x 3
    #   group variable               value
    #   <int> <chr>                  <dbl>
    # 1     1 fishers_here             100
    # 2     1 money_per_fisher        2000
    # 3     1 unnecessary_variable      10
    # 4     1 total_money_in_group  200000
    # 5     2 fishers_here             140
    # 6     2 money_per_fisher        8000
    # 7     2 unnecessary_variable     304
    # 8     2 total_money_in_group 1120000
    # 9     3 fishers_here              10
    #10     3 money_per_fisher        9000
    #11     3 total_money_in_group   90000
    df1 <- structure(list(
      group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
      variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
                   "fishers_here", "money_per_fisher", "unnecessary_variable",
                   "fishers_here", "money_per_fisher"),
      value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L)
    ), class = "data.frame", row.names = c(NA, -8L))
Based on your output I think this is a possible solution:
df %>% group_by(group) %>% summarise(value = prod(value))
Edit: If you want a column on the original dataset you can use mutate instead of summarise.
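A minimal sketch of the difference between the two verbs here, assuming the example data from the question. Taking the product over all values in a group would also multiply in unnecessary_variable, so this sketch subsets inside prod(); the names wanted and total_money are hypothetical and only for illustration.

```r
library(dplyr)

# Example data from the question (assumption: same values, stored as doubles).
df1 <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000)
)

# Hypothetical helper vector: the variables that should enter the product.
wanted <- c("fishers_here", "money_per_fisher")

# summarise(): collapses each group to a single row with the product.
totals <- df1 %>%
  group_by(group) %>%
  summarise(total_money = prod(value[variable %in% wanted]))

# mutate(): keeps every original row and repeats the group's product on each.
with_total <- df1 %>%
  group_by(group) %>%
  mutate(total_money = prod(value[variable %in% wanted])) %>%
  ungroup()
```

Here totals has one row per group, while with_total keeps all eight original rows and adds total_money as an extra column rather than as extra rows.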
- let me rephrase it for added clarity
- No. The easiest would be to summarize and merge. None of the verbs other than the joins make it possible to add new rows. You could maybe use do(), but I'm not sure how recommended that is any more.
- Is there a typo in row 6, where it should be group 1? or are there actually duplicate rows
- thanks for this; it won't work since it'll multiply over variables that are not of interest. Of course I could filter first, but I was hoping to avoid having to do all that work and then having to left_join the two dataframes later on.
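One way to follow up on the do() idea from the comments without a separate join is group_modify(), which dplyr offers as a per-group successor to do(). The sketch below is my own illustration, not something proposed in the thread: it assumes the example data from the question and uses tibble::add_row() to append one row to each group's slice.

```r
library(dplyr)

# Example data from the question (assumption: same values, stored as doubles).
df1 <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000)
)

result <- df1 %>%
  group_by(group) %>%
  # .x is the slice of rows for one group, without the grouping column
  group_modify(~ tibble::add_row(
    .x,
    variable = "total_money_in_group",
    # multiply only the two variables of interest
    value = prod(.x$value[.x$variable %in% c("fishers_here", "money_per_fisher")])
  )) %>%
  ungroup()

result
```

Each group keeps its original rows and gains one total_money_in_group row at the end, so no filter, summarise and join round trip is needed.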