Sum up two variables in a long-format dataframe with tidyverse

I have a simple data frame in a tidy format:

  group variable               value
  <fct> <chr>                  <dbl>
1     fishers_here         100
1     money_per_fisher     2000
1     unnecessary_variable 10
2     fishers_here         140
2     money_per_fisher     8000
2     unnecessary_variable 304
3     fishers_here         10
3     money_per_fisher     9000
....

For each group, I'd like to add a variable, "total money in group", which is just fishers_here * money_per_fisher. Basically, I'd like the result to look like this:

  group variable               value
  <fct> <chr>                  <dbl>
1     fishers_here         100
1     money_per_fisher     2000
1     unnecessary_variable 10
1     TOTAL_MONEY          200000

....

Is there a simple way to get this done with tidyverse? By simple I mean without having to filter, summarise, add the variable column back in and then join the two now separate dataframes.

You can spread the data into wide format, do the multiplication, and then gather it back into long format. Note that I'm assuming there is a typo in the group number in row 6, as I commented: it should be group 2 instead of group 1. If that's not the case, some additional cleaning steps are needed. You can also sort the resulting rows however you want (e.g. to put the rows for each group back together).

library(tidyverse)
# Note: read_table2() is deprecated as of readr 2.0.0; read_table() replaces it there
tbl <- read_table2(
  "group variable               value
  1     fishers_here         100
1     money_per_fisher     2000
1     unnecessary_variable 10
2     fishers_here         140
2     money_per_fisher     8000
2     unnecessary_variable 304
3     fishers_here         10
3     money_per_fisher     9000"
)
tbl %>%
  spread(variable, value) %>%
  mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
  gather(variable, value, -group)
#> # A tibble: 12 x 3
#>    group variable               value
#>    <dbl> <chr>                  <dbl>
#>  1     1 fishers_here             100
#>  2     2 fishers_here             140
#>  3     3 fishers_here              10
#>  4     1 money_per_fisher        2000
#>  5     2 money_per_fisher        8000
#>  6     3 money_per_fisher        9000
#>  7     1 unnecessary_variable      10
#>  8     2 unnecessary_variable     304
#>  9     3 unnecessary_variable      NA
#> 10     1 total_money_in_group  200000
#> 11     2 total_money_in_group 1120000
#> 12     3 total_money_in_group   90000

Created on 2019-02-04 by the reprex package (v0.2.1)
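
In current tidyr releases, spread() and gather() are superseded by pivot_wider() and pivot_longer(). Below is a minimal sketch of the same approach with the newer verbs, assuming tidyr 1.0.0 or later. The name `result` is only illustrative, and `values_drop_na = TRUE` is an addition that drops the NA row group 3 would otherwise get for unnecessary_variable.

```r
library(dplyr)
library(tidyr)

# Same example data as above, typed in directly
tbl <- tibble::tribble(
  ~group, ~variable,              ~value,
  1,      "fishers_here",         100,
  1,      "money_per_fisher",     2000,
  1,      "unnecessary_variable", 10,
  2,      "fishers_here",         140,
  2,      "money_per_fisher",     8000,
  2,      "unnecessary_variable", 304,
  3,      "fishers_here",         10,
  3,      "money_per_fisher",     9000
)

result <- tbl %>%
  # long -> wide: one column per variable
  pivot_wider(names_from = variable, values_from = value) %>%
  # the new quantity is now a plain column-wise product
  mutate(total_money_in_group = fishers_here * money_per_fisher) %>%
  # wide -> long again; drop the NA created for the missing combination
  pivot_longer(-group, names_to = "variable", values_to = "value",
               values_drop_na = TRUE)
```

Because pivot_longer() works through the wide table row by row, the rows for each group come out together, so no extra arrange() should be needed here.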

One option is to filter the rows where 'variable' is 'fishers_here' or 'money_per_fisher', group by 'group', summarise to get the product (prod) of 'value', bind the resulting rows to the original data, and arrange by 'group':

library(tidyverse)
df1 %>%
   filter(variable %in% c('fishers_here', 'money_per_fisher')) %>%
   group_by(group) %>% 
   summarise(variable = "total_money_in_group", value = prod(value)) %>% 
   bind_rows(df1, .) %>% 
   arrange(group)
# A tibble: 11 x 3
#   group variable               value
#   <int> <chr>                  <dbl>
# 1     1 fishers_here             100
# 2     1 money_per_fisher        2000
# 3     1 unnecessary_variable      10
# 4     1 total_money_in_group  200000
# 5     2 fishers_here             140
# 6     2 money_per_fisher        8000
# 7     2 unnecessary_variable     304
# 8     2 total_money_in_group 1120000
# 9     3 fishers_here              10
#10     3 money_per_fisher        9000
#11     3 total_money_in_group   90000

Data:

df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
 variable = c("fishers_here", 
 "money_per_fisher", "unnecessary_variable", "fishers_here", "money_per_fisher", 
 "unnecessary_variable", "fishers_here", "money_per_fisher"), 
value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L
)), class = "data.frame", row.names = c(NA, -8L))
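
If the filter / summarise / bind_rows sequence feels heavy, one possible alternative is to append the extra row group by group. This is only a rough sketch, assuming dplyr 0.8.1 or later for group_modify() and tibble's add_row(). The name `with_totals` is illustrative, the values are stored as doubles rather than integers, and the code assumes every group has exactly one fishers_here row and one money_per_fisher row.

```r
library(dplyr)
library(tibble)

# Same data as df1 above, with value stored as double (an assumption of this sketch)
df1 <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000),
  stringsAsFactors = FALSE
)

with_totals <- df1 %>%
  group_by(group) %>%
  # .x holds the rows of one group (without the grouping column);
  # add_row() appends the product as one extra row for that group
  group_modify(~ add_row(
    .x,
    variable = "total_money_in_group",
    value = .x$value[.x$variable == "fishers_here"] *
      .x$value[.x$variable == "money_per_fisher"]
  )) %>%
  ungroup()
```

The new row lands at the end of each group, so the result is already ordered by group without a separate arrange() step.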

Based on your output I think this is a possible solution:

df %>% 
   group_by(group) %>% 
   summarise(value = prod(value))

Edit: If you want a new column on the original dataset instead, you can use mutate in place of summarise.
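
A minimal sketch of that mutate variant follows, restricted to the two variables of interest so that unnecessary_variable does not enter the product. The data frame `df` is rebuilt here from the question's values, and `with_column` is an illustrative name. Note that this adds a column repeated on every row of a group, not a new row.

```r
library(dplyr)

# Example data from the question (df is assumed to look like this)
df <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000),
  stringsAsFactors = FALSE
)

with_column <- df %>%
  group_by(group) %>%
  # multiply only the two relevant values within each group
  mutate(total_money_in_group = prod(
    value[variable %in% c("fishers_here", "money_per_fisher")]
  )) %>%
  ungroup()
```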

Comments
  • Let me rephrase it for added clarity.
  • No. The easiest would be to summarize and merge. None of the verbs other than the joins make it possible to add new rows. You could maybe use do(), but I'm not sure how recommended that is any more.
  • Is there a typo in row 6, where it should be group 1? Or are there actually duplicate rows?
  • Thanks for this; it won't work, since it'll multiply over variables that are not of interest. Of course I could filter first, but I was hoping to avoid having to do all that work and then having to left_join the two dataframes later on.