How to filter a data set and calculate a new variable faster in R?


I have a data set with one value per minute and I want to calculate the average value for every hour. I tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to hourly values. With only these functions I can get the mean value for every hour, but only per month, and I want it for each day.

> head(DF)
         datetime        pw        cu          year m  d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065        2017 8 18 14  53
2 2017-08-18 14:57:00 0.0604653 1.86771        2017 8 18 14  57
3 2017-08-18 14:58:00 0.0601318 1.86596        2017 8 18 14  58
4 2017-08-18 14:59:00 0.0599276 1.83761        2017 8 18 14  59
5 2017-08-18 15:00:00 0.0598998 1.84177        2017 8 18 15   0

I had to use a for loop to reduce my table. I wrote the following to do it:

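datetime <- c()
eg_bf <- c()

# `start` is assumed to already hold the first hour as a POSIXct value
for (i in 1:8760) {
    hour = start + 3600   # end of the current one-hour window

    # store the result under its own name so DF is not overwritten
    hourly = DF %>%
      filter(datetime >= start & datetime < hour) %>%
      summarise(eg = mean(pw))

    datetime = append(datetime, start)
    eg_bf = append(eg_bf, hourly$eg)

    start = hour
}
new_DF = data.frame(datetime, eg_bf)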

So I was able to get my new data set with the mean value for every hour of the year.

  datetime             eg_bf
1 2018-01-01 00:00:00  0.025
2 2018-01-01 01:00:00  0.003
3 2018-01-01 02:00:00  0.002
4 2018-01-01 03:00:00  0.010
5 2018-01-01 04:00:00  0.015

The problem I'm facing is that it takes a lot of time. The idea is to add this calculation to a Shiny UI, so every time I make a change the result must update quickly. Any idea how to speed up this calculation?

You can try this. Use make_datetime from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset. Then group and summarise on the new column.

library(dplyr)
library(lubridate)
DF %>%
   mutate(date_time = make_datetime(year, m, d, hr)) %>%  
   group_by(date_time) %>% 
   summarise(eg_bf = mean(pw))
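
If the separate year, m, d and hr columns were not available, the same grouping could probably be derived straight from the datetime column. A minimal sketch, assuming datetime is POSIXct; the toy data frame is made up for illustration, and floor_date() comes from lubridate:

```r
library(dplyr)
library(lubridate)

# Made-up minute data spanning two hours (illustration only)
toy <- data.frame(
  datetime = seq(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"),
                 by = "1 min", length.out = 120),
  pw = rep(c(1, 3), each = 60)
)

hourly <- toy %>%
  # round each timestamp down to the start of its hour, then group on it
  group_by(date_time = floor_date(datetime, "hour")) %>%
  summarise(eg_bf = mean(pw))

hourly
```

This gives one row per hour, with the hour's start time as the key, so no loop over the 8760 hours is needed.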

@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:

DF %>%
  group_by(year, m, d, hr) %>%
  summarise(test = mean(pw))

# A tibble: 2 x 5
# Groups:   year, m, d [?]
   year     m     d    hr   test
  <int> <int> <int> <int>  <dbl>
1  2017     8    18    14 0.0609
2  2017     8    18    15 0.0599

You said in your question:

When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.

What did you do differently?
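
One possible explanation, sketched on made-up data: if only some of the calendar columns are passed to group_by(), the same hour from different days is pooled into one group. The toy data frame and its values are hypothetical:

```r
library(dplyr)

# Made-up data: two days, two hours per day, one reading per hour
toy <- data.frame(
  year = 2018L, m = 1L,
  d  = c(1L, 1L, 2L, 2L),
  hr = c(0L, 1L, 0L, 1L),
  pw = c(1, 2, 3, 4)
)

# Grouping by hr alone pools the same hour across all days
by_hr <- toy %>%
  group_by(hr) %>%
  summarise(eg_bf = mean(pw))

# Grouping by every calendar column keeps each day's hours apart
by_all <- toy %>%
  group_by(year, m, d, hr) %>%
  summarise(eg_bf = mean(pw), .groups = "drop")
```

Here by_hr has two rows (one per hour of the day), while by_all has four (one per hour of each day).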

Even if you've found your answer, I believe this is worth mentioning:

If you're working with a lot of data and speed is an issue, then you might want to see if you can use data.table instead of dplyr.

You can see with a simple benchmarking how much faster data.table is:

library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)

# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"), 
                                as.POSIXct("2019-01-02 00:00:00"), 60),
                 pw = runif(527041)) %>% 
  mutate(year = year(datetime), m=month(datetime), 
         d=day(datetime), hour = hour(datetime))

# save it as a data.table
dt <- as.data.table(DF)

# transformation with dplyr
f_dplyr <- function(){
  DF %>% 
    group_by(year, m, d, hour) %>% 
    summarize(eg_bf = mean(pw))
}


# transformation with data.table
f_datatable <- function() {
  dt[, mean(pw), by=.(year, m, d, hour)]
}

# benchmarking
microbenchmark(f_dplyr(), f_datatable())

# 
# Unit: milliseconds
#          expr       min        lq     mean   median       uq      max neval cld
#     f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714   100   b
# f_datatable()  9.081295  9.712694 12.53998 10.55697 11.33933 41.85217   100  a

Check out this post, which covers a lot: "data.table vs dplyr: can one do something well the other can't or does poorly?"
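
As a side note, the data.table call in the benchmark leaves the aggregated column with the default name V1. A small sketch of the same aggregation with a named column, on made-up data (the toy table is hypothetical):

```r
library(data.table)

# Made-up data: two hours with three readings each
toy <- data.table(
  year = 2018L, m = 1L, d = 1L,
  hour = rep(0:1, each = 3),
  pw   = c(1, 2, 3, 4, 5, 6)
)

# .() builds a list, so the aggregated column can be given a name
hourly <- toy[, .(eg_bf = mean(pw)), by = .(year, m, d, hour)]

hourly
```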

As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result almost instantly. The outcome is mean(pw) grouped by every hour of the year.

remove(list = ls())

library(dplyr)
library(lubridate)
library(purrr)
library(tibble)

date_time <- seq.POSIXt(
    as.POSIXct("2018-01-01"),
    as.POSIXct("2019-01-01"),
    by = "1 min"
)

n <- length(date_time)

data <- tibble(
    date_time = date_time,
    pw = runif(n),
    cu = runif(n),
    ye = year(date_time),
    mo = month(date_time),
    da = day(date_time),
    hr = hour(date_time)
)

grouped <- data %>% 
    group_by(
        ye, mo, da, hr
    ) %>% 
    summarise(
        mean_pw = mean(pw)
    )
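
To get back to the two-column layout shown in the question (datetime and eg_bf), the grouping columns could be collapsed into a single timestamp afterwards. A sketch on a made-up stand-in for the grouped result; grouped_toy and its values are hypothetical, and make_datetime() defaults to UTC:

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for an hourly summary (illustration only)
grouped_toy <- tibble(
  ye = 2018L, mo = 1L, da = 1L, hr = 0:1,
  mean_pw = c(0.025, 0.003)
)

result <- grouped_toy %>%
  ungroup() %>%                                   # no-op here, needed on a grouped result
  mutate(datetime = make_datetime(ye, mo, da, hr)) %>%
  select(datetime, eg_bf = mean_pw)               # keep and rename in one step

result
```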



Comments
  • Just what I was looking for. Thank you!
  • Yes, the problem was that I used filter() for the year and then I grouped by month or hr. I was not including the m,d and hr in group_by()