Filtering several date columns in R

lubridate filter by date
filter in r
as.date in r
selectbydate in r
filter time series in r
lubridate r
subset time series r
select by date r

I have a data frame with 3 columns. (https://pastebin.com/DFqUuuDp)

The first two columns ("Time1","Time2") contain datetime data and have both the posixct format: "%Y-%m-%d %H:%M:%S".

So what i need ultimately is a subselect of rows, where for a particular time in Time1, only the rows are selected where Time2

  • is one day before Time1 (here one day is not always 24 hours; one day means the calendar day before. basically "yesterday")
  • and Time2 is < 12:00 Am.

A correct example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 00:00:00  |
+---------------------+----------------------+

A wrong example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 12:00:00  |
+---------------------+----------------------+

In the uploaded file, I added manually the third column ("Value") as a guidance for rows which i want to have filtered at the end. The rows with a "True" are interesting me.

I solved it with two for loops, but it is very slow going through large tables.

The question has been tagged data.table. In addition, the sample datasets provided by the OP are of class data.table. Therefore, I feel obliged to post a data.table solution:

library(data.table)
DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12]
                 Time1      Time2 Value
1: 2016-11-01 00:00:00 2016-10-31  TRUE
2: 2016-11-01 00:30:00 2016-10-31  TRUE
3: 2016-11-01 01:00:00 2016-10-31  TRUE
4: 2016-11-01 01:30:00 2016-10-31  TRUE
# check if result is correct
identical(DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12L],
          DT[Value == "TRUE"])
[1] TRUE

as.IDate() is a Date class with integer storage for fast sorting and grouping. So, we can use integer arithmetic to compute the previous day. hour() is also provided by the data.table package and returns the hours of the day as integer value.

Reproducible data

Data copied from the pastebin link provided by the OP on 2018-05-29 at 22:00 UTC but with the .internal.selfref pointer removed:

DT <- structure(
  list(
    Time1 = structure(c(1477958400, 1477958400, 1477958400,
                        1477958400, 1477958400, 1477958400, 1477958400, 1477960200, 1477960200,
                        1477960200, 1477960200, 1477960200, 1477960200, 1477960200, 1477962000,
                        1477962000, 1477962000, 1477962000, 1477962000, 1477962000, 1477962000,
                        1477963800, 1477963800, 1477963800, 1477963800, 1477963800, 1477963800,
                        1477963800), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"),
    Time2 = structure(c(1477699200, 1477742400, 1477785600, 1477828800,
                        1477872000, 1477915200, 1477958400, 1477699200, 1477742400,
                        1477785600, 1477828800, 1477872000, 1477915200, 1477958400,
                        1477699200, 1477742400, 1477785600, 1477828800, 1477872000,
                        1477915200, 1477958400, 1477699200, 1477742400, 1477785600,
                        1477828800, 1477872000, 1477915200, 1477958400), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Value = c("FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE")), 
  .Names = c("Time1", "Time2", "Value"), 
  row.names = c(NA, -28L), 
  class = c("data.table", "data.frame"))

R: How to filter/subset a sequence of dates, you could use subset. Generating your sample data: temp<- read.table(text="date sessions 2014-12-01 1932 2014-12-02 1828 2014-12-03  One of the most basic data wrangling tasks is filtering data. Starting from a large dataset, and reducing it to a smaller, more manageable dataset, based on some criteria. Think of filtering your sock drawer by color, and pulling out only the black socks. Whenever I need to filter in R, I turn to the dplyr filter function.

The following solution works. The solution uses data shared by OP

library(dplyr)
library(lubridate)

df %>%
  filter((as.Date(Time2)+days(1)) == as.Date(Time1) & format(Time2, "%H") < 12)


#                   Time1      Time2 Value
# 1   2016-11-01 00:00:00 2016-10-31  TRUE
# 2   2016-11-01 00:30:00 2016-10-31  TRUE
# 3   2016-11-01 01:00:00 2016-10-31  TRUE
# 4   2016-11-01 01:30:00 2016-10-31  TRUE
# 5   2016-11-01 02:00:00 2016-10-31  TRUE
# 6   2016-11-01 02:30:00 2016-10-31  TRUE
# 7   2016-11-01 03:00:00 2016-10-31  TRUE
# 8   2016-11-01 03:30:00 2016-10-31  TRUE
# 9   2016-11-01 04:00:00 2016-10-31  TRUE
# 10  2016-11-01 04:30:00 2016-10-31  TRUE
# so on

filter.date function, Selects data for a given visual meteor dataset and specified dates. filter.date returns data frame with the same number of columns as the argument data  Filtering a dataframe in R based on multiple Conditions [closed] Let df be the dataframe with at least three columns gender, age and bp. The idea behind

A solution using subset and lubridate package can be using an approach as:

  1. Add 1 day to Time2 and then check Time1 & Time2 belongs to same day.
  2. Format Time2 in HHMMSS format and then check it is less than 120000 (noon)

The code:

library(lubridate)

subset(df, format(Time1,"%Y%m%d") == format(Time2+days(1),"%Y%m%d") &
        as.integer(format(Time2, "%H%M%S")) < 120000 )

#                    Time1      Time2 Value
# 19   2016-11-01 00:00:00 2016-10-31  TRUE
# 39   2016-11-01 00:30:00 2016-10-31  TRUE
# 59   2016-11-01 01:00:00 2016-10-31  TRUE
# 79   2016-11-01 01:30:00 2016-10-31  TRUE
# 99   2016-11-01 02:00:00 2016-10-31  TRUE
# 119  2016-11-01 02:30:00 2016-10-31  TRUE
# 139  2016-11-01 03:00:00 2016-10-31  TRUE
# 159  2016-11-01 03:30:00 2016-10-31  TRUE
# 179  2016-11-01 04:00:00 2016-10-31  TRUE
# 
# so on

Note: Time2 for all rows as part of subset contains 00:00:00. Hence its not appearing in above print.

Data:

head(df, 20)
# Time1               Time2 Value
# 1  2016-11-01 2016-10-22 00:00:00 FALSE
# 2  2016-11-01 2016-10-22 12:00:00 FALSE
# 3  2016-11-01 2016-10-23 00:00:00 FALSE
# 4  2016-11-01 2016-10-23 12:00:00 FALSE
# 5  2016-11-01 2016-10-24 00:00:00 FALSE
# 6  2016-11-01 2016-10-24 12:00:00 FALSE
# 7  2016-11-01 2016-10-25 00:00:00 FALSE
# 8  2016-11-01 2016-10-25 12:00:00 FALSE
# 9  2016-11-01 2016-10-26 00:00:00 FALSE
# 10 2016-11-01 2016-10-26 12:00:00 FALSE
# 11 2016-11-01 2016-10-27 00:00:00 FALSE
# 12 2016-11-01 2016-10-27 12:00:00 FALSE
# 13 2016-11-01 2016-10-28 00:00:00 FALSE
# 14 2016-11-01 2016-10-28 12:00:00 FALSE
# 15 2016-11-01 2016-10-29 00:00:00 FALSE
# 16 2016-11-01 2016-10-29 12:00:00 FALSE
# 17 2016-11-01 2016-10-30 00:00:00 FALSE
# 18 2016-11-01 2016-10-30 12:00:00 FALSE
# 19 2016-11-01 2016-10-31 00:00:00  TRUE
# 20 2016-11-01 2016-10-31 12:00:00 FALSE

Use Tidyverse Pipes to Subset Time Series Data in R, Learn how to extract and plot data by a range of dates using pipes in R. Subset data using the dplyr filter() function. daily sum values and added some noData values to ensure you learn how to clean them! Also your data have a header (​the first row represents column names) so set header = TRUE. The filter () verb helps to keep the observations following a criteria. The filter () works exactly like select (), you pass the data frame first and then a condition separated by a comma: filter (df, condition) arguments: - df: dataset used to filter the data - condition: Condition used to filter the data. First of all, you can count the

Time-based filtering • tibbletime, filter_time() attempts to make filtering data frames by date much easier than dplyr​::filter() . In dplyr , if you wanted to get the dates for 2013 in the FB dataset, Note that the form of this is slightly different than the standard,  date sessions 1 2014-12-01 1932 2 2014-12-02 1828 3 2014-12-03 2349 4 2014-12-04 8192 5 2014-12-05 3188 6 2014-12-06 3277. And a need to subet/filter this, for example from "2014-12-05" to "2014-12-25". I now that you can create a sequence with the operator ":".

16 Dates and times, This chapter will show you how to work with dates and times in R. At first glance as you might need when filtering date/time data. ymd() is short and unambiguous: the individual components of the date-time spread across multiple columns. I need to filter rows for age below 10 and at the same time above 80. How can I do it in the simplest way? For one condition I can use filter (data.frame, age > 80) but I don't know how to do it for two conditions at the same time? @docendodiscimus, done sir fixed it now. – RavinderSingh13 Jun 26 '18 at 11:46.

How to select a range of dates in R?, I have a dataset that I want to extract certain date ranges to look at the differences monthly/seasonally in R. The data has a column of date values. Q: How to obtain​  Filter to keep last N days. One of the typical ways to filter date and time data is to filter the last ’n’ number of date and time periods. For example, if you are monitoring active users of your product or revenue of your business, you probably want to filter for the last 3 hours, 7 days or 3 months, and so on.

Comments
  • First off, the code from pastebin is giving an error of "Error: unexpected '<' in " -3000L". What did you do in the for loop? This doesn't seem like a problem that needs to be solved with loops. Have you tried using dplyr and just putting in the conditions in a filter? i.e. df %>% filter(time2-time1 <= 1 day & time2 < 12am)?
  • sorry, i reuploaded the file. It was a data.table before. Now i i have uploaded a data.frame. ---- I don't know how to formulate (time2-time1 <= 1 day & time2 < 12am). How would you code the 'time2-time1 <= 1 day' part?
  • @UDE_Student Your data is very big. Why dont you just share dput(head(df,20))? It will be easier for everyone.
  • yes you are right, i updated my initial post with the shorter version: pastebin.com/DFqUuuDp
  • I have added a possible solution in base-R using lubridate. Have a look,
  • Good use of as.IDate! Actually, data provided by OP was in data.frame when I took it. And I wanted to provide a base-R solution so that original row-names in result to give OP an idea which rows were selected.
  • great solution, i will give the accepted answer to this question because it does not use the lubridate package
  • Thank you. I just want to point out that I find lubridate to be a very handy package in many circumstances. However, it is not necessary to use the days() function or lubridate's date arithmetic here because we just need to subtract 1 to get the previous day. Both Date and IDate count days. The difference is that IDate uses integer as storage mode.
  • question regarding DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12]: I try to adjust this in that way, where i am not interested in Time1 being one calendar day before Time2; but Time1 is one OR more calendar days being before Time2. If i use >=, it is technically correct but it gives me more than one result, where i am interested in only "the latest possible single" Time2. Any ideas?
  • ok i solved it with: filtered_dt <- filtered_dt[Time1 > Time2, lapply(.SD, last), by = Time1]
  • Two minor comments. 1) You don't need to use $ in dplyr chain. Perhaps, you should try your solution on data shared by OP and check if it returns valid output.