Need a new method for identifying consecutive observations by specified intervals in R

I have something like the following:

  ID start value want
1  1   1.4   1.5    3
2  1   1.4   3.3    3
3  1   1.4   4.2    3
4  2   3.4   5.5    2
5  2   3.4   6.5    2
6  2   3.4   6.7    2
7  2   3.4   6.9    2

I want to count whether consecutive value observations fall one interval after another, where the intervals are start + 1, start + 2, and so on, within each ID. Essentially I am ONLY interested in whether an observation lands in the NEXT interval right after the interval of the observation before it, regardless of which interval that happens to be.

For instance, I was able to solve this by creating the interval columns (one after start, two after start, etc.), then using ifelse statements to flag whether value fell within each of those intervals (the within_ variables as 1/0), and then returning the maximum sum of any possible left-to-right diagonal (want; this gives the maximum number of times an observation happened one interval after another), like so:

  ID start one_after two_after three_after four_after value want within_start_one within_one_two within_two_three within_three_four
1  1   1.4       2.4       3.4         4.4        5.4   1.5    3                1              0                0                 0
2  1   1.4       2.4       3.4         4.4        5.4   3.3    3                0              1                0                 0
3  1   1.4       2.4       3.4         4.4        5.4   4.2    3                0              0                1                 0
4  2   3.4       4.4       5.4         6.4        7.4   5.5    2                0              0                1                 0
5  2   3.4       4.4       5.4         6.4        7.4   6.5    2                0              0                0                 1
6  2   3.4       4.4       5.4         6.4        7.4   6.7    2                0              0                0                 1
7  2   3.4       4.4       5.4         6.4        7.4   6.9    2                0              0                0                 1

BUT the problem is that I have a ton of data, and it simply doesn't run. I am all out of ideas and would appreciate the help.

Data to play around with:

df <- data.frame(ID    = c(1, 1, 1, 2, 2, 2, 2),
                 start = c(1.4, 1.4, 1.4, 3.4, 3.4, 3.4, 3.4),
                 value = c(1.5, 3.3, 4.2, 5.5, 6.5, 6.7, 6.9),
                 want  = c(3, 3, 3, 2, 2, 2, 2))

What I've tried and works somewhat for small data (but unfortunately sums the runs of diagonals rather than returning the max run of diagonals):

df1 <- data.frame(ID    = c(1, 1, 1, 2, 2, 2, 2),
                  start = c(1.4, 1.4, 1.4, 3.4, 3.4, 3.4, 3.4),
                  value = c(1.5, 3.3, 4.2, 5.5, 6.5, 6.7, 6.9))



library(tidyverse)

# cut points start + 1 .. start + 4, named so the ifelse() flags below can
# refer to them directly
df1 <- map2(1:4, c("intervals", "intervals1", "intervals2", "intervals3"),
            ~ df1 %>% transmute(!!.y := start + .x)) %>%
  bind_cols(df1, .)


df <- df1 %>%
  mutate(within_start_one  = ifelse(value >= start      & value < intervals,  1, 0),
         within_one_two    = ifelse(value >= intervals  & value < intervals1, 1, 0),
         within_two_three  = ifelse(value >= intervals1 & value < intervals2, 1, 0),
         within_three_four = ifelse(value >= intervals2 & value < intervals3, 1, 0))

# largest sum of 1s along any top-left to bottom-right diagonal
# (note: this sums across breaks instead of returning the longest unbroken run)
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))

df_results <- df %>% select(ID, starts_with("within"))
View(df_results)

df_results1 <- merge(df_results,
                     stack(by(df_results[-1], df_results$ID, max_diag)),
                     by.x = "ID", by.y = "ind")
df_results1
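
To make the diagonal idea concrete, here is a minimal illustration (not part of the original attempt) of what max_diag() computes for the ID 1 rows of the within_ matrix above:

m <- rbind(c(1, 0, 0, 0),   # value 1.5 falls in the first interval
           c(0, 1, 0, 0),   # value 3.3 falls in the second
           c(0, 0, 1, 0))   # value 4.2 falls in the third

# rows that land in successive intervals line up on one top-left to
# bottom-right diagonal, so the largest diagonal sum is the chain length
max(sapply(split(m, row(m) - col(m)), sum))
#> [1] 3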

Is this what you want? How about this?

library(tidyverse)

df <- tibble(ID = c(1,1,1,2,2,2,2),
             start = c(1.4,1.4,1.4,3.4,3.4,3.4,3.4),
             value = c(1.5,3.3,4.2,5.5,6.5,6.7,6.9),
             want = c(3,3,3,2,2,2,2))

df %>%
    group_by(ID) %>%
    mutate(
        interval = floor(value - start) + 1,
        consecutive = interval == lag(interval) + 1,
        consecutive = if_else(is.na(consecutive), lead(consecutive), consecutive),
        cumulated = sum(consecutive)
        )
#> # A tibble: 7 x 7
#> # Groups:   ID [2]
#>      ID start value  want interval consecutive cumulated
#>   <dbl> <dbl> <dbl> <dbl>    <dbl> <lgl>           <int>
#> 1     1   1.4   1.5     3        1 TRUE                3
#> 2     1   1.4   3.3     3        2 TRUE                3
#> 3     1   1.4   4.2     3        3 TRUE                3
#> 4     2   3.4   5.5     2        3 TRUE                2
#> 5     2   3.4   6.5     2        4 TRUE                2
#> 6     2   3.4   6.7     2        4 FALSE               2
#> 7     2   3.4   6.9     2        4 FALSE               2

Created on 2020-01-08 by the reprex package (v0.3.0)
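
The comments below ask for the longest run of consecutive TRUEs rather than their sum. A hedged sketch of that variant, reusing df and the tidyverse attached above (not part of the original answer; longest_true_run() is a made-up helper built on rle()):

# longest_true_run(): length of the longest unbroken run of TRUEs,
# with NA treated as a break
longest_true_run <- function(x) {
  x[is.na(x)] <- FALSE
  r <- rle(x)
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}

df %>%
    group_by(ID) %>%
    mutate(
        interval = floor(value - start) + 1,
        consecutive = interval == lag(interval) + 1,
        consecutive = if_else(is.na(consecutive), lead(consecutive), consecutive),
        cumulated = longest_true_run(consecutive)   # longest run instead of sum()
        )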

An option is to reset value relative to start (so that the new start is 0 for every ID), then count the number of consecutive intervals. Here is an implementation of this idea using data.table:

DT[, want := {
    d <- trunc(value - start)               # whole intervals past start
    r <- rle(cumsum(c(0L, diff(d)!=1L)))    # group rows into unbroken chains
    max(r$lengths)                          # length of the longest chain
}, ID][
    want==1L, want:=0L]                     # a lone row does not count as a chain
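
To see what the steps inside the j-expression do, here is a small base R illustration (not part of the answer) using the ID 3 rows from the data further down:

start <- 1
value <- c(1.5, 2.5, 6.5, 7.5)

d <- trunc(value - start)            # whole intervals past start: 0 1 5 6
grp <- cumsum(c(0L, diff(d) != 1L))  # new chain whenever the index does not
                                     # increase by exactly 1:      0 0 1 1
r <- rle(grp)                        # lengths of the chains:      2 2
max(r$lengths)                       # longest chain: 2, the want for ID 3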

Another faster implementation of the above:

# rr = row position within each unbroken chain (rleid() over ID and the chain marker)
DT[, rr := rowid(rleid(ID, cumsum(c(0L, diff(trunc(value - start))!=1L))))][,
    want := max(rr), ID][               # longest chain per ID
        want==1L, want:=0L]             # a lone row does not count as a chain
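
Continuing the same ID 3 illustration: rleid() gives every unbroken chain its own id and rowid() numbers the rows within each chain, so max(rr) per ID is the length of the longest chain. The speed-up presumably comes from doing this bookkeeping in one vectorised pass over the whole table rather than an rle() call per ID.

library(data.table)

d <- trunc(c(1.5, 2.5, 6.5, 7.5) - 1)     # ID 3 again: 0 1 5 6
g <- rleid(cumsum(c(0L, diff(d) != 1L)))  # chain id:   1 1 2 2
rowid(g)                                  # position within chain: 1 2 1 2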

output:

    ID start value want
 1:  1   1.4   1.5    3
 2:  1   1.4   3.3    3
 3:  1   1.4   4.2    3
 4:  2   3.4   5.5    2
 5:  2   3.4   6.5    2
 6:  2   3.4   6.7    2
 7:  2   3.4   6.9    2
 8:  3   1.0   1.5    2
 9:  3   1.0   2.5    2
10:  3   1.0   6.5    2
11:  3   1.0   7.5    2
12:  4   1.0   1.5    0

data:

library(data.table)
DT <- data.table(ID=c(1,1,1, 2,2,2,2, 3,3,3,3, 4),
    start=c(1.4,1.4,1.4, 3.4,3.4,3.4,3.4, 1,1,1,1, 1),
    value=c(1.5,3.3,4.2, 5.5,6.5,6.7,6.9, 1.5,2.5,6.5,7.5, 1.5))

timings:

set.seed(0L)
nr <- 1e6
nid <- nr/4
DT <- data.table(ID=sample(nid, nr, TRUE))[,
    c("start", "value") := .(runif(1L, 0, 5), runif(.N, 5, 10)),
    ID]
setorder(DT, ID, start, value)
DT2 <- copy(DT)    # separate copy used by the second timing below

system.time({
    DT[, d := trunc(value - start)][, want := {
            r <- rle(cumsum(c(0L, diff(d)!=1L)))
            max(r$lengths)
        }, ID][
            want==1L, want:=0L]
})
#   user  system elapsed 
#   6.80    0.03    6.85 

system.time({
    DT2[, rr := rowid(rleid(ID, cumsum(c(0L, diff(trunc(value - start))!=1L))))][,
        want := max(rr), ID][
            want==1L, want:=0L]
})
#   user  system elapsed 
#   0.22    0.03    0.24 

Maybe something like this:

library(tidyverse)

df_example <- data.table::fread("ID start value want
1  1   1.4   1.5    3
2  1   1.4   3.3    3
3  1   1.4   4.2    3
4  2   3.4   5.5    2
5  2   3.4   6.5    2
6  2   3.4   6.7    2
7  2   3.4   6.9    2")
#> Warning in data.table::fread("ID start value want\n1 1 1.4 1.5 3\n2 1 1.4 3.3
#> 3\n3 1 1.4 4.2 3\n4 2 3.4 5.5 2\n5 2 3.4 6.5 2\n6 2 3.4 6.7 2\n7 2 3.4 6.9 2"):
#> Detected 4 column names but the data has 5 columns (i.e. invalid file). Added 1
#> extra default column name for the first column which is guessed to be row names
#> or an index. Use setnames() afterwards if this guess is not correct, or fix the
#> file write command that created the file to create a valid file.

df_example %>% 
  select(-V1) %>% 
  as.data.frame() %>% 
  dput()
#> structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L), start = c(1.4, 
#> 1.4, 1.4, 3.4, 3.4, 3.4, 3.4), value = c(1.5, 3.3, 4.2, 5.5, 
#> 6.5, 6.7, 6.9), want = c(3L, 3L, 3L, 2L, 2L, 2L, 2L)), row.names = c(NA, 
#> -7L), class = "data.frame")

df_example <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
                             start = c(1.4, 1.4, 1.4, 3.4, 3.4, 3.4, 3.4),
                             value = c(1.5, 3.3, 4.2, 5.5, 6.5, 6.7, 6.9),
                             want = c(3L, 3L, 3L, 2L, 2L, 2L, 2L)),
                        row.names = c(NA, -7L), class = "data.frame")

df_example %>%
  group_by(ID) %>% 
  mutate(row_numb = row_number(),
         # if rows land in consecutive intervals, value - start grows by
         # roughly one per row, so this difference stays close to zero
         current  = value - start - row_numb,
         sum_if   = if_else(current < 1 & current > -1, 1, 0)) %>%
  mutate(want2 = sum(sum_if)) %>%
  select(-sum_if, -current, -row_numb)
#> # A tibble: 7 x 5
#> # Groups:   ID [2]
#>      ID start value  want want2
#>   <int> <dbl> <dbl> <int> <dbl>
#> 1     1   1.4   1.5     3     3
#> 2     1   1.4   3.3     3     3
#> 3     1   1.4   4.2     3     3
#> 4     2   3.4   5.5     2     2
#> 5     2   3.4   6.5     2     2
#> 6     2   3.4   6.7     2     2
#> 7     2   3.4   6.9     2     2

Created on 2020-01-07 by the reprex package (v0.3.0)

Comments
  • 6.5, 6.7 and 6.9 are all between 6.4 and 7.4, therefore within_three_four (the interval I created) == 1 for all three of those bottom rows. Then, the two consecutive values would be 5.5 to any of these (as they fall in the same interval), so want = 2 (the max number of consecutive observations, or the sum of the diagonal going from 5.5 to 6.5). It is just how I conceptualized the data, but other methods are welcome and desperately needed, as this method takes a ton of memory.
  • I have added the code. No, basically for each row I want to ask: "value, do you fall between any of these intervals, starting from start + 1 and so on? If yes, great, you get a 1 for the interval you fall in." Then the same question for the next row. THEN, if those 1's are diagonal from each other, it means those two rows happened one after another. This is what I'm interested in. Hope that helps
  • but again, this method is not practical and takes a ton of memory, I'll see if the answers coming in work
  • only dplyr? or can we use data.table?
  • by all means use anything! I just need it to run and be valid
  • close but not quite, ID==2 here should have 2 cumulated (row 4 then row 5 is one interval after another)
  • currently running it, can you explain briefly the rationale for the interval variable you're making?
  • interval is the index of the interval after start that the value falls into. So 1.5 falls into the first interval after 1.4, 3.3 falls into the second interval after 1.4.
  • I'm running this and it appears to be good... but is there any way I can return the max number of consecutive TRUEs in a row rather than the sum? For example, if I had TRUE TRUE FALSE TRUE, I want to return 2 rather than 3
  • ironically I found a bug in the code I used for small amounts of data: it was summing across diagonals (i.e. 11 straight, a break, and then 15 straight were being added together as 26). I got the max number of straight TRUEs from your post and it was more accurate than my original... thanks a bunch
  • this is an excellent solution but I still don't have the memory to run it.
  • >500 thousand rows
  • for simplicity I made start a numeric in this data when in reality it is a date var. I'm converting it to numeric and seeing if that helps
  • I think what you want then is just changing from sum to max(r$lengths)
  • actually it looks like your edit did the trick! And it's incredibly fast, I was shocked with the run time