Counting unequal elements in-between equal elements in R df column

I'm quite new to R and while I have done some data wrangling with it, I am completely at a loss on how to tackle this problem. Google and SO search didn't get me anywhere so far. Should this be a duplicate, I'm sorry, then please point me to the right solution.

I have a data frame with two columns, id and seq, like so:

set.seed(12)
id <- rep(c(1:2),10)
seq<-sample(c(1:4),20,replace=T)
df <- data.frame(id,seq)
df <- df[order(df$id),]

    id seq  
 1   1   1
 3   1   4
 5   1   1
 7   1   1
 9   1   1
 11  1   2
 13  1   2
 15  1   2
 17  1   2
 19  1   3
 2   2   4
 4   2   2
 6   2   1
 8   2   3
 10  2   1
 12  2   4
 14  2   2
 16  2   2
 18  2   3
 20  2   1

I need to count the number of unequal elements in between equal elements in the seq column, e.g. how many elements lie between 1 and 1, or between 3 and 3, etc. The first instance of an element should be NA because there is no earlier equal element to count from. If the next element is identical, it should just be coded 0, as there is no unequal element in between, e.g. 1 and 1. The results should be written to a new column, e.g. delay.

One catch is that this process would have to start again once a new id starts in the id column (here: 1 & 2).

This is what I would love to have as output:

     id seq   delay 
 1   1   1     NA
 3   1   4     NA
 5   1   1     1
 7   1   1     0
 9   1   1     0
 11  1   2     NA
 13  1   2     0
 15  1   2     0
 17  1   2     0
 19  1   3     NA
 2   2   4     NA
 4   2   2     NA
 6   2   1     NA
 8   2   3     NA
 10  2   1     1
 12  2   4     4
 14  2   2     4
 16  2   2     0
 18  2   3     4
 20  2   1     4

I really hope someone might be able to help me figure this out and allow me to learn more about this.

Here is a possibility using a custom function within a dplyr chain:

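my.function <- function(x) {
    # Pre-allocate the result; NA marks "no earlier equal element"
    ret <- rep(NA, length(x))
    # seq_along(x)[-1] is empty for a length-1 input, unlike 2:length(x)
    for (i in seq_along(x)[-1]) {
        # Walk backwards until the nearest earlier equal element is found
        for (j in (i-1):1) {
            if (x[j] == x[i]) {
                ret[i] <- i - j - 1
                break
            }
        }
    }
    return(ret)
}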

library(dplyr)
df %>%
    group_by(id) %>%
    mutate(delay = my.function(seq))
## A tibble: 20 x 3
## Groups:   id [2]
#      id   seq delay
#   <int> <int> <dbl>
# 1     1     1   NA
# 2     1     4   NA
# 3     1     1    1.
# 4     1     1    0.
# 5     1     1    0.
# 6     1     2   NA
# 7     1     2    0.
# 8     1     2    0.
# 9     1     2    0.
#10     1     3   NA
#11     2     4   NA
#12     2     2   NA
#13     2     1   NA
#14     2     3   NA
#15     2     1    1.
#16     2     4    4.
#17     2     2    4.
#18     2     2    0.
#19     2     3    4.
#20     2     1    4.    

Some further explanations:

  1. We group rows by id and then apply my.function to entries in column seq. This ensures that we treat rows with different ids separately.

  2. my.function takes a vector of numeric entries, checks for previous equal entries, and returns the distance between the current and previous equal entry minus one (i.e. it counts the number of elements in between).

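  3. my.function uses two for loops, but this should be reasonably fast: we don't dynamically grow any vectors (ret is pre-allocated at the beginning of my.function), and we break the inner loop as soon as we encounter an equal element. In the worst case (no repeated values within an id) the inner loop still scans every earlier element, so the cost grows quadratically with the group size.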
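
To make points 1 and 2 concrete, here is a minimal base R sketch on a toy data frame. The names toy and count_between are hypothetical and not part of the answer above; count_between restates the same "distance to the previous equal entry, minus one" logic, and ave() stands in for group_by() + mutate() so that the count restarts for each id.

```r
# Hypothetical helper: for each element, count the elements that lie between
# it and the nearest earlier equal element (NA if there is none).
count_between <- function(x) {
  ret <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    # positions of earlier elements equal to x[i]; empty for i == 1
    prev <- which(x[seq_len(i - 1)] == x[i])
    if (length(prev) > 0) ret[i] <- i - max(prev) - 1
  }
  ret
}

# Toy data (an assumption for illustration): two ids, same column names as df
toy <- data.frame(id  = c(1, 1, 1, 1, 2, 2, 2),
                  seq = c(1, 4, 1, 1, 1, 4, 1))

# ave() applies count_between separately within each id
toy$delay <- ave(toy$seq, toy$id, FUN = count_between)
toy$delay
# NA NA  1  0 NA NA  1
```

The fifth value is NA even though seq is 1 again, because that row starts a new id and the search does not look back into the previous group.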

A simple dplyr solution:

library(dplyr)
df %>%
  mutate(row = 1:n()) %>%
  group_by(id, seq) %>%
  mutate(delay = row - lag(row) - 1) %>%
  select(-row)
# # A tibble: 20 x 3
# # Groups:   id, seq [8]
#       id   seq delay
#    <int> <int> <dbl>
#  1     1     1    NA
#  2     1     4    NA
#  3     1     1     1
#  4     1     1     0
#  5     1     1     0
#  6     1     2    NA
#  7     1     2     0
#  8     1     2     0
#  9     1     2     0
# 10     1     3    NA
# 11     2     4    NA
# 12     2     2    NA
# 13     2     1    NA
# 14     2     3    NA
# 15     2     1     1
# 16     2     4     4
# 17     2     2     4
# 18     2     2     0
# 19     2     3     4
# 20     2     1     4
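
The reason this works can be seen on a single short vector: within one value of seq, the gap between two consecutive occurrences is the difference of their row positions minus one. The following base R sketch is only an illustration of that idea (x, pos, gaps and delay are hypothetical names, and diff() plays the role that row - lag(row) plays in the dplyr chain).

```r
x <- c(1, 4, 1, 1, 2)

# Row positions at which the value 1 occurs
pos <- which(x == 1)          # 1 3 4

# Gap between consecutive occurrences; the first occurrence has no predecessor
gaps <- c(NA, diff(pos) - 1)  # NA 1 0

# The same computation for every distinct value at once:
# ave() splits the positions by value and writes the results back in place
delay <- ave(seq_along(x), x, FUN = function(p) c(NA, diff(p) - 1))
delay
# NA NA  1  0 NA
```

Grouping by id as well as seq, as the dplyr chain does, simply keeps occurrences in different ids from being paired with each other.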

Try:

set.seed(12)
id <- rep(c(1:2),10)
seq<-sample(c(1:4),20,replace=T)
df <- data.frame(id,seq)
df <- df[order(df$id),]
df

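get_lead <- function(x) {
  x <- as.character(x)   # values are used as names in the lookup list
  l <- list()            # last position seen so far for each value
  res <- rep(NA, length(x))
  for (i in seq_along(x)) {
    if (!is.null(l[[x[i]]])) {
      # number of elements between the previous occurrence and this one
      res[i] <- i - l[[x[i]]] - 1
    }
    l[[x[i]]] <- i
  }
  res
}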
df$delay <- unlist(lapply(split(df$seq, df$id), get_lead))
df  

# id seq delay
#1   1   1    NA
#3   1   4    NA
#5   1   1     1
#7   1   1     0
#9   1   1     0
#11  1   2    NA
#13  1   2     0
#15  1   2     0
#17  1   2     0
#19  1   3    NA
#2   2   4    NA
#4   2   2    NA
#6   2   1    NA
#8   2   3    NA
#10  2   1     1
#12  2   4     4
#14  2   2     4
#16  2   2     0
#18  2   3     4
#20  2   1     4

Here is an approach:

  1. Write a function that finds the first row of the current id.

  2. Write a function that counts the elements differing from the current value since its most recent earlier occurrence within that id.

  3. Apply that function to every row and assign the result to a new column, delay.

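# Index of the first row that has the same id as row j
Indstart <- function(j, df) {
  min(which(df[1:j, 1] == df[j, 1]))
}

# Number of unequal elements between row j and the previous equal element
difval <- function(j, df) {
  i <- Indstart(j, df)
  # The first row of an id has nothing before it (and i:(j-1) would count down)
  if (j == i) return(NA_integer_)
  prev <- which(df[i:(j - 1), 2] == df[j, 2])
  if (length(prev) == 0) return(NA_integer_)   # value not seen before in this id
  pos_j_pr <- max(prev) + i - 1                # row of the previous occurrence
  sum(df[pos_j_pr:j, 2] != df[j, 2])
}

# Use a real NA (not the string "NA") so that delay stays numeric
df$delay <- NA_integer_
for (j in seq_len(nrow(df))) {
  df$delay[j] <- difval(j, df)
}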

Comments
  • here's an uncreative base translation: df$delay <- ave(df$seq, df$id,FUN= function(x) ave(seq_along(x), x, FUN = function(y) y - c(NA, y[-length(y)]) -1))
  • Thank you for your solution! This also worked, but as a relative beginner in R the first solution was a bit easier to understand. Thus, I accepted the first as my answer, though your function worked just as well :-)