Removal of adjacent duplicates by row - [R]


I have a data frame where each row represents interaction data per person.

actions = read.table('C:/Users/Desktop/actions.csv', header = F, sep = ',', na.strings = '', stringsAsFactors = F)

Each person can have one or more of the following interactions:

eat, sleep, walk, jump, hop, wake, run

The number of actions recorded for each person may differ, as below:

P1: eat,  sleep, sleep, sleep
P2: wake, walk,  eat,   walk, walk, jump, jump, run, run
P3: wake, eat,   walk,  jump, run,  sleep

To make the lengths equal, I have NA padding at the end:

P1: eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
P2: wake, walk,  eat,   walk,  walk, jump,  jump, run, run
P3: wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA

Now I need to update each person's entries (row-wise) so that no two consecutive entries are duplicates. It is very important to maintain the order. My required output is:

P1: eat,  sleep, NA,   NA,   NA,   NA,    NA,   NA,  NA
P2: wake, walk,  eat,  walk, jump, run,   NA,   NA,  NA 
P3: wake, eat,   walk, jump, run,  sleep, NA,   NA,  NA

The column names are by default V1, V2, V3 .... Vn where

n = maximum length of interactions string 

In the above example P2 has maximum length; so n = 9. So total columns in the above example are from V1-V9.

The output of dput(actions) is:

structure(list(V1 = c("S", "C", "R"), V2 = c("C", "C", "R"), 
V3 = c("R", "C", "R"), V4 = c("S", NA, "R"), V5 = c("C", 
NA, "R"), V6 = c("R", NA, NA), V7 = c("S", NA, NA), V8 = c("C", 
NA, NA), V9 = c("R", NA, NA)), class = "data.frame", row.names = c(NA,-3L))

The following question, Removing Only Adjacent Duplicates in Data Frame in R, is a bit similar to mine; however, there are several differences. I am unable to solve my problem even by incorporating the code from that question.

Any suggestions on this would be highly appreciated!


Here's a simple way using base R. I have created a function that replaces consecutive duplicates with NA and then reorders the row so that the NAs come last:

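# Function to blank out consecutive duplicates and move the resulting NAs to the end
ccd <- function(x) {
  # The first value can never be a duplicate, so its flag starts at 0.
  # anyDuplicated() returns 2 for a pair of equal neighbours and 0 otherwise.
  test <- c(0, vapply(seq_len(length(x) - 1),
                      function(i) anyDuplicated(x[i:(i + 1)]), integer(1)))
  x[test > 0] <- NA_character_
  # order() keeps ties in their original order, so kept values stay in sequence
  x[order(test)]
}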

# Original df from dput
> df
  V1 V2 V3   V4   V5   V6   V7   V8   V9
1  S  C  R    S    C    R    S    C    R
2  C  C  C <NA> <NA> <NA> <NA> <NA> <NA>
3  R  R  R    R    R <NA> <NA> <NA> <NA>

for(r in 1:nrow(df)) {
  df[r, ] <- ccd(as.character(df[r, ]))
}

> df
  V1   V2   V3   V4   V5   V6   V7   V8   V9
1  S    C    R    S    C    R    S    C    R
2  C <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3  R <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
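
To make the flagging step easier to follow, here is a small trace of the same idea on a single vector. This is only an illustrative sketch: the names x and flag are made up for the example, and it assumes character input as in the question.

```r
# One row of actions, with NA padding at the end
x <- c("eat", "sleep", "sleep", "sleep", NA, NA)

# flag[i] is 2 when x[i] equals x[i - 1], otherwise 0.
# anyDuplicated() treats NA as equal to NA, so repeated padding is flagged too.
flag <- c(0, vapply(seq_len(length(x) - 1),
                    function(i) anyDuplicated(x[i:(i + 1)]), integer(1)))
flag
# 0 0 2 2 0 2

# Blank the flagged positions, then move them to the end.
# order() leaves ties in their original order, so "eat", "sleep" stay in sequence.
x[flag > 0] <- NA_character_
result <- x[order(flag)]
result
# "eat" "sleep" NA NA NA NA
```

Because every repeat in a run is flagged against its immediate neighbour, a run of three or more identical values collapses to a single entry.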

For the example shown in the question:

df <- read.csv(
text=gsub(" +", "", "P1, eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
P2, wake, walk,  eat,   walk,  walk, jump,  jump, run, run
                         P3, wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA"), 
               header = FALSE, stringsAsFactors = FALSE)[, -1]

> df
    V2    V3    V4    V5   V6    V7   V8   V9  V10
1  eat sleep sleep sleep <NA>  <NA> <NA> <NA> <NA>
2 wake  walk   eat  walk walk  jump jump  run  run
3 wake   eat  walk  jump  run sleep <NA> <NA> <NA>

for(r in 1:nrow(df)) {
  df[r, ] <- ccd(as.character(df[r, ]))
}

> df
    V2    V3   V4   V5   V6    V7   V8   V9  V10
1  eat sleep <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
2 wake  walk  eat walk jump   run <NA> <NA> <NA>
3 wake   eat walk jump  run sleep <NA> <NA> <NA>
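
As a possible alternative to the pairwise check, base R's rle() (run-length encoding) returns one value per run of identical neighbours, which is the collapsing asked for here. The sketch below is not part of the original answer; collapse_runs is a hypothetical helper name, and it assumes NA appears only as trailing padding (an NA in the middle of a row would be dropped and its neighbours joined).

```r
# Sample data mirroring the question's example (rows P1 to P3)
df <- read.csv(text = "eat,sleep,sleep,sleep,NA,NA,NA,NA,NA
wake,walk,eat,walk,walk,jump,jump,run,run
wake,eat,walk,jump,run,sleep,NA,NA,NA",
               header = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: keep one value per run, then pad back to full length
collapse_runs <- function(x) {
  v <- unname(rle(x)$values)  # one entry per run; rle() treats every NA as its own run
  v <- v[!is.na(v)]           # drop the NA padding
  length(v) <- length(x)      # extending a vector pads it with NA
  v
}

# apply() works row by row on the character matrix; t() restores row orientation
res <- as.data.frame(t(apply(df, 1, collapse_runs)), stringsAsFactors = FALSE)
names(res) <- names(df)
res
```

Runs of any length collapse to one entry, so 'sleep sleep sleep sleep' becomes a single 'sleep'.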


A combination of dplyr, tidyr (for gather()), reshape2 (for dcast()) and base R. First, it identifies the consecutive duplicates and replaces them with NA. Then it shifts the non-NA values to the left.

as.data.frame(t(apply(df %>%
          gather(var, val, -V1) %>% 
          group_by(V1) %>% 
          mutate(val2 = ifelse(val == lag(val), NA, val),
                 val2 = ifelse(var == "V2", paste(val), val2)) %>% 
          dcast(V1~var, value.var = "val2"), 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))

  V1   V2    V3   V4   V5   V6    V7   V8   V9  V10
1 P1  eat sleep <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
2 P2 wake  walk  eat walk jump   run <NA> <NA> <NA>
3 P3 wake   eat walk jump  run sleep <NA> <NA> <NA>

Data (using the code from @Shree):

df <- read.csv(text = gsub(" +", "", "P1, eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
            P2, wake, walk,  eat,   walk,  walk, jump,  jump, run, run
            P3, wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA"), 
               header = FALSE, stringsAsFactors = FALSE)
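
The two steps described in this answer (mark each value that repeats its predecessor as NA, then shift the non-NA values to the left) can also be traced on a single row in base R. This is a simplified sketch of the idea rather than the pipeline above; prev, is_repeat and shifted are names invented for the illustration.

```r
# One row from the question's example (P2)
x <- c("wake", "walk", "eat", "walk", "walk", "jump", "jump", "run", "run")

# Step 1: compare each value with the one before it (a manual lag)
prev <- c(NA, head(x, -1))
is_repeat <- !is.na(x) & !is.na(prev) & x == prev
x[is_repeat] <- NA

# Step 2: shift the surviving values left, NAs to the right
shifted <- c(x[!is.na(x)], x[is.na(x)])
shifted
# "wake" "walk" "eat" "walk" "jump" "run" NA NA NA
```

The non-adjacent second "walk" survives because it is compared only with its immediate predecessor, "eat".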


Comments
  • Can you post sample data in dput format? Please edit the question with the output of dput(df). Or, if it is too big with the output of dput(head(df, 20)). (df is the name of your dataset.)
  • Contributors: the "two consecutive" appears to be paramount.
  • @hrbrmstr Your solution has disappeared. It seemed to be leading me in the right direction...
  • it's not, though. you'll get errant results with it since it doesn't handle the "only two consecutive" properly
  • I think I did not convey properly: sleep sleep sleep sleep should result in 'sleep'; not 'sleep sleep' as I do not want duplicates.
  • didn't see the "V" note in your edited question before finishing
  • Thanks, the solution is mostly working and it is amazing!! Just a little trouble: I think if there are 3 or more duplicate entries in a row, the code produces 2 instances in the output. An entry which had 'sleep sleep sleep NA NA NA NA NA NA' was reduced to 'sleep sleep NA NA NA NA NA NA NA', and in another case 'walk walk walk walk walk NA NA NA NA' came down to 'walk walk NA NA NA NA NA NA NA'.
  • Oh. So the "two consecutive" is paramount. i.e. it's not just "all consecutive"?
  • Thanks! Your code works fine except at the first occurrence at the start of a line.
  • lemme poke at that. i thought that might be an edge case