Find duplicate rows in a data frame based on multiple columns in R

I have a data set that has some instances where, for a given location at the same date and time, the value is different. I am trying to create a subset data frame showing these instances. This is an example of what I mean:

I have looked at similar questions on SO, but I can't seem to get what I want: I keep getting back rows where this isn't the case.

Here's the code I am using:

library(dplyr)

dat1 <- data_concern_join2 %>%
  group_by(locid, stdate, sttime, charnam, valunit) %>%
  filter(n() > 1)

Sample Data:

structure(list(orgid = c("USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", 
"USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", 
"USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ"), locid = c("USGS-01396030", 
"USGS-01396030", "USGS-01389850", "USGS-01389850", "USGS-01391500", 
"USGS-01391500", "USGS-01391500", "USGS-01391500", "USGS-01393960", 
"USGS-01390610", "USGS-01391500", "USGS-01390610", "USGS-01391500", 
"USGS-01378760", "USGS-01378760"), stdate = structure(c(11359, 
11359, 11359, 11359, 16504, 16504, 16112, 16112, 11367, 13860, 
12401, 13860, 16325, 13006, 13006), class = "Date"), sttime = structure(c(34200, 
34200, 36000, 36000, 40500, 40500, 39600, 39600, 36000, 39600, 
32400, 39600, 38400, 36900, 36900), class = c("hms", "difftime"
), units = "secs"), charnam = c("Specific conductance", "Specific conductance", 
"Specific conductance", "Specific conductance", "Specific conductance", 
"Specific conductance", "Specific conductance", "Specific conductance", 
"Specific conductance", "Specific conductance", "Specific conductance", 
"Specific conductance", "Specific conductance", "Specific conductance", 
"Specific conductance"), val = c(7450, 7300, 3210, 3180, 1930, 
1920, 1740, 1650, 1480, 1390, 1380, 1330, 1300, 1280, 1270), 
    valunit = c("uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", 
    "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", 
    "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", 
    "uS/cm @25C"), swqs = c("FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT", 
    "FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT", 
    "FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT", "FW2-NT"), WMA = c(7L, 
    7L, 4L, 4L, 4L, 4L, 4L, 4L, 7L, 4L, 4L, 4L, 4L, 6L, 6L), 
    year = c(2001L, 2001L, 2001L, 2001L, 2015L, 2015L, 2014L, 
    2014L, 2001L, 2007L, 2003L, 2007L, 2014L, 2005L, 2005L), 
    HUC14 = c("HUC02030104050090", "HUC02030104050090", "HUC02030103120050", 
    "HUC02030103120050", "HUC02030103140070", "HUC02030103140070", 
    "HUC02030103140070", "HUC02030103140070", "HUC02030104050010", 
    "HUC02030103140010", "HUC02030103140070", "HUC02030103140010", 
    "HUC02030103140070", "HUC02030103010040", "HUC02030103010040"
    )), .Names = c("orgid", "locid", "stdate", "sttime", "charnam", 
"val", "valunit", "swqs", "WMA", "year", "HUC14"), row.names = c(NA, 
-15L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("locid", 
"stdate", "sttime", "charnam", "valunit"), drop = TRUE, indices = list(
    13:14, 2:3, c(9L, 11L), 10L, 6:7, 12L, 4:5, 8L, 0:1), group_sizes = c(2L, 
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), biggest_group_size = 2L, labels = structure(list(
    locid = c("USGS-01378760", "USGS-01389850", "USGS-01390610", 
    "USGS-01391500", "USGS-01391500", "USGS-01391500", "USGS-01391500", 
    "USGS-01393960", "USGS-01396030"), stdate = structure(c(13006, 
    11359, 13860, 12401, 16112, 16325, 16504, 11367, 11359), class = "Date"), 
    sttime = structure(c(36900, 36000, 39600, 32400, 39600, 38400, 
    40500, 36000, 34200), class = c("hms", "difftime"), units = "secs"), 
    charnam = c("Specific conductance", "Specific conductance", 
    "Specific conductance", "Specific conductance", "Specific conductance", 
    "Specific conductance", "Specific conductance", "Specific conductance", 
    "Specific conductance"), valunit = c("uS/cm @25C", "uS/cm @25C", 
    "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", 
    "uS/cm @25C", "uS/cm @25C")), row.names = c(NA, -9L), class = "data.frame", vars = c("locid", 
"stdate", "sttime", "charnam", "valunit"), drop = TRUE, .Names = c("locid", 
"stdate", "sttime", "charnam", "valunit")))

We can do this with data.table, taking the unique rows by those columns:

library(data.table)

# keep one row per locid/stdate/sttime/charnam/valunit combination
unique(setDT(data_concern_join2),
       by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))
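
The OP actually wants to see the duplicated rows rather than drop them; a minimal data.table sketch for that (same column names assumed) keeps only the groups whose key combination occurs more than once:

# .N is the group size; .SD holds the remaining columns of each group
setDT(data_concern_join2)[, if (.N > 1) .SD,
                          by = .(locid, stdate, sttime, charnam, valunit)]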

R - find and list duplicate rows based on two columns: one option is to use duplicated() twice, the second time with fromLast = TRUE, because on its own it returns TRUE only for the later occurrences of a duplicate. To remove duplicate rows instead, the dplyr function distinct() keeps only the unique/distinct rows of a data frame; if there are duplicate rows, only the first is preserved. It is an efficient version of the base R function unique(). To remove duplicates based on all columns: my_data %>% distinct().
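
As a sketch of that duplicated-twice idea on the sample data above (assuming it is stored in a data frame named dat):

keys <- c("locid", "stdate", "sttime", "charnam", "valunit")
# duplicated() marks the second and later occurrences; with fromLast = TRUE it
# marks all but the last, so the union flags every row of a duplicated group
idx <- duplicated(dat[keys]) | duplicated(dat[keys], fromLast = TRUE)
dat[idx, ]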

dplyr's distinct() is designed for this:

data_concern_join2 %>%
  distinct(locid, stdate, sttime, charnam, valunit, .keep_all = TRUE)

Note that .keep_all = TRUE retains the remaining columns (such as val), but only the first row of each duplicated combination is kept, so this removes duplicates rather than listing them.

Identify and Remove Duplicate Data in R, summary: remove duplicate rows based on one or more column values with my_data %>% dplyr::distinct(Sepal.Length); extract unique elements from vectors and data frames with the base function unique(my_data); determine duplicate elements with the base function duplicated(my_data). (The pandas analogue is DataFrame.duplicated(): called without a subset argument it returns a Boolean series that is True for every duplicated row except its first occurrence, since the default value of the keep argument is 'first'.)
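
A quick sketch of those helpers on the built-in iris data:

library(dplyr)

iris %>% distinct(Sepal.Length, .keep_all = TRUE) # first row per unique Sepal.Length
unique(iris)             # base R: drop rows duplicated across all columns
iris[duplicated(iris), ] # base R: show the duplicated rows themselves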

Does this work for you?

dat1 <- data_concern_join2 %>%
  group_by(locid, stdate, sttime, charnam, valunit) %>%
  mutate(count = 1:n()) %>% # the counter increases within each set of replicated rows
  filter(count == 1)        # keep only the first row of each set and drop the rest

Output:

# A tibble: 9 x 12
# Groups:   locid, stdate, sttime, charnam, valunit [9]
  orgid   locid         stdate     sttime charnam                val valunit    swqs     WMA  year HUC14             count
  <chr>   <chr>         <date>     <time> <chr>                <dbl> <chr>      <chr>  <int> <int> <chr>             <int>
1 USGS-NJ USGS-01396030 2001-02-06 09:30  Specific conductance  7450 uS/cm @25C FW2-NT     7  2001 HUC02030104050090     1
2 USGS-NJ USGS-01389850 2001-02-06 10:00  Specific conductance  3210 uS/cm @25C FW2-NT     4  2001 HUC02030103120050     1
3 USGS-NJ USGS-01391500 2015-03-10 11:15  Specific conductance  1930 uS/cm @25C FW2-NT     4  2015 HUC02030103140070     1
4 USGS-NJ USGS-01391500 2014-02-11 11:00  Specific conductance  1740 uS/cm @25C FW2-NT     4  2014 HUC02030103140070     1
5 USGS-NJ USGS-01393960 2001-02-14 10:00  Specific conductance  1480 uS/cm @25C FW2-NT     7  2001 HUC02030104050010     1
6 USGS-NJ USGS-01390610 2007-12-13 11:00  Specific conductance  1390 uS/cm @25C FW2-NT     4  2007 HUC02030103140010     1
7 USGS-NJ USGS-01391500 2003-12-15 09:00  Specific conductance  1380 uS/cm @25C FW2-NT     4  2003 HUC02030103140070     1
8 USGS-NJ USGS-01391500 2014-09-12 10:40  Specific conductance  1300 uS/cm @25C FW2-NT     4  2014 HUC02030103140070     1
9 USGS-NJ USGS-01378760 2005-08-11 10:15  Specific conductance  1280 uS/cm @25C FW2-NT     6  2005 HUC02030103010040     1
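
The same result can be written without the helper column; a minimal sketch using dplyr's slice():

data_concern_join2 %>%
  group_by(locid, stdate, sttime, charnam, valunit) %>%
  slice(1) %>% # keep only the first row of each group
  ungroup()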

How to remove entire rows of duplicates based on two different columns: to retain unique combinations of only the x and y columns in a data frame, deduplicate on just those two columns rather than on every column (both the distinct() and unique() approaches above accept a column subset).

First, a simpler version of the question: removing entire rows that are duplicated in one column.

library(dplyr)

df <- tibble(x=c(1,1:4), y=letters[1:5])

df
#> # A tibble: 5 x 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     1 b    
#> 3     2 c    
#> 4     3 d    
#> 5     4 e   

df %>% 
  group_by(x) %>%
  mutate(rank=row_number(x)) %>%
  filter(rank==1)
  # optional: %>% ungroup() %>% select(-rank)
#> # A tibble: 4 x 3
#> # Groups:   x [4]
#>       x y      rank
#>       x y      rank
#>   <dbl> <chr> <int>
#> 1     1 a         1
#> 2     2 c         1
#> 3     3 d         1
#> 4     4 e         1

It’s a common pattern in SQL when the database supports window functions, expressed here with dplyr verbs. In Redshift / PostgreSQL it would be (a window function can't sit directly in WHERE, so the row number goes in a subquery):

select * from (select df.*, row_number() over (partition by x) as rn from df) t where rn = 1

So, if you want just the duplicates (rows where x is repeated), replace rank == 1 with rank > 1:

df %>% 
  group_by(x) %>%
  mutate(rank=row_number(x)) %>%
  filter(rank>1)

#> # A tibble: 1 x 3
#> # Groups:   x [1]
#>       x y      rank
#>   <dbl> <chr> <int>
#> 1     1 b         2

The above doesn't show the original row per x, just its duplicates. If you want all rows, then I think the other answers cover it, but my version would be:

x_dupes <- 
  df %>% 
  group_by(x) %>%
  summarise(n=n()) %>%
  filter(n>1) 

df %>% semi_join(x_dupes, by = "x")

#> # A tibble: 2 x 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     1 b  

Finally, you may also care about the order of the rows, since it determines which row counts as the duplicate. In this example, I keep the alphabetically last value of y for each duplicated x.

df %>% 
  group_by(x) %>%
  arrange(desc(y)) %>%
  mutate(rank=row_number(x)) %>%
  filter(rank==1)

#> # A tibble: 4 x 3
#> # Groups:   x [4]
#>       x y      rank
#>   <dbl> <chr> <int>
#> 1     4 e         1
#> 2     3 d         1
#> 3     2 c         1
#> 4     1 b         1

Note that, compared to the first example above, y was a and is now b in the row kept for the duplicated x. The row order is also reversed.
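
With dplyr 1.0 or later, the arrange-then-rank pattern can also be written with the slice helpers; a sketch:

df %>%
  group_by(x) %>%
  slice_max(y, n = 1, with_ties = FALSE) %>% # alphabetically last y per x
  ungroup()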

Finding duplicates in a data frame across columns: for example, row 2 has two duplicated values (2x value 4 and 2x value 7), and row 3 has one value duplicated three times (3x value 6).
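
A sketch of counting duplicated values within each row (the data frame m is made up to match that description):

m <- data.frame(a = c(1, 4, 6), b = c(2, 4, 6), c = c(3, 7, 6), d = c(5, 7, 9))
# within a row, duplicated() marks every repeat after the first occurrence
apply(m, 1, function(r) sum(duplicated(r)))
#> [1] 0 2 2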

Determine Duplicate Rows: unique() returns a data.table with duplicated rows removed. From v1.9.4, data.table also provides an anyDuplicated method, similar to the base function, which returns the index of the first duplicated row (or 0 if there is none). If your data frame is called data and the columns are called x, y, z, and so on, you can remove entire rows of duplicates based on any combination of columns using the distinct() function in dplyr.

duplicated: Determine Duplicate Rows in data.table: unique() returns a data.table with duplicated rows removed, by the key columns when the table is keyed; by = NULL uses all columns and acts like the analogous data.frame method. From v1.9.4, both the duplicated and unique methods also gain the logical argument fromLast, as in base R. duplicated() determines which rows are duplicates of rows with smaller subscripts and returns a logical vector, which we can then use to extract a subset of dat:

ind <- duplicated(dat[, 1:2])
dat[ind, ]

Comparing data frames (joining the data frames; finding duplicated rows; finding unique rows; splitting apart the data frame; ignoring columns): you may want to compare two or more data frames and find rows that appear in more than one of them, or rows that appear in only one.
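
A minimal sketch of that comparison with dplyr's set operations (df1 and df2 are hypothetical data frames with identical columns):

library(dplyr)

dplyr::intersect(df1, df2) # rows appearing in both data frames
dplyr::setdiff(df1, df2)   # rows appearing only in df1
dplyr::union(df1, df2)     # all distinct rows across both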

Comments
  • I'm still getting instances where this is not met. I only want the instances where locid, stdate, sttime, charnam, and valunit are the same and val is the only column that differs.
  • @KWANGER Based on the description, the code should work unless there is some issue in your data. The filter(n() > 1) keeps the rows that are duplicated on the grouping variables; if that is not what you get, something else is wrong.
  • The code works, but I'm still getting the instances where there is only one measurement. I want to create a data frame like the image added in the question, which shows both rows of each pair that agree on all those columns except the value.
  • @KWANGER From your comments, it seems like data_concern_join2 %>% group_by(locid, stdate, sttime, charnam, valunit) %>% filter(n_distinct(val) == n())
  • I'm still getting instances where this is not met. I only want the instances where locid, stdate, sttime, charnam, and valunit are the same and val is the only column that differs. I want the data frame to look like the image I added in the question.
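
The suggestion in the last @KWANGER comment still lets single-row groups through, because a group of one trivially satisfies n_distinct(val) == n(). Adding n() > 1 restricts the result to genuinely duplicated keys; a sketch:

library(dplyr)

data_concern_join2 %>%
  group_by(locid, stdate, sttime, charnam, valunit) %>%
  filter(n() > 1, n_distinct(val) > 1) %>% # duplicated keys whose values differ
  ungroup()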