How to find duplicated rows within a group and then select all rows of the matching group members

There are 3 columns. SAMPN is a household index, PERNO is the person index within each family, and the other columns describe each person's trips. I want to pick the rows that have the same value for some or all family members, and to pick all rows of those PERNO, even if some rows of that PERNO are not duplicated. Please note that this is not simply finding duplicate rows.

Example:

              SAMPN    PERNO       time
                1        1          19:00
                1        1          18:00
                1        1          20:00
                1        2          20:00
                1        3          15:00
                1        3          21:00
                2        1          19:00
                2        1          18:00
                2        2          20:00
                2        2          21:00
                2        3          19:00
                2        3          21:00
                2        4          1:00
                2        4          8:00

First family, SAMPN==1:

The first person (PERNO==1) and the second person (PERNO==2) have the same time (20:00), so all rows for persons 1 and 2 must be selected.

Second family, SAMPN==2:

The first person (PERNO==1) and the third person (PERNO==3) have the same time at 19:00, so all rows for persons 1 and 3 must be selected. PERNO==2 and PERNO==3 also have the same time at 21:00, so all rows for person 2 must be selected as well.

Output (the rows that remain after the selected rows are removed):

              SAMPN    PERNO       time
                1        3          15:00
                1        3          21:00
                2        4          1:00
                2        4          8:00

We can find the PERNO values that have any duplicated time within a household and keep only the rows of persons who have no duplicated time.

library(dplyr)

df %>%
  group_by(SAMPN) %>%
  filter(!PERNO %in% unique(PERNO[duplicated(time) | duplicated(time, fromLast = TRUE)]))

#  SAMPN PERNO time 
#  <int> <int> <chr>
#1     1     3 15:00
#2     1     3 21:00
#3     2     4 1:00 
#4     2     4 8:00 
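
If the selected rows are needed as well as the remaining ones, the same idea can be split into two data frames. The following is a minimal base R sketch of that, not part of the answer above; the names `key`, `dup_row`, `person`, `shared`, `kept` and `removed` are made up for illustration, and the data is the example from the question.

```r
# Example data from the question (times kept as character strings).
df <- data.frame(
  SAMPN = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  PERNO = c(1, 1, 1, 2, 3, 3, 1, 1, 2, 2, 3, 3, 4, 4),
  time  = c("19:00", "18:00", "20:00", "20:00", "15:00", "21:00",
            "19:00", "18:00", "20:00", "21:00", "19:00", "21:00",
            "1:00", "8:00"),
  stringsAsFactors = FALSE
)

# Flag rows whose (SAMPN, time) pair occurs more than once.
key <- df[c("SAMPN", "time")]
dup_row <- duplicated(key) | duplicated(key, fromLast = TRUE)

# A person is "shared" if any of their rows is flagged.
person <- paste(df$SAMPN, df$PERNO)
shared <- person %in% person[dup_row]

kept    <- df[!shared, ]  # persons with no time shared inside the family
removed <- df[shared, ]   # all rows of persons who share at least one time
```

Like the dplyr version, this also flags a time that the same person repeats, since only SAMPN and time are compared.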

A solution using dplyr.

library(dplyr)

dat2 <- dat %>%
  group_by(SAMPN) %>%
  # D is TRUE when the time occurs only once within the household
  mutate(D = !duplicated(time) & !duplicated(time, fromLast = TRUE)) %>%
  group_by(SAMPN, PERNO) %>%
  # keep a person only if none of their times is shared
  filter(all(D)) %>%
  ungroup() %>%
  select(-D)

dat2
# # A tibble: 4 x 3
#   SAMPN PERNO time 
#   <int> <int> <chr>
# 1     1     3 15:00
# 2     1     3 21:00
# 3     2     4 1:00 
# 4     2     4 8:00

Data


dat <- read.table(text = "              SAMPN    PERNO       time
                1        1          '19:00'
                1        1          '18:00'
                1        1          '20:00'
                1        2          '20:00'
                1        3          '15:00'
                1        3          '21:00'
                2        1          '19:00'
                2        1          '18:00'
                2        2          '20:00'
                2        2          '21:00'
                2        3          '19:00'
                2        3          '21:00'
                2        4          '1:00'
                2        4          '8:00'",
                  header = TRUE, stringsAsFactors = FALSE)

An option with anti_join

library(dplyr)

# rows whose (SAMPN, time) pair occurs more than once
dup <- duplicated(df1[c(1, 3)]) | duplicated(df1[c(1, 3)], fromLast = TRUE)
anti_join(df1, df1[dup, c("SAMPN", "PERNO")])
#     SAMPN PERNO  time
#1     1     3 15:00
#2     1     3 21:00
#3     2     4  1:00
#4     2     4  8:00

Or with only tidyverse syntax

df1 %>% 
   group_by(SAMPN, time) %>%
   filter(n() > 1) %>% 
   ungroup %>% 
   select(-time) %>% 
   anti_join(df1, .)

Or another single-line option is a join with data.table

library(data.table)

setDT(df1)[!(df1[df1[, .I[.N > 1], .(SAMPN, time)]$V1, 
             .(SAMPN, PERNO)]), on = .(SAMPN, PERNO)]
#  SAMPN PERNO  time
#1:     1     3 15:00
#2:     1     3 21:00
#3:     2     4  1:00
#4:     2     4  8:00

Or with base R

subset(df1, !paste(SAMPN, PERNO) %in% do.call(paste, subset(df1, 
      ave(seq_along(time), SAMPN, time, FUN = length) > 1, select = -time)))

Data

df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), PERNO = c(1L, 1L, 1L, 2L, 3L, 3L, 1L, 1L, 
2L, 2L, 3L, 3L, 4L, 4L), time = c("19:00", "18:00", "20:00", 
"20:00", "15:00", "21:00", "19:00", "18:00", "20:00", "21:00", 
"19:00", "21:00", "1:00", "8:00")), class = "data.frame", row.names = c(NA, 
-14L))

  • You say that you want to select those common rows, but the output shows the exact opposite. Do you want to remove those common rows?
  • Yes, and it would be great if I could save them in another data frame.
  • Well, I have more columns than time, and I think duplicated accepts only one column.
  • @hghg A workaround would be to paste all the columns together for which you want to check duplicated and then use duplicated on that column. Like df %>% mutate(common = paste0(time, col1, col2)) %>% group_by(SAMPN) %>% filter(!PERNO %in% unique(PERNO[duplicated(common) | duplicated(common, fromLast = TRUE)]))
  • Error in duplicated.default(start_hr, start_min, TRPDUR, ACTDUR, fromLast = TRUE) : hash table is full
  • I just added more columns next to time.
  • @hghg If my answer works on your example data frame but not on your real-world data frame, then your example data frame does not represent your real-world data frame well. Please consider asking a new question with proper examples.
  • I just have more columns than time; my question is exactly the same. So if I put more columns next to time in your code, it will not work?
  • I don't know what you want to achieve.
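
For the case raised in the comments, where several columns must match at once, one possible approach is to call duplicated() on a data frame of the key columns, since it then compares whole rows and no pasted helper column is needed. The error quoted above appears to come from passing several vectors to duplicated() as separate arguments, though that is a guess from the message alone. The sketch below is base R on made-up data: the `purpose` column and the name `match_cols` are hypothetical stand-ins for the real trip columns.

```r
# Hypothetical data: `purpose` stands in for the extra trip columns.
df <- data.frame(
  SAMPN   = c(1, 1, 1, 1, 1),
  PERNO   = c(1, 1, 2, 3, 3),
  time    = c("19:00", "20:00", "20:00", "20:00", "21:00"),
  purpose = c("work", "shop", "shop", "work", "home"),
  stringsAsFactors = FALSE
)

# Columns that must all match for two rows to count as the same trip.
match_cols <- c("time", "purpose")

# duplicated() on a data frame compares whole rows, so the household
# index is simply included in the key.
key <- df[c("SAMPN", match_cols)]
dup_row <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Persons with at least one flagged row are removed with all their rows.
person <- paste(df$SAMPN, df$PERNO)
shared <- person %in% person[dup_row]

kept    <- df[!shared, ]
removed <- df[shared, ]
```

Here person 3 also travels at 20:00 but with a different `purpose`, so only persons 1 and 2 end up in `removed`.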