Matching and filling in blanks of a data frame in R

I have data with double entries, that looks like this:

+-----+-------+-----------+-----------+--------+
| id  | first |   last    | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | NA     |
| 3n8 | max   | well      | 1915      | NA     |
| 15z | linda | NA        | 1991      | dan    |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
| 9i5 | max   | well      | NA        | mike   |
+-----+-------+-----------+-----------+--------+

There are multiple entries for a single person, but each entry has unique data that needs to be preserved. I want to merge these entries, keeping all information. Only the "id" column does not have to match; I want to keep the first "id" entry in the list as the final "id". So my final data frame would look like this:

+-----+-------+-----------+-----------+--------+
| id  | first |   last    | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | dan    |
| 3n8 | max   | well      | 1915      | mike   |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
+-----+-------+-----------+-----------+--------+

In this example, there are two entries with last name "degeneres" that did not get merged because the birthyear does not match. Entries whose fields all match (ignoring NAs) did get merged.

So far, the farthest I got is generating a list grouped by matching first names:

df <- data.frame(id = c("a12", "3n8", "15z", "1y9", "84z", "9i5"), first = c("linda", "max", "linda", "pam", NA, "max"), last = c("john", "well", NA, "degeneres", "degeneres", "well"), birthyear = c("1991", "1915", "1991", "1855", "1950", NA), father = c(NA, NA, "dan", NA, "hank", "mike"), stringsAsFactors = F)

name_list <- list()
i <- 1
for(n in df$first) {
  name_list[[i]] <- df[df$first == n,]
  i <- i + 1
}

I also tried to apply merge in a meaningful way, but that does not give me the desired results:

merge(x = df, y = df, by = c("first", "last", "birthyear", "father"))

+---------+-----------+-----------+--------+------+------+
|   first |   last    | birthyear | father | id.x | id.y |
+---------+-----------+-----------+--------+------+------+
| linda   | john      | 1991      | <NA>   | a12  | a12  |
| linda   | NA        | 1991      | dan    | 15z  | 15z  |
| max     | well      | 1915      | NA     | 3n8  | 3n8  |
| max     | well      | NA        | mike   | 9i5  | 9i5  |
| NA      | degeneres | 1950      | hank   | 84z  | 84z  |
| pam     | degeneres | 1855      | NA     | 1y9  | 1y9  |
+---------+-----------+-----------+--------+------+------+

How could I best proceed?

EDIT:

Thanks for the responses so far! Just to be clear: I don't want to be conservative in determining which row describes a unique person. For example, this input:

+-----+-------+------+-----------+--------+
| id  | first | last | birthyear | father |
+-----+-------+------+-----------+--------+
| 9i5 | max   | well | NA        | mike   |
| 9i6 | dan   | well | NA        | mike   |
| 9i7 | NA    | well | NA        | NA     |
+-----+-------+------+-----------+--------+

needs to give this output:

+-----+-------+------+-----------+--------+
| id  | first | last | birthyear | father |
+-----+-------+------+-----------+--------+
| 9i5 | max   | well | NA        | mike   |
| 9i6 | dan   | well | NA        | mike   |
+-----+-------+------+-----------+--------+

EDIT2:

So I've used Adam's script on my data set. It works great; there is only a hiccup because of exactly the problem that Salix predicted/found. I have a row with very little data about a woman named Linda. It turns out there are two Lindas that are definitely unique, and a third entry named Linda with no further information.

The script is now trying to match the unknown Linda to both of the other two unique Lindas. I've traced the issue down to a collision in the merge_id object. For my data set, it looks like this:

+------+------+
| V1   | V2   |
+------+------+
|  188 |  916 |
|  188 | 1048 |
|  752 | 1048 |
|  916 | 1048 |
| 1048 | 1058 |
+------+------+

As you can see, person 1048 matches with people who do not match with each other. So for example 188 - 916 - 1048 could all be the same person, because 188 matches 916, 188 matches 1048 and 916 matches 1048. All fine.

But then person 752 also matches with 1048, but does not match with 188 or 916. Ergo, 1048 does not have enough information and needs to be deleted.

I'm trying to come up with a function that detects this collision and deletes 1048 from the dataset.
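
The collision described above can be sketched in a few lines of base R. This is only an illustration under assumptions: `is_pair()` and `find_collisions()` are hypothetical helper names, and the rule used here (an id is ambiguous when any two of its partners are not themselves a pair) is one possible reading of the problem, not necessarily the rule an answer would implement.

```r
# Pair table copied from the merge_id example above:
# each row is a pair of row numbers that were found to match.
pairs <- matrix(c(188,  916,
                  188,  1048,
                  752,  1048,
                  916,  1048,
                  1048, 1058),
                ncol = 2, byrow = TRUE)

# Hypothetical helper: TRUE if a and b appear together in some row of `pairs`.
is_pair <- function(a, b, pairs) {
  any((pairs[, 1] == a & pairs[, 2] == b) |
      (pairs[, 1] == b & pairs[, 2] == a))
}

# Hypothetical helper: ids whose partners do not all match each other.
find_collisions <- function(pairs) {
  ids <- sort(unique(as.vector(pairs)))
  ambiguous <- vapply(ids, function(id) {
    # everyone this id was paired with, in either column
    partners <- unique(c(pairs[pairs[, 1] == id, 2],
                         pairs[pairs[, 2] == id, 1]))
    if (length(partners) < 2) return(FALSE)
    combos <- combn(partners, 2)
    !all(apply(combos, 2, function(p) is_pair(p[1], p[2], pairs)))
  }, logical(1))
  ids[ambiguous]
}

collisions <- find_collisions(pairs)
collisions
```

For the table above only 1048 is flagged, since 752 is paired with 1048 but with none of 1048's other partners.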

I am not sure how efficient this is, but this seems to be working. I am using 3 custom functions.

This would be easy to adapt so that it does not match when there are too many NAs in a row, and for other cases like that. The main trick in f_compare() is to build two data frames representing each row combination to compare against, rather than looping.

Functions:

f_compare()

Compare each row to every other row. We use combn() to develop all the unique row combinations. The function returns a list; its id element is a two-column matrix of row numbers that are "duplicates", where the first column (the id) is ignored and NA counts as a match.

Edit: Extended the logic to force a minimum number of fields to actually match by value rather than from an NA wildcard. We require the number of TRUE match values in the row plus the number of NA values in the row to equal the total number of fields.

Edit2: Added a check to see if a record is "bad". If there are matched pairs (a, b) and (c, b), it checks to see if (a, c) or (c, a) is a pair as well. If not, then the offending record c is dropped from the database. Again, this could fairly easily be adapted to flag rather than drop the row.
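
To make the counting rule concrete, here is a minimal sketch applied to the three-row example from the question's first EDIT. The helper name `match_pairs()` is hypothetical and only mirrors the rule described above (enough fields equal by value, and every other field NA on at least one side); it is not the full f_compare().

```r
# Example input from the question's first EDIT.
dat <- data.frame(
  id        = c("9i5", "9i6", "9i7"),
  first     = c("max", "dan", NA),
  last      = c("well", "well", "well"),
  birthyear = c(NA_character_, NA_character_, NA_character_),
  father    = c("mike", "mike", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mirrors the counting rule only.
match_pairs <- function(dat, min_match = 1, exclude_cols = 1) {
  fields <- as.matrix(dat[, -exclude_cols])     # character matrix of compared fields
  pairs  <- t(combn(seq_len(nrow(dat)), 2))     # every unique pair of row numbers
  compare <- fields[pairs[, 1], , drop = FALSE] ==
             fields[pairs[, 2], , drop = FALSE]
  row_true <- rowSums(compare, na.rm = TRUE)    # fields equal by value
  row_na   <- rowSums(is.na(compare))           # fields with NA on at least one side
  keep <- row_true >= min_match & row_true + row_na == ncol(compare)
  pairs[keep, , drop = FALSE]
}

loose  <- match_pairs(dat, min_match = 1)  # pairs (1, 3) and (2, 3)
strict <- match_pairs(dat, min_match = 2)  # no pairs
```

With min_match = 1, row 3 pairs with both row 1 and row 2, while rows 1 and 2 do not pair with each other (their first names conflict). That is exactly the situation the "bad" record check is meant to catch.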

f_compare <- function(dat, .min_match = 1, .exclude_cols = c(1)) {
  
  # grid of unique row id combinations
  dat_rows <- t(combn(seq_len(nrow(dat)), 2))
  
  # grid of all row id combinations (e.g., (1, 2) and (2, 1))
  dat_rows_all <- expand.grid(seq_len(nrow(dat)), seq_len(nrow(dat)))
  dat_rows_all <- dat_rows_all[dat_rows_all[,1] != dat_rows_all[,2], ]
  
  # function to find record matches based on a grid specification
  f_match <- function(dat, dat_rows, .min_match, .exclude_cols) {
    
    compare <- dat[dat_rows[, 1], -.exclude_cols] == dat[dat_rows[, 2], -.exclude_cols]
    
    row_true <- rowSums(compare, na.rm = TRUE)
    row_na <- rowSums(is.na(compare))
    
    which_rows <- which((row_true >= .min_match) & (row_true + row_na == ncol(compare)))
    rbind(dat_rows[which_rows,])
    
  }
  
  # matches for each grid
  match_rows <- f_match(dat, dat_rows, .min_match, .exclude_cols)
  match_rows_all <- f_match(dat, dat_rows_all, .min_match, .exclude_cols)
  
  # function to determine if it is a "bad" record
  f_bad <- function(check_index, id_comb, id_all) {
    
    if (length(id_comb[id_comb[,2] == check_index, 1]) > 1) {
      trans_rows <- t(combn(id_comb[id_comb[,2] == check_index, 1], 2))
    
      compare_trans <- id_all[rep(seq_len(nrow(id_all)), times = nrow(trans_rows)),] == trans_rows[rep(seq_len(nrow(trans_rows)), each = nrow(id_all)),]
    
      return(!any(rowSums(compare_trans) == ncol(compare_trans)))
    } else {
      return(FALSE)
    }
  }
  
  # check all rows with a potential match to see if it is "bad"
  check_ids <- unique(match_rows[,2])
  
  if (length(check_ids) > 0) {
    bad_ids <- check_ids[sapply(check_ids, f_bad, match_rows, match_rows_all)]
  } else {
    bad_ids = check_ids
  }
  
  list(id = rbind(match_rows[!(match_rows[,2] %in% bad_ids), ]), bad_id = bad_ids)

}

f_merge()

Given a pair of row numbers, fill in the NA slots in each of the two rows with values from the other, and return the first row.

f_merge <- function(id, dat) {
  x <- dat[id[1],]
  y <- dat[id[2],]
  
  y[is.na(y)] <- x[is.na(y)]
  x[is.na(x)] <- y[is.na(x)]
  
  x
}
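
As a small illustration of the fill step on plain vectors, here is a sketch using two of the "linda" rows from the question with the id column left out. It uses ifelse() instead of the logical indexing above, purely to show the idea.

```r
# Two rows describing the same person (id column omitted).
x <- c(first = "linda", last = "john", birthyear = "1991", father = NA)
y <- c(first = "linda", last = NA,     birthyear = "1991", father = "dan")

# Keep x's value where it is present, otherwise take y's value.
filled <- ifelse(is.na(x), y, x)
filled
```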

merge_records()

Recursively work through the data set until there are no more duplicates left to merge. There is some case logic in here to account for things like R converting a single row matrix to a vector and when to exit the recursion.

Edit2: Modified the merge to drop "bad" records.
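
The remark about R converting a single-row matrix to a vector can be reproduced with a toy matrix. This sketch only demonstrates the base R behaviour; the object names are made up for the example.

```r
m <- matrix(1:6, ncol = 2)

one_row_vec <- m[2, ]                # dimensions dropped: plain vector of length 2
one_row_mat <- m[2, , drop = FALSE]  # stays a 1 x 2 matrix
one_row_rb  <- rbind(m[2, ])         # rbind() also turns it back into a 1 x 2 matrix

is.null(dim(one_row_vec))
dim(one_row_mat)
dim(one_row_rb)
```

Wrapping the subset in rbind(), as the functions here do, keeps nrow() and apply(..., 1, ...) working when only one pair of rows matches.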

merge_records <- function(dat) {
  merge_id <- f_compare(dat)
  
  # drop bad rows
  if (length(merge_id$bad_id) > 0) {
    dat <- dat[-merge_id$bad_id,]
  }
  
  dat2 <- do.call("rbind", apply(merge_id$id, 1, f_merge, dat = dat))
  dat2 <- rbind(dat2, dat[which(!(seq_len(nrow(dat)) %in% c(merge_id$id))), ])
  
  if (nrow(dat2) > 1) {
    dat2 <- dat2[which(!(duplicated(dat2))),]
    
    if (nrow(f_compare(dat2)$id) > 0) merge_records(dat2) else return(dat2)
  } else {
    dat2
  }
  
}
End Result:
merge_records(df)
    id first      last birthyear father
1  a12 linda      john      1991    dan
21 3n8   max      well      1915   mike
5  1y9   pam degeneres      1855   <NA>
6  84z  <NA> degeneres      1950   hank

You can order your df by first and then check if each row matches the one under it. When it does, you replace the NA values with the values of the other row. And then you can remove duplicates.

I tweaked the previous function to give the best merge while keeping IDs, and simplified it since you didn't need the extra params.

I added entries to better test.

new function :

merge_rows <- function(df, orderCol = 0){
  for(L in c(TRUE, FALSE)){ #depending on how NAs are ordered, you get different results, so doing it both ways to assure better merging
    for(i in seq_along(orderCol)){
      #orderCol = 0 (the default) means no sort column: rows are compared in their current order
      if(orderCol[i] != 0) df <- df[order(df[[orderCol[i]]], na.last=L),]
      row = 2 #counter and not for loop because size of df changes
      while(row <= nrow(df)){
        r1 = row-1
        r2 = row
        #compare the 2 rows & check that it's a match (no conflicting data)
        is_match = sum((df[r1,-1] == df[r2,-1])==FALSE, na.rm = TRUE) == 0  # -1 for id col
        #if it's a match --> fill missing info of row 1 and remove row 2
        if(is_match) {
          df[r1, is.na(df[r1,])] <- df[r2, is.na(df[r1,])]
          df <- df[-r2, ]
        }
        row = row+1
      }
    }
  }
  rownames(df) <- NULL #rename your rows in order
  return(df) #return new df
}

merge_rows(df, 2:3) #in my case 2:3 gave same result as 2:5, depending on your columns, you might need to adjust
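
Because the function only compares adjacent rows, where the NA rows land after sorting decides which neighbour they get. A tiny sketch with a made-up `first` column shows the two orderings that the loop over na.last produces:

```r
first <- c("max", NA, "linda", "linda")

ord_last  <- order(first, na.last = TRUE)   # 3 4 1 2: the NA row ends up next to "max"
ord_first <- order(first, na.last = FALSE)  # 2 3 4 1: the NA row ends up next to "linda"

first[ord_last]
first[ord_first]
```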

result:

   id first      last birthyear father
1 84z  <NA> degeneres      1950   hank
2 1y9   pam degeneres      1855   <NA>
3 a12 linda      john      1991    dan
4 d33 linda      well      1991 robert
5 3n8   max      well      1915   mike

df used :

df <- data.frame(
    id = c("d33","d34","a11", "a12", "3n8", "15z", "ba4", "1y9", "84z", "9i5"), 
    first = c("linda","linda",NA, "linda", "max", "linda", "max", "pam", NA, "max"), 
    last = c("well","well","john", "john", "well", NA, NA, "degeneres", "degeneres", "well"), 
    birthyear = c("1991","1991","1991", "1991", "1915", "1991", NA, "1855", "1950", NA), 
    father = c(NA,"robert",NA, NA, NA, "dan", NA, NA, "hank", "mike"), 
    stringsAsFactors = T)

Posting as another answer because it uses a different method.

Inspired by Adam's function and the refining of my original function, I made one small and simplified function that gives the same result as my other one while being at least twice as fast (according to the benchmark function).

mergeRows <- function(df) {
  #list of rows to compare
  rows <- t(combn(1:nrow(df), 2))
  #finds first pair of rows with no conflicting data (no need to check that there's a match if we know there's no false match)
  is_T = which(rowSums((df[rows[, 1],-1] == df[rows[, 2],-1])==F, na.rm = T) == 0)[1]
  while(!is.na(is_T)){ 
    id <- rows[is_T,]
    df[id[1], is.na(df[id[1],])] <- df[id[2], is.na(df[id[1],])]
    df <- df[-id[2],]
    #list of rows to compare from the reduced df
    rows <- t(combn(1:nrow(df), 2))
    #finds next pair of rows with no conflicting data
    is_T = which(rowSums((df[rows[, 1],-1] == df[rows[, 2],-1])==F, na.rm = T) == 0)[1]
  }
  rownames(df) <- NULL #optional --> renames them in order
  return(df)
}

mergeRows(df)

Comments
  • Are you just trying to fill in NAs where you can?
  • yes. unless there is conflicting data, then i want to treat each entry as separate.
  • Could you provide an example of conflicting data?
  • This is an interesting problem and might be trickier than it looks. Is there a threshold for amount of overlap you want to use? I could imagine a "greedy" implementation leading to excessive matches; e.g. if you had a row with "linda" and "1991" does that mean all lindas with missing years should be assigned to 1991, even though only one with a year was found?
  • ...and what if you also have a row with only "linda" and "1980"? Can you use neither since there isn't enough to uniquely connect either one to other rows with "linda" and missing years?
  • to test my thing I added some more entries to have more variety on where the NAs were, have 2 persons with the same last name, and I tried this on that df and it gives weird results :/
  • the df for ref : df <- data.frame( id = c("d33","d34","a11", "a12", "3n8", "15z", "ba4", "1y9", "84z", "9i5"), first = c("linda","linda",NA, "linda", "max", "linda", "max", "pam", NA, "max"), last = c("well","well","john", "john", "well", NA, NA, "degeneres", "degeneres", "well"), birthyear = c("1992","1991","1991", "1991", "1915", "1991", NA, "1855", "1950", NA), father = c("robert","robert",NA, NA, NA, "dan", NA, NA, "hank", "mike"), stringsAsFactors = T)
  • don't even know if op checked the answers, but did you try yours with more than 1 .min_match? it duplicates ids. Also, you forgot to make the new args passable from the main one. But, yeah, playing around with your script, the only way I found to not duplicate ids (and sometimes have weird results) was to use f_merge() with only the first row of merge_id, remove duplicates and then recheck merge_id with new data, etc, until merge_id had no rows.
  • hey guys, sorry for the slow reply, i haven't been around much in the weekend. i'm testing both of your scripts now, they look promising! thanks for the effort :)
  • Hey Adam, so i've run your script and it is definitely the most promising one so far! My dataset went from 1243 rows to 1227 rows. That may not seem like much, but the reduction combined with the knowledge that none of these people have the same info is exactly what i'm looking for! Thank you so much for your work.
  • I have run your script, but I don't understand exactly what the approach is. Before I start tinkering myself, I have some questions:
  • 1 - in your last line you call mergeRows(df1, col, colToMatch) but "col" is not an object. Should this be "colToSort"?
  • 2 - related to (1), the function mergeRows expects a variable "orderCol" but this variable is never used in the function. What is its purpose?
  • 3 - In your third edit you said I would not want to assume every person with the same last name and birthdate are the same person - for the purpose of my script, i want exactly this. If someone is only known as "max", i want to merge it with any other "max" and don't keep that row as a separate person.
  • additionally: the dataset that i'm actually working on has 1243 rows, and 11 columns (so there are more variables than in my example). When i run this script without edits, it returns 1243 rows - nothing gets matched. I think your script focuses on getting matching names, while for this set, any matching info will do.