Merge Tables: classify output according to how rows where joined

sql join multiple tables with conditions
inner join
sql join 3 tables with where clause
sql server join multiple tables
how to join multiple tables in sql
sql join 3 tables on same key
left join 3 tables
how to join 4 tables in sql query

While joining (large, complex) tables in R, I usually struggle to sanity check the results from this operation. See here a minimal reproducible example:

library(data.table) 
table1 <- data.table(id=c("A", "B", "C"), price=c(12,11,10))
table2 <- data.table(id=c("A", "C", "C", "D"), wharehouse=c("Colorado","Texas","New York", "Oregon"))


table_join <- merge(table1,table2,
                    by="id",
                    all.x=T,
                    all.y=T)

The expected output it is not rocket science:

structure(list(id = c("A", "B", "C", "C", "D"), price = c(12, 
                                                          11, 10, 10, NA), wharehouse = c("Colorado", NA, "Texas", "New York", 
                                                                                          "Oregon"), join = c("INNER JOIN", "LEFT JOIN", "INNER JOIN. MULTIPLE RIGHT JOIN", 
                                                                                                              "INNER JOIN. MULTIPLE RIGHT JOIN", "RIGHT JOIN")), row.names = c(NA, 
                                                                                                                                                                               -5L), class = c("data.table", "data.frame"))

But I would like to know how many rows did not match the other table, have one match, multiple matches...

I would like some info (maybe a new row) describing how resulted from the merge. See a possible example:

table_join[1, join:="INNER JOIN"]
table_join[2, join:="LEFT JOIN"]
table_join[3, join:="INNER JOIN. MULTIPLE RIGHT JOIN"]
table_join[4, join:="INNER JOIN. MULTIPLE RIGHT JOIN"]
table_join[5, join:="RIGHT JOIN"]

Here the "expected output"

structure(list(id = c("A", "B", "C", "C", "D"), price = c(12, 
11, 10, 10, NA), wharehouse = c("Colorado", NA, "Texas", "New York", 
"Oregon"), join = c("INNER JOIN", "LEFT JOIN", "INNER JOIN. MULTIPLE RIGHT JOIN", 
"INNER JOIN. MULTIPLE RIGHT JOIN", "RIGHT JOIN")), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"))

For sure, with big tables could appear new situations (full cartesian joins), matches were the id exists on the other table but with NAs (in my example let say we have id D but price NA).

Also, it will help me to trace complex situations when merging multiple tables

Does a wrapper of merge exist in R that performs such operation? I remember when a was a young, dreamy Research Assistant that Stata was able to do something related, but I do not know how to do it automatically in R.

The problem per se is quite easy and it is solved by tracking the id's frequency of each table. Below follows my solution, but it will probably need some optimization for large tables.

EDIT1:

Bug fixing: category was overwritten; eval(track.col) instead of eval(parse(text = track.col)))

Moreover it is now possible to assign a custom name to the join column.

library(data.table)

track.merge <- function(x, y, on, track.col){

  x[, N := .N, by = on][]
  y[, N := .N, by = on][]

  table_join <- merge(x, y, by=on, all.x=T, all.y=T)

  x[, N := NULL, by = on][]
  y[, N := NULL, by = on][]

  table_join[N.x > 1 & N.y > 1,                              
             eval(track.col) := "INNER JOIN. MULTIPLE LEFT RIGHT JOIN"][]

  table_join[N.x > 1 & is.na(eval(parse(text = track.col))), 
             eval(track.col) := "INNER JOIN. MULTIPLE LEFT JOIN"][]

  table_join[N.y > 1 & is.na(eval(parse(text = track.col))), 
             eval(track.col) := "INNER JOIN. MULTIPLE RIGHT JOIN"][]

  table_join[is.na(N.x),                                     
             eval(track.col) := "RIGHT JOIN"][]

  table_join[is.na(N.y),                                
             eval(track.col) := "LEFT JOIN"][]

  table_join[is.na(eval(parse(text = track.col))),      
             eval(track.col) := "INNER JOIN"][]

  table_join[, ':=' (N.x = NULL, N.y = NULL)][]
}

EDIT2

A much more readable version of the same function

track.merge2 <- function(x, y, on, track.col){

  x[, N := .N, by = on][]
  y[, N := .N, by = on][]

  table_join <- merge(x, y, by=on, all.x=T, all.y=T)
  track_ids <- character(NROW(table_join))

  x[, N := NULL, by = on][]
  y[, N := NULL, by = on][]

  track_ids[table_join$N.x > 1 & table_join$N.y > 1] <- "INNER JOIN. MULTIPLE LEFT RIGHT JOIN"
  track_ids[table_join$N.x > 1 & track_ids == ""]    <- "INNER JOIN. MULTIPLE LEFT JOIN"
  track_ids[table_join$N.y > 1 & track_ids == ""]    <- "INNER JOIN. MULTIPLE RIGHT JOIN"
  track_ids[is.na(table_join$N.x)]                   <- "RIGHT JOIN"
  track_ids[is.na(table_join$N.y)]                   <- "LEFT JOIN"
  track_ids[track_ids == ""]                         <- "INNER JOIN"

  table_join[[track.col]] <- track_ids
  table_join[, ':=' (N.x = NULL, N.y = NULL)][]
}

TEST:

table1 <- data.table(id=c("A", "C", "C", "B", "F", "H", "H"), price=c(12,11,10,13,10,15,3)) 
table2 <- data.table(id=c("A", "C", "C", "F", "F", "H", "L"), wharehouse=c("Colorado","Texas","New York", "Washington", "Illinois", "Florida", "Kansas")) 

> table1
   id price
1:  A    12
2:  C    11
3:  C    10
4:  B    13
5:  F    10
6:  H    15
7:  H     3

> table2
   id wharehouse
1:  A   Colorado
2:  C      Texas
3:  C   New York
4:  F Washington
5:  F   Illinois
6:  H    Florida
7:  L     Kansas

> track.merge(table1, table2, "id", "join")
    id price wharehouse                                 join
 1:  A    12   Colorado                           INNER JOIN
 2:  B    13       <NA>                            LEFT JOIN
 3:  C    11      Texas INNER JOIN. MULTIPLE LEFT RIGHT JOIN
 4:  C    11   New York INNER JOIN. MULTIPLE LEFT RIGHT JOIN
 5:  C    10      Texas INNER JOIN. MULTIPLE LEFT RIGHT JOIN
 6:  C    10   New York INNER JOIN. MULTIPLE LEFT RIGHT JOIN
 7:  F    10 Washington      INNER JOIN. MULTIPLE RIGHT JOIN
 8:  F    10   Illinois      INNER JOIN. MULTIPLE RIGHT JOIN
 9:  H    15    Florida       INNER JOIN. MULTIPLE LEFT JOIN
10:  H     3    Florida       INNER JOIN. MULTIPLE LEFT JOIN
11:  L    NA     Kansas                           RIGHT JOIN


> all.equal(track.merge2(x, y, on = "id", "join"), track.merge(x, y, on = "id", "join"))
[1] TRUE

Considering tables with 1000k rows and two columns merging is ~2.5x slower:

library(microbenchmark)

table1 <- data.table(id = sample(1e+6, 1e+6, replace = T), price = rnorm(1e+6))
table2 <- data.table(id = sample(1e+6, 1e+6, replace = T), state = sample(LETTERS, 1e+6, replace = T))

microbenchmark(merge        = merge(table1,table2, by="id", all.x=T, all.y=T),
               track.merge  = track.merge(table1, table2, "id", "join"), 
               track.merge2 = track.merge2(table1, table2, "id", "join"), 
               times = 10L)

         expr       min       lq      mean    median        uq       max neval cld
        merge  569.7727  573.433  577.8784  577.2759  581.9219  586.9951    10 a  
  track.merge 1456.4417 1536.982 1545.6427 1556.5226 1563.6150 1623.3078    10   c
 track.merge2 1392.6832 1464.968 1460.2484 1471.0332 1477.2330 1487.1828    10  b 

COMMENT ON MISSING IDS

From a database perspective it does not make much sense having NAs as id. Ids are the key to relate your tables into a relational database. If there are records with missing ids it is pointless to relate them with other table, so I'd either filter them out or try to fix them before merging tables.

SQL multiple joins for beginners with examples, The joins allow us to combine data from two or more tables so that we are able to join Inner join returns the rows that match in both tables The grey-colored area indicates rows which will be the output of the query: Question: Please generate the proper query according to the below Venn diagram. In the MERGE statement, the TOP clause specifies the number or percentage of rows that are affected after the source table and the target table are joined, and after rows that do not qualify for an insert, update, or delete action are removed.

Here is my solution using dplyr. Just as @Gerald T said, you can get all the information by looking at the merged table.

You can get the frequency table using this code.

library(tidyverse)
table1 %>% left_join(table2) %>% 
      group_by(id) %>%
      summarise(num_wharehouse = sum(!is.na(wharehouse))) 
Joining, by = "id"
# A tibble: 3 x 2
  id    num_wharehouse
  <chr>          <int>
1 A                  1
2 B                  0
3 C                  2

Then you can get the statistics you want.

table1 %>% left_join(table2) %>% 
          group_by(id) %>%
          summarise(num_wharehouse = sum(!is.na(wharehouse))) %>%
          summarise(merged = sum(num_wharehouse > 0),
                    not_merged = sum(num_wharehouse == 0), 
                    single_match = sum(num_wharehouse == 1),
                    multi_match = sum(num_wharehouse > 1))
Joining, by = "id"
# A tibble: 1 x 4
  merged not_merged single_match multi_match
   <int>      <int>        <int>       <int>
1      2          1            1           1

Learn SQL: Join multiple tables, To achieve that, we'll combine INNER JOINs and LEFT JOINs. For each call, we want to display what was the outcome as well the first and the last name of the This will result in returning only rows having pairs in another table; When you' re using Auditing (37); Data classification (1); Data masking (7). rows shows the estimated number of rows examined and rows × filtered shows the number of rows that will be joined with the following table. For example, if rows is 1000 and filtered is 50.00 (50%), the number of rows to be joined with the following table is 1000 × 50% = 500.

About joining and relating tables—ArcMap, Joining the attributes from a table; Joining data by location (spatially) ArcGIS allows you to associate records in one table with records in another table Suppose you have a layer where each polygon is classified according to its land- use type. are not dynamic and require the results to be saved to a new output layer. merge is a function in the pandas namespace, and it is also available as a DataFrame instance method merge(), with the calling DataFrame being implicitly considered the left object in the join. The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-index join.

A wrapper function can be applied following the join to analyze its path as follows. Scenarios given in OP and comments:

#Scenario 1
table1 <- data.table(id=c("A", "B", "C"), price=c(12,11,10)); table2 <- data.table(id=c("A", "C", "C", "D"), wharehouse=c("Colorado","Texas","New York", "Oregon"));

#Scenario 2
table1 <- data.table(id=c("C", "C", "C"), price=c(12,11,10)); table2 <- data.table(id=c("A", "C", "C", "D"), wharehouse=c("Colorado","Texas","New York", "Oregon"));

#Scenario 3
table1 <- data.table(id=c(NA, "C", "C"), price=c(12,11,10)); table2 <- data.table(id=c("A", "C", "C", NA), wharehouse=c("Colorado","Texas","New York", "Oregon")) 

#Scenario 4
table1 <- data.table(id=c("A", "A", "C"), price=c(12,11,10)); table2 <- data.table(id=c("B", "C", "C","D"), wharehouse=c("Colorado","Texas","New York", "Oregon")) 

setkeyv(table1,"id")
setkeyv(table2,"id")
table_join  <- merge(table1,table2,by="id",all.x=T,all.y=T)

write_description <- function(p,w,n) { 
  inners <- (!is.na(p) & !is.na(w))
  lefts <-  (!is.na(p) & is.na(w))
  rights <- ((is.na(p) & !is.na(w))) | (n > 1 & !is.na(w))
  multis <- n > 1

  paste0(ifelse(inners,"INNER JOIN ",""),
         ifelse(multis,"MULTIPLE ",""),
         ifelse(lefts,"LEFT JOIN ",""),
         ifelse(rights,"RIGHT JOIN ",""))
}


table_join[,description:=write_description(price,wharehouse,.N),by="id"]

Results SCENARIO 1:

> table_join
   id price wharehouse                     description
1:  A    12   Colorado                     INNER JOIN 
2:  B    11         NA                      LEFT JOIN 
3:  C    10      Texas INNER JOIN MULTIPLE RIGHT JOIN 
4:  C    10   New York INNER JOIN MULTIPLE RIGHT JOIN 
5:  D    NA     Oregon                     RIGHT JOIN 

Results: SCENARIO 2

> table_join
   id price wharehouse                     description
1:  A    NA   Colorado                     RIGHT JOIN 
2:  C    12      Texas INNER JOIN MULTIPLE RIGHT JOIN 
3:  C    12   New York INNER JOIN MULTIPLE RIGHT JOIN 
4:  C    11      Texas INNER JOIN MULTIPLE RIGHT JOIN 
5:  C    11   New York INNER JOIN MULTIPLE RIGHT JOIN 
6:  C    10      Texas INNER JOIN MULTIPLE RIGHT JOIN 
7:  C    10   New York INNER JOIN MULTIPLE RIGHT JOIN 
8:  D    NA     Oregon                     RIGHT JOIN 

Results: SCENARIO 3

> table_join
   id price wharehouse                     description
1: NA    12     Oregon                     INNER JOIN 
2:  A    NA   Colorado                     RIGHT JOIN 
3:  C    11      Texas INNER JOIN MULTIPLE RIGHT JOIN 
4:  C    11   New York INNER JOIN MULTIPLE RIGHT JOIN 
5:  C    10      Texas INNER JOIN MULTIPLE RIGHT JOIN 
6:  C    10   New York INNER JOIN MULTIPLE RIGHT JOIN 

Results: SCENARIO 4

> table_join
   id price wharehouse                     description
1:  A    12         NA             MULTIPLE LEFT JOIN 
2:  A    11         NA             MULTIPLE LEFT JOIN 
3:  B    NA   Colorado                     RIGHT JOIN 
4:  C    10      Texas INNER JOIN MULTIPLE RIGHT JOIN 
5:  C    10   New York INNER JOIN MULTIPLE RIGHT JOIN 
6:  D    NA     Oregon                     RIGHT JOIN 

Merge queries (Power Query) - Excel, One query result contains all columns from a primary table, with one column The related table contains all rows that match each row from a primary table With a merge, you can join two data queries that are in Excel or an external data source. Power Query analyzes each data source and classifies it into the defined� For example, you can use the target table as the source table, but in that case, the target table becomes another instance table that contains the row data as it was prior to the update to the target table by the MERGE operation (see “Example 18: Using the Target Table as the Source Table” on page 479).

Join (SQL), The result of the join can be defined as the outcome of first taking the Cartesian product (or Cross join) of all rows in the tables (combining every row in table A� Outer join: The combined table produced by an outer join contains all rows that existed in one table with blanks in the columns for the rows that did not exist in the second table. For instance, if table1 contains a row for Joe and a row for Sally, and table2 contains only a row for Sally, an inner join would contain only one row: the row for

Joining and Concatenating, A Joiner node joins two tables together on one or more common key values. On adult.csv data set: 1. calculate the average age and number of rows for the 4� First, from ?merge:. The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.

[PDF] MERGING vs. JOINING: Comparing the DATA Step , It focuses on merging or joining two data sets in any combination with either the each file are ordered according to the key field(s). The information in the matched records is combined to form one output record. Here is from the second file/table for the joined data values, the ID given in Exhibit 11 is often misleading. I have two tables with similar information. Let's call them items_a and items_b. They should be one, but they are coming from different sources, so they aren't. When I full-join the two table, some rows end up with data from either one or both tables. One of the columns in both tables is category_id.

Comments
  • Well, the devil is in the details. Check: table1 <- data.table(id=c("C", "C", "C"), price=c(12,11,10)) table2 <- data.table(id=c("A", "C", "C", "D"), wharehouse=c("Colorado","Texas","New York", "Oregon")) track.merge(table1,table2,"id")` or table1 <- data.table(id=c(NA, "C", "C"), price=c(12,11,10)) table2 <- data.table(id=c("A", "C", "C", NA), wharehouse=c("Colorado","Texas","New York", "Oregon")) track.merge(table1,table2,"id") And what happens when there is a column named join? But I think it is a good first draft
  • Check table1 <- data.table(id=c("A", "C", "C"), price=c(12,11,10)) table2 <- data.table(id=c("A", "C", "C"), wharehouse=c("Colorado","Texas","New York"))
  • @JonnyCrunch I've edited again. The custom column name feature makes the function much less readable, if not essential I'd consider removing it. Another way would be to return a character vector (unnamed). I see what I can do
  • It has been a long way, but thank you very much! The bounty is yours! BTW: I dont care about being slower, I will save a lot more time by knowing how my tables matched!!!
  • See my comment to Gerald, please.
  • In your 3rd scenario where id=NA does it imply you want to merge by id values that are NA or to omit these? The merge() function will treat these as key values and merge will be fine unless it's not desired behavior?
  • Looking over your additional scenarios, the approach holds as consistent with the example given in the structure() -- the question arises if a valid INNER join would seek to also indicate a LEFT join? In the example, merge on "A" yields only INNER, but this is also a valid LEFT join (for both "A" and "C"), but this is not articulated in structure() example?