Can you join dataframes with multiple keys in one of the joining columns?

I would like to join the following two dataframes.

The first dataframe has multiple keys in one column:

>>> import pandas as pd
>>> df = pd.DataFrame(data={'col1': [1,2,3], 'key': ['x, y','y', 'z, x']})
>>> df
   col1   key
0     1  x, y
1     2     y
2     3  z, x

For each key in the first dataframe I have a mapping of sorts in a second dataframe, like this:

>>> df2 = pd.DataFrame(data={'key': ['x','y','z'], 'value': ["v1,v2, v3","v4,v3", "v5"]})

>>> df2
  key      value
0   x  v1,v2, v3
1   y      v4,v3
2   z         v5

I would like to end up with all values next to their corresponding keys in one column, ideally with duplicates removed, as in the row where col1 is 1 (x and y both map to v3).

>>> df3
   col1   key           value
0     1  x, y  v1, v2, v3, v4
1     2     y          v4, v3
2     3  z, x  v1, v2, v3, v5

Build a dict from df2 and look up each key:

d=dict(zip(df2.key,df2.value))
df['New']=[','.join([d.get(y) for y in x.split(', ')]) for x in df.key]

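Now remove the duplicates. Strip each value first, since ' v3' and 'v3' would otherwise count as different values. dict.fromkeys keeps the first-seen order, which a set does not guarantee:

df.New = df.New.str.split(',').apply(lambda x: ','.join(dict.fromkeys(v.strip() for v in x)))
df
   col1   key          New
0     1  x, y  v1,v2,v3,v4
1     2     y        v4,v3
2     3  z, x  v5,v1,v2,v3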
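
One comment on this page mentions having to fill NAs in the key column first. A minimal sketch of a lookup that tolerates missing or unknown keys might look like the following. The helper name `lookup_values`, the extra fourth row, and the choice to return an empty string for a missing key are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Same data as in the question, plus one hypothetical row with a missing key.
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'key': ['x, y', 'y', 'z, x', None]})
df2 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value': ['v1,v2, v3', 'v4,v3', 'v5']})

# Map each key to its list of values, stripped of surrounding whitespace.
lookup = {k: [v.strip() for v in vals.split(',')]
          for k, vals in zip(df2.key, df2.value)}

def lookup_values(keys):
    """Return the unique values for a comma-separated key string, in first-seen order."""
    if pd.isna(keys):
        return ''  # assumption: a missing key yields an empty string
    seen = {}  # dict keys preserve insertion order
    for k in keys.split(','):
        for v in lookup.get(k.strip(), []):  # unknown keys contribute nothing
            seen[v] = None
    return ', '.join(seen)

df['value'] = df['key'].apply(lookup_values)
print(df)
```

Splitting the values once up front avoids re-splitting the joined string later, and it sidesteps the whitespace problem when comparing ' v3' with 'v3'.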

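A simple for loop also works:

for k, v in zip(df2.key, df2.value):
    # regex=False treats each key as a literal string rather than a pattern
    df.key = df.key.str.replace(k, v, regex=False)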

Outputs

    col1    key
0   1       v1,v2, v3, v4,v3
1   2       v4,v3
2   3       v5, v1,v2, v3

To remove the duplicates, transform each string into a sorted list of unique values and assign the result back:

df.key = df.key.transform(lambda s: sorted(set(k.strip() for k in s.split(','))))

    col1    key
0   1       [v1, v2, v3, v4]
1   2       [v3, v4]
2   3       [v1, v2, v3, v5]
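
Replacing one key at a time can go wrong if a replacement value happens to contain a later key, because the already-replaced text gets scanned again. A possible one-pass variant is sketched below. It assumes the keys are plain word-like tokens (so that `\b` word boundaries apply), and the `pattern` and `replaced` names are only illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'key': ['x, y', 'y', 'z, x']})
df2 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value': ['v1,v2, v3', 'v4,v3', 'v5']})

d = dict(zip(df2.key, df2.value))

# Match any whole key; re.escape guards against regex metacharacters in keys.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in d) + r')\b'

# Each match is replaced exactly once, so replaced text is never rescanned.
replaced = df.key.str.replace(pattern, lambda m: d[m.group(0)], regex=True)

# Normalise to a deduplicated, sorted, comma-separated string.
df['value'] = replaced.apply(
    lambda s: ', '.join(sorted({v.strip() for v in s.split(',')}))
)
print(df)
```

This keeps the original key column intact and writes the result to a separate column as a string rather than a list.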

First we unnest (split) your values to rows:

# split on ', ' so the keys carry no leading space and match df2.key
df_new = pd.concat([pd.Series(row['col1'], row['key'].split(', '))
                    for _, row in df.iterrows()]).reset_index().rename({0: 'col1', 'index': 'key'}, axis=1)

print(df_new)
  key  col1
0   x     1
1   y     1
2   y     2
3   z     3
4   x     3

Then we merge in the values on the key column and group by col1 to aggregate:

df_final = pd.merge(df_new,df2, on='key',how='left')
df_final = df_final.groupby('col1').agg(', '.join).reset_index()

print(df_final)

   col1   key             value
0     1  x, y  v1,v2, v3, v4,v3
1     2     y             v4,v3
2     3  z, x     v5, v1,v2, v3
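
The result above still repeats v3 in the first row. One way to drop the duplicates, sketched here under the assumption that `DataFrame.explode` and named aggregation are available (pandas 0.25 or later), is to split the value column into rows as well and keep only unique entries when aggregating. The names `left`, `right`, `merged` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'key': ['x, y', 'y', 'z, x']})
df2 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value': ['v1,v2, v3', 'v4,v3', 'v5']})

# One row per (col1, key) pair, with whitespace stripped from each key.
left = df.assign(key=df['key'].str.split(',')).explode('key').reset_index(drop=True)
left['key'] = left['key'].str.strip()

# One row per (key, value) pair, with whitespace stripped from each value.
right = df2.assign(value=df2['value'].str.split(',')).explode('value').reset_index(drop=True)
right['value'] = right['value'].str.strip()

merged = left.merge(right, on='key', how='left')

# Collapse back to one row per col1, keeping each key and value only once.
result = (
    merged.groupby('col1', sort=False)
    .agg(key=('key', lambda s: ', '.join(s.unique())),
         value=('value', lambda s: ', '.join(s.dropna().unique())))
    .reset_index()
)
print(result)
```

Because both columns are normalised before the merge, inconsistent spacing such as "v1,v2, v3" no longer affects the comparison.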

Comments
  • thanks a bunch. this worked perfectly. just had to fill NAs in the key column of the first df.
  • Like this concise answer, +1
  • Nice idea, but I guess splitting doesn't need a loop: df[['col1']].join(df['key'].str.split(',', expand=True).stack().droplevel(-1).rename('key'))