Use contains to merge data frames


I have two separate files: one from our service providers and one internal (HR).

The service providers write our employees' names in different ways: some use firstname lastname, some use the first letter of the first name plus the last name, and some use lastname firstname. The HR file stores the first and last names in separate columns.

DF1

Full Name
0   B.pitt
1   Mr Nickolson Jacl
2   Johnny, Deep
3   Streep Meryl

DF2

First   Last
0   Brad    Pitt
1   Jack    Nicklson
2   Johnny  Deep
3   Streep  Meryl

My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing this with static values using the following code:

    df1[['Full Name']][df1['Full Name'].str.contains('B')
                       & df1['Full Name'].str.contains('pitt')]

Which gives the following result:

Full Name
0   B.pitt

The challenge is comparing the two datasets. Any advice on that, please?

Regards

If you are just checking whether a name exists, this could be useful. Because it is rare for two people to have exactly the same family name, I recommend splitting the names in DF1 and comparing family names. To be sure, you can then compare first names too. You can do this easily with a for loop:

for name in df1['Full Name']:
    # replace 'pitt' with the family name you are searching for
    if 'pitt' in name.lower():
        print("yes")

If you need to compare on other aspects, just let me know.
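To run that check across both data frames instead of one static value, one option is a cross join followed by a substring filter. This is only a rough sketch: the helper name `looks_like` is made up for illustration, `how='cross'` assumes pandas 1.2 or newer, and the rule "last name plus first initial appear somewhere in the full name" is a simplifying assumption that can produce false positives on larger data.

```python
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jacl',
                                  'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})

# Pair every provider name with every HR row (assumes pandas >= 1.2).
pairs = df1.merge(df2, how='cross')

def looks_like(row):
    """Hypothetical rule: the last name and the first initial both occur in the full name."""
    full = row['Full Name'].lower()
    first, last = row['First'].lower(), row['Last'].lower()
    return last in full and first[0] in full

# Keep only the pairs that satisfy the substring rule.
matches = pairs[pairs.apply(looks_like, axis=1)].reset_index(drop=True)
print(matches)
```

Exact substring checks cannot cope with misspellings, so the 'Mr Nickolson Jacl' row finds no partner here. Also keep in mind that a cross join grows as len(df1) * len(df2), so it is only practical for modest table sizes.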


I suggest using the nameparser module to parse the names:

pip install nameparser

Then you can process your data frames:

from nameparser import HumanName
import pandas as pd

df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})

names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(f"{first} {last}") for first, last in zip(df2['First'], df2['Last'])]

After that you can compare the HumanName instances, which expose the parsed fields. A parsed name looks like this:

<HumanName : [
    title: '' 
    first: 'Brad' 
    middle: '' 
    last: 'Pitt' 
    suffix: ''
    nickname: '' ]

I have used this approach to process thousands of names and merge them with the same names from other documents, and the results were good.

More about the module can be found at https://nameparser.readthedocs.io/en/latest/


Hey, you could use fuzzy string matching with fuzzywuzzy.

First create a Full Name column for df2:

df2_ = (df2['First'] + ' ' + df2['Last']).rename('Full Name').to_frame()

Then merge the two dataframes by index (this assumes the rows of both frames are already in the same order):

merged_df = df2_.merge(df1, left_index=True, right_index=True)

Now you can apply fuzz.token_sort_ratio to get the similarity:

from fuzzywuzzy import fuzz

merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)

This results in the following dataframe. You can now filter or sort it by similarity.

     Full Name_x    Full Name_y        similarity
0    Brad Pitt      B.pitt             80
1    Jack Nicklson  Mr Nickolson Jacl  80
2    Johnny Deep    Johnny, Deep       100
3    Streep Meryl   Streep Meryl       100
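If the rows of the two files are not already aligned, or installing fuzzywuzzy is not an option, a similar idea can be sketched with the standard library's difflib. Note that this swaps in a different scorer (`SequenceMatcher.ratio` on sorted, lowercased tokens), so its numbers will not match the fuzz.token_sort_ratio column above. The helper names `normalize` and `similarity` are invented for this example, and the two lists stand in for df1['Full Name'] and the combined First + Last column of df2.

```python
import re
from difflib import SequenceMatcher

# Stand-ins for df1['Full Name'] and the "First Last" strings built from df2.
full_names = ['B.pitt', 'Mr Nickolson Jacl', 'Johnny, Deep', 'Streep Meryl']
hr_names = ['Brad Pitt', 'Jack Nicklson', 'Johnny Deep', 'Streep Meryl']

def normalize(name):
    """Lowercase, keep only runs of letters, and sort the tokens."""
    tokens = re.findall(r'[a-z]+', name.lower())
    return ' '.join(sorted(tokens))

def similarity(a, b):
    """Similarity between 0 and 1 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# For every provider name, pick the HR name with the highest score.
best = {}
for name in full_names:
    best[name] = max(hr_names, key=lambda hr: similarity(name, hr))

for name, hr in best.items():
    print(f"{name!r} -> {hr!r} (score {similarity(name, hr):.2f})")
```

In practice you would probably also set a minimum score and treat anything below it as "no match", since `max` always returns some candidate even when none is a good fit.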


Comments
  • What is your desired output?
  • A DataFrame with the columns of the two dataframes.
  • Can you please show what the actual dataframes look like?
  • Don't DF1 and DF2 help?
  • That's an excellent solution. I tested it and it works, but it will take me some time to process it, as I'm new to Python. Thanks for your help.
  • Feel free to ask more specific questions about the solution if you need help processing it