Match word (starting with plus symbol) in pandas data frames

I have two pandas data frames. I would like to find the strings in one specific column ("keyword") that exist in both data frames. The first data frame:

keyword                     adGroup     goal6Value   adCost
[aaaa]                      (not set)   0            0.0
+bb +bb                     (not set)   0            0.0
+cc +cc                     (not set)   2072         0.0
[dddd]                      (not set)   0            0.0

The second data frame:

keyword                     status      Max          Min
[aaaa]                      (not set)   0.1          0.0
+bb +bb                     (not set)   0.2          0.0
+ff +ff                     (not set)   0.1          0.0
[gggg]                      (not set)   0.3          0.0

I would like the output to return all columns if the keyword is available in both data frames (keyword column). The output should look like this:

keyword    status       Max     Min    adGroup    goal6Value   adCost
[aaaa]    (not set)     0.1     0.0   (not set)   0            0.0
+bb +bb   (not set)     0.2     0.0   (not set)   0            0.0

I have converted the keyword column to string in both data frames and tried these options:

pd.merge(df1, df2, on='keyword')

and

df1.set_index('keyword').join(df2.set_index('keyword'))

However, both options only matched the keyword with brackets and did not return the keywords starting with a plus symbol even when they are available in both data frames.

Is there a way to match the keyword with the plus symbol as well in pandas?


I cannot reproduce your issue; the test below works fine. I'd suggest casting the keyword column to dtype object in both dataframes: df1['keyword'] = df1['keyword'].astype(object) and df2['keyword'] = df2['keyword'].astype(object).

dtype object seems to work for me, as shown below:

import pandas as pd

data_1 = {'keyword': ['[aaaa]','+bb +bb','+cc +cc','[dddd]'],
          'adGroup': ['(not set)','(not set)','(not set)','(not set)'],
          'goal6Value': ['0','0','2072','0'],
          'adCost': ['0.0','0.0','0.0','0.0']}

data_2 = {'keyword': ['[aaaa]','+bb +bb','+ff +ff','[gggg]'],
          'status': ['(not set)','(not set)','(not set)','(not set)'],
          'Max': ['0.1','0.2','0.1','0.3'],
          'Min': ['0.0','0.0','0.0','0.0']}

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)

test = pd.merge(df_1, df_2, on='keyword')
test.head()

   keyword    adGroup goal6Value adCost     status  Max  Min
0   [aaaa]  (not set)          0    0.0  (not set)  0.1  0.0
1  +bb +bb  (not set)          0    0.0  (not set)  0.2  0.0

test.dtypes

keyword       object
adGroup       object
goal6Value    object
adCost        object
status        object
Max           object
Min           object
dtype: object

Alternatively, the keyword column may have leading or trailing spaces in one dataframe that are not present in the other, which would prevent an exact match. This can be remedied with pandas.Series.str.strip() (see the pandas docs).
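As a minimal sketch of the whitespace idea, the example below uses made-up data in which one '+bb +bb' value carries a trailing space (an assumption for illustration, not something confirmed about the asker's data). It shows the merge missing that row before stripping and finding it afterwards.

```python
import pandas as pd

# Hypothetical data: the '+bb +bb' keyword in df2 has a trailing space.
df1 = pd.DataFrame({'keyword': ['[aaaa]', '+bb +bb'], 'adCost': [0.0, 0.0]})
df2 = pd.DataFrame({'keyword': ['[aaaa]', '+bb +bb '], 'Max': [0.1, 0.2]})

# Before cleaning, only '[aaaa]' matches exactly.
before = pd.merge(df1, df2, on='keyword')

# Cast to string and strip leading/trailing whitespace in both frames.
for df in (df1, df2):
    df['keyword'] = df['keyword'].astype(str).str.strip()

# After cleaning, both keywords match.
after = pd.merge(df1, df2, on='keyword')
print(after)
```

Note that str.strip() only removes whitespace at the ends of each value; it would not fix differences inside the keyword, such as a doubled space between the two words.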


I could not reproduce the issue; I was able to merge the two dataframes below:

import pandas as pd

df1=pd.DataFrame({'keyword':['[aaaa]','+bbbb'],'adGroup':['something','something']})
df2=pd.DataFrame({'keyword':['[aaaa]','+bbbb'],'adGroup':['something2','something2']})
df1.merge(df2,on='keyword')

    adGroup_x   keyword adGroup_y
0   something   [aaaa]  something2
1   something   +bbbb   something2

Maybe you need to change the dtype of the keyword column.
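One way to check whether the type is the problem is sketched below, reusing the two small frames from this answer. It inspects the column dtype and the Python type of each element, then casts both key columns to str before merging. Whether this changes anything for the asker's real data is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'keyword': ['[aaaa]', '+bbbb'], 'adGroup': ['something', 'something']})
df2 = pd.DataFrame({'keyword': ['[aaaa]', '+bbbb'], 'adGroup': ['something2', 'something2']})

# Inspect the column dtype and the Python type of each element;
# a mix of types in one column is a common reason merge keys fail to match.
for name, df in (('df1', df1), ('df2', df2)):
    print(name, df['keyword'].dtype)
    print(df['keyword'].map(type).value_counts())

# Cast both key columns to str so the keys are compared as strings.
df1['keyword'] = df1['keyword'].astype(str)
df2['keyword'] = df2['keyword'].astype(str)

merged = df1.merge(df2, on='keyword')
print(merged)
```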


EDITED

pd.merge works fine for me; I can't reproduce the problem either.

pd.merge(df1, df2, on='keyword')
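If the merge still drops rows on the real data, a diagnostic sketch like the following may help locate the mismatch. It uses an outer merge with indicator=True to list keywords found in only one frame, and repr() to expose hidden characters. The data is hypothetical: the double space inside one keyword is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical data: df2's keyword has a double space between the words.
df1 = pd.DataFrame({'keyword': ['[aaaa]', '+bb +bb'], 'adCost': [0.0, 0.0]})
df2 = pd.DataFrame({'keyword': ['[aaaa]', '+bb  +bb'], 'Max': [0.1, 0.2]})

# An outer merge keeps every row; the '_merge' column records whether
# each keyword was found in 'left_only', 'right_only', or 'both'.
check = pd.merge(df1, df2, on='keyword', how='outer', indicator=True)

# Keep only the keywords that did not match across the two frames.
unmatched = check[check['_merge'] != 'both']

# repr() makes extra spaces and other invisible characters visible.
for kw in unmatched['keyword']:
    print(repr(kw))
```

Comparing the repr() output of two keywords that look identical on screen usually shows where they differ.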
