Pandas DataFrame string replace followed by split and set intersection


I have the following pandas DataFrame:

import pandas as pd

data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns=['rule'])

and I have a list of integers:

r = [18, 55]

I want to keep a rule from the DataFrame above only if all of the integers in the list r are present in it. I tried the following code, which fails:

d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
# AttributeError: 'Series' object has no attribute 'split'

How can I achieve the desired filtering with pandas?

You were going in the right direction; you just need str.split (note the .str accessor) and apply instead:

d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
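For reference, here is the same approach run end-to-end as a self-contained sketch, using the question's data:

```python
import pandas as pd

# Reproduce the question's data.
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns=['rule'])
r = [18, 55]

# Replace '=>' with '#', split into tokens, and keep rows whose token
# set contains every value in r (compared as strings).
mask = (d['rule']
        .str.replace('=>', '#')
        .str.split('#')
        .apply(lambda x: set(x).issuperset(set(map(str, r)))))
print(d[mask])  # only '18#38#23#55=>35' survives
```

Note the set(map(str, r)) step: the tokens produced by str.split are strings, so the integers in r must be converted before the superset test.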


Using str.get_dummies

d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, map(str, r)].all(1)

Output:

0    False
1     True
dtype: bool

Detail:

get_dummies+loc returns

    18  55
0   1   0
1   1   1
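The boolean Series can then be used directly to index the frame. A minimal sketch (using a concrete list for the .loc column labels, and assuming every value in r occurs in at least one rule — otherwise .loc raises a KeyError for the missing column):

```python
import pandas as pd

d = pd.DataFrame(['18#38#123#23=>21', '18#38#23#55=>35'], columns=['rule'])
r = [18, 55]

# One indicator column per token; select the columns for r and require all 1s.
dummies = d['rule'].str.replace('=>', '#').str.get_dummies(sep='#')
mask = dummies.loc[:, [str(v) for v in r]].all(axis=1)
print(d[mask])  # keeps only rules containing every value in r
```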


My initial instinct would be to use a list comprehension:

df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])

def wrap(n):
    # require a character on each side of n that is not a digit
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)

patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)

Output:

0    False
1    False
2     True
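A variation on the same idea (my own tweak, not part of the answer above) uses plain negative lookarounds, which also match a number at the very start or end of the string — the lookbehind in wrap above requires a preceding character to exist:

```python
import pandas as pd

df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'],
                  columns=['rule'])
patterns = [18, 55]

def wrap(n):
    # match n only when it is not part of a longer run of digits
    return r'(?<!\d){}(?!\d)'.format(n)

# One boolean column per pattern; a row passes only if all patterns match.
mask = pd.concat([df['rule'].str.contains(wrap(p)) for p in patterns],
                 axis=1).all(axis=1)
print(mask.tolist())  # [False, False, True]
```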


My approach is similar to @RafaelC's answer, but it converts the column labels to int first:

new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)

# then you can assign new column for initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100

Output:

+-------+-------------------+------------+
|       |    rule           |   new_col  |
+-------+-------------------+------------+
|    0  | 18#38#123#23=>21  |      10    |
|    1  | 188#38#23#55=>35  |      10    |
|    2  | 18#38#23#55=>35   |     100    |
+-------+-------------------+------------+
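Put together as a runnable sketch (the three example rules here are my reconstruction from the output table, since the snippet above reuses d from the question):

```python
import pandas as pd

d = pd.DataFrame(['18#38#123#23=>21', '188#38#23#55=>35', '18#38#23#55=>35'],
                 columns=['rule'])
r = [18, 55]

new_df = d.rule.str.replace('=>', '#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)   # compare as ints, not strings
has_all = new_df[r].all(axis=1)

# Use the boolean mask to assign a new column on the original frame.
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
print(d)
```

Casting the dummy columns to int means r can be used for column selection as-is, with no str conversion; note how '188' stays distinct from '18', so the second row is correctly excluded.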


Comments
  • How can I use the boolean mask returned by your method to set the value of a new column in the dataframe?
  • Your solution might not exactly answer the OP's problem. It gives the same output for data = ['18#38#123#23=>21', '188#38#23#55=>35'], while the expected output would be False, False.