Filter pandas dataframe rows if any value on a list inside the dataframe is in another list

I have a pandas dataframe that contains a list in column split_categories:

df.head()

      album_id categories split_categories
    0    66562    480.494       [480, 494]
    1   114582        128            [128]
    2     4846          5              [5]
    3     1709          9              [9]
    4    59239    105.104       [105, 104]

I would like to select all the rows where at least one category is in a specific list, e.g. [480, 9, 104].

Expected output:

  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]

I managed to do it using apply:

def match_categories(row):
    selected_categories =  [480, 9, 104]
    result = [int(i) for i in row['split_categories'] if i in selected_categories]
    return result

df['matched_categories'] = df.apply(match_categories, axis=1)

But this code runs in production and this approach takes too long (I run it for multiple columns containing lists).

Is there a way to run something like:

df[~(df['split_categories'].anyvalue.isin([480, 9, 104]))]

Thanks

You can expand the inner list, and check if any items in the inner lists are contained in [480, 9, 104]:

l = [480, 9, 104]
df[df.categories.str.split('.', expand=True).isin(list(map(str, l))).any(axis=1)]

   album_id  categories split_categories
0     66562     480.494        [480,494]
3      1709       9.000              [9]
4     59239     105.104        [105,104]
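
For reference, here is a self-contained sketch of the same idea on the question's sample data. It assumes categories is stored as a dot-separated string; the names wanted, parts and no_match are only illustrative. The point it tries to show is that the split pieces are strings, so the lookup values have to be strings as well:

```python
import pandas as pd

# Sample data from the question; `categories` is assumed to be a string column.
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "categories": ["480.494", "128", "5", "9", "105.104"],
})

wanted = [480, 9, 104]

# One column per category; shorter rows are padded with None.
parts = df["categories"].str.split(".", expand=True)

# The pieces are strings, so compare against strings, not ints.
mask = parts.isin([str(x) for x in wanted]).any(axis=1)
filtered = df[mask]

# Comparing against the ints directly matches nothing.
no_match = parts.isin(wanted).any(axis=1)

print(filtered)
```

If categories is numeric in your frame, it would need an astype(str) first, and the string form of a float may not split the way you expect.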

You can convert each list to a set, intersect it with the wanted values, and convert the result to bool (this needs import numpy as np):

L = [480, 9, 104]
mask = np.array([bool(set(map(int, x)) & set(L)) for x in df['split_categories']])

Or convert the list column to a DataFrame, cast to float and compare with isin:

df1 = pd.DataFrame(df['split_categories'].values.tolist(), index=df.index)
mask = df1.astype(float).isin(L).any(axis=1)

df = df[mask]
print(df)
  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]
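
The question also builds a matched_categories column, so here is a minimal sketch of how the set-based idea could produce both that column and the row filter without DataFrame.apply. It assumes the list elements are ints (or strings that int() accepts); selected and filtered are illustrative names, and no timing claim is made for your data:

```python
import pandas as pd

# Sample data from the question; list elements are assumed to be ints here.
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "split_categories": [[480, 494], [128], [5], [9], [105, 104]],
})

selected = {480, 9, 104}  # a set makes each membership check cheap

# Same idea as the question's match_categories, as a plain list comprehension.
df["matched_categories"] = [
    [int(c) for c in cats if int(c) in selected]
    for cats in df["split_categories"]
]

# A row is kept when its matched list is non-empty.
mask = df["matched_categories"].str.len() > 0
filtered = df[mask]
print(filtered)
```

Unlike the original function, this converts each element with int() before the membership test, so string elements such as '480' would match too.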

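Series.isin cannot look inside the lists directly, so flatten them first with explode (pandas 0.25+), test each element with isin, then reduce back to one flag per row:

mask = df['split_categories'].explode().astype(float).isin([480, 9, 104]).groupby(level=0).any()
print(df[mask])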

Output:

  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]

Avoid a series of lists

You can split into multiple numeric series and then use vectorised Boolean operations. Python-level loops using row-wise operations are generally less efficient.

df = pd.DataFrame({'album_id': [66562, 114582, 4846, 1709, 59239],
                   'categories': ['480.494', '128', '5', '9', '105.104']})

split = df['categories'].str.split('.', expand=True).add_prefix('split_').astype(float)
df = df.join(split)

print(df)
#    album_id categories  split_0  split_1
# 0     66562    480.494    480.0    494.0
# 1    114582        128    128.0      NaN
# 2      4846          5      5.0      NaN
# 3      1709          9      9.0      NaN
# 4     59239    105.104    105.0    104.0

L = [480, 9, 104]
res = df[df.filter(regex='^split_').isin(L).any(axis=1)]

print(res)
#    album_id categories  split_0  split_1
# 0     66562    480.494    480.0    494.0
# 3      1709          9      9.0      NaN
# 4     59239    105.104    105.0    104.0

Another method:

my_list = [480, 9, 104]
pat = r'({})'.format('|'.join(str(i) for i in my_list))
# pat is now '(480|9|104)'; note that it also matches inside longer numbers (e.g. the 9 in 19)
df.loc[df.split_categories.astype(str).str.extract(pat, expand=False).dropna().index]

Or, matching whole numbers only by using word boundaries:

pat = '|'.join(r"\b{}\b".format(x) for x in my_list)
df[df.split_categories.astype(str).str.contains(pat,na=False)]

    album_id    categories  split_categories
0   66562       480.494     [480, 494]
3   1709        9.000       [9]
4   59239       105.104     [105, 104]

This will work with both the split_categories and categories columns.
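
To see why the word boundaries matter, here is a small sketch on made-up rows (the values 19 and 1048 are hypothetical, picked only because they contain 9 and 104 as substrings):

```python
import pandas as pd

# Hypothetical rows chosen to expose substring matches.
s = pd.Series([[480, 494], [19], [1048], [9]]).astype(str)

my_list = [480, 9, 104]

loose = "|".join(str(i) for i in my_list)                # '480|9|104'
strict = "|".join(r"\b{}\b".format(i) for i in my_list)  # whole numbers only

loose_hits = s.str.contains(loose).tolist()
strict_hits = s.str.contains(strict).tolist()
print(loose_hits)   # every row matches, including [19] and [1048]
print(strict_hits)  # only [480, 494] and [9] match
```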

Comments
  • What is the maximum size of a list in df['split_categories'], e.g. is it always 1 or 2 items?
  • df.split_categories.str.strip('[]') returns an array of NaN (the values inside split_categories are already lists, not strings). I used df[df.categories.str.split('.', expand=True).isin(map(str,l)).any(axis=1)] instead and it worked. Thanks
  • Ohh I see, I understood you had to use split_categories instead, updated the answer
  • str.contains here is better
  • @jezrael i had tried, got the warning UserWarning: This pattern has match groups. To actually get the groups, use str.extract. :(
  • try pat = '|'.join(r"\b{}\b".format(x) for x in L)
  • @jezrael that works, thank you. :) will add in edit. :) still learning string formatting. :D