What is the most efficient way to search for substrings in a Pandas DataFrame?

I have a Pandas DataFrame containing 75k rows of text (approx. 350 chars per row). I need to search for occurrences of a list of 45k substrings within that DataFrame.

The expected output is an authors_data dict containing the list of authors and the number of occurrences. The following code assumes I have a df['text'] column and a list of substrings named authors_list.

authors_data = {}
for author in authors_list:
    count = 0
    for i, row in df.iterrows():
        if author in row.text:
            count += 1
    authors_data[author] = count
    print(author, authors_data[author])

I ran some initial tests: 10 authors took about 50 seconds, so the complete list would take a few days to run. I'm looking for more time-efficient ways to run the code.

Is df.iterrows() fast enough? Are there any specific libraries that I should look into?

Let me know!

I tried this and it does what you are looking for. You could test it and see if it's faster.

authors_data = {}
for author in authors_list:
    authors_data[author] = df['AUTHORCOL'].map(lambda x: author in x).sum()

#1 Delimited values

If your authors are clearly delineated, e.g. comma-separated in each series element, you can use collections.Counter with itertools.chain:

from collections import Counter
from itertools import chain

# map(set) deduplicates repeated authors within a row, so each author is
# counted at most once per row
res = Counter(chain.from_iterable(df['Authors'].str.split(',').map(set)))

# Counter({'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1,
#          'John Williams': 2, 'Philip K Dick': 1, 'Philip Roth': 1,
#          'Ursula K Le Guin': 1})

#2 Arbitrary strings

Of course, such structured data isn't always available. If your series elements are strings with arbitrary data and your list of pre-defined authors is small, you can use pd.Series.str.contains.

L = ['George Orwell', 'John Steinbeck', 'Frank Herbert', 'John Williams']

res = {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}

# {'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1, 'John Williams': 2}

This works because pd.Series.str.contains returns a series of Boolean values, which you can sum, since True counts as 1 in most numeric computations in Python / Pandas. Passing regex=False turns off regex matching, which improves performance for plain substring searches.
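
A small illustration of the mechanism, using a hypothetical two-row series (not part of the setup below):

s = pd.Series(['George Orwell wrote 1984', 'Dune by Frank Herbert'])

s.str.contains('George Orwell', regex=False)
# 0     True
# 1    False
# dtype: bool

s.str.contains('George Orwell', regex=False).sum()
# 1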

Performance

Pandas string-based methods are notoriously slow. You can instead use sum with a generator expression and the in operator for an extra speed-up:

df = pd.concat([df]*100000)

%timeit {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}    # 420 ms
%timeit {i: sum(i in x for x in df['Authors'].values) for i in L}           # 235 ms
%timeit {i: df['Authors'].map(lambda x: i in x).sum() for i in L}           # 424 ms

Notice that, for scenario #1, the Counter methods are actually more expensive because they require splitting as a preliminary step:

chainer = chain.from_iterable

%timeit Counter(chainer([set(i.split(',')) for i in df['Authors'].values]))  # 650 ms
%timeit Counter(chainer(df['Authors'].str.split(',').map(set)))              # 828 ms

Further improvements
  1. Solutions for scenario #2 are not perfect, since they won't differentiate (for example) between John Williams and John Williamson. You may wish to use a specialist package if this kind of differentiation is important to you.
  2. For both #1 and #2, you may wish to consider the Aho-Corasick algorithm, which scans each string once for all substrings simultaneously. There are example implementations available, although more work may be required to count the elements found within each row; one possible sketch follows below.
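
A minimal sketch of that idea, assuming the third-party pyahocorasick package (my assumption; it is not used elsewhere in this thread) and the df['text'] / authors_list names from the question:

import ahocorasick  # assumption: pip install pyahocorasick

# Build one automaton over all authors, then scan each row a single time
# instead of once per author.
automaton = ahocorasick.Automaton()
for author in authors_list:
    automaton.add_word(author, author)
automaton.make_automaton()

authors_data = {author: 0 for author in authors_list}
for text in df['text']:
    # the set counts each author at most once per row
    for author in {value for _, value in automaton.iter(text)}:
        authors_data[author] += 1

Note this still matches raw substrings, so the John Williams / John Williamson caveat from point 1 applies here as well.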

Setup

import pandas as pd

df = pd.DataFrame({'Authors': ['Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
                               'John Williams,Philip Roth,John Williams,George Orwell',
                               'George Orwell,John Steinbeck,George Orwell,John Williams']})

Not a complete answer, but there are a few things you can do to make things faster:

- Use regular expressions: create a pattern and then compile it, as in e.g. "Find out how many times a regex matches in a string in Python". In your case you only need to compile each author's pattern once; see the sketch below.
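
A minimal sketch of the compile-once idea, assuming the df['text'] column and authors_list from the question:

import re

# Compile one pattern per author up front, then reuse it on every row.
# re.escape guards against authors containing regex metacharacters.
patterns = {author: re.compile(re.escape(author)) for author in authors_list}

authors_data = {
    author: sum(1 for text in df['text'] if pattern.search(text))
    for author, pattern in patterns.items()
}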

- You have two loops. Assuming a reasonable number of authors, put the smallest loop inside; you would be surprised how important this can be at times. In other words, search for all authors in a row before moving on to the next row. 350 characters can fit in the CPU cache, and if you are lucky this can save you a lot of time. See the inverted-loop sketch below.
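
A sketch of that loop inversion, again assuming df['text'] and authors_list from the question:

# Visit each row once and test every author against it, so the ~350-character
# string stays hot in cache while it is being searched.
authors_data = {author: 0 for author in authors_list}
for text in df['text']:
    for author in authors_list:
        if author in text:
            authors_data[author] += 1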

Taking things to the limit, though probably not that easy: a compiled pattern is an automaton that looks at each character of the input only once and recognizes the output (that is why you "compile" patterns, see https://en.wikipedia.org/wiki/Deterministic_finite_automaton). You could create all the automata, then take each character of the input and feed it to all of them. Each input character would then be processed only "once" per automaton, so the total cost still scales with the number of authors. A single-pattern sketch follows.
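
One practical approximation of that idea (my addition, not from the answer above): combine all the authors into a single alternation pattern, so each row is scanned once for all authors:

import re
from collections import Counter

# One combined pattern; re.escape protects authors containing metacharacters.
combined = re.compile('|'.join(re.escape(author) for author in authors_list))

authors_data = Counter()
for text in df['text']:
    # set() counts each author at most once per row
    authors_data.update(set(combined.findall(text)))

Bear in mind that Python's re module uses a backtracking engine rather than building a true DFA, so this is a convenience rather than a guaranteed single-pass speed-up.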

A one-liner might be helpful.

authors_data = {author: df.text.map(lambda x: author in x).sum() for author in authors_list}
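
One caveat (my addition, not part of this answer): the expression author in x raises a TypeError if df.text contains missing values, so you may want to fill them first:

# Assumption: rows with missing text should count as zero matches.
texts = df.text.fillna('')
authors_data = {author: texts.map(lambda x: author in x).sum() for author in authors_list}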

Comments
  • Can you show us an example of your dataframe? A few rows should suffice, including rows with multiple authors.
  • imo this is a prototypical opportunity for cython conversion a la pandas.pydata.org/pandas-docs/stable/enhancingperf.html. also, i'm pretty sure an iterrows() for loop is a bad-performing strategy and you're better off vectorizing with numpy or using list comprehension. but you should implement a few strategies and time it yourself.
  • A very helpful answer on the topic was removed for some reason. I'll post the answer that was given to me, but yes, list comprehension gives spectacularly better results. I'll read your tutorial; it sounds like the kind of thing I need.
  • Glad I could help ! :)
  • My list of authors is rather clearly delineated so no worries here. I just iterate through a list of strings. And list comprehension is much faster than the methods I used previously. Thanks a lot.