Efficiently searching for partial strings in Python lists

Looking for an efficient method to search for partial strings in Python (3.6+) lists.

I have two lists. listA is a list of strings of pathname + a unique filename:

['/pathname/uniquestring.ext', '/pathname/uniquestring.ext', '/pathname/uniquestring.ext' ...]

(created using glob(), filenames are all given and already exist)

listB is a list of dictionaries. Each dictionary has the same set of keys, but unique values.

[{key1:value1, key2:value2}, {key1:value3, key2:value4}, ...]

(also already given)

One key:value pair in each dictionary in listB will have a value that is contained in one unique item in listA.

However, the position of the value as it appears in each item of listA is indeterminate.

What I wanted was: for each item in listB, find the item in listA that contains a substring matching the k:v pair in the dict, and create a new dict (or list of tuples) as a "lookup table" (the goal was to correct a corrupted exif creation date in a set of image files).

Example:

listA = ['/pathname/abdce_654321.ext', '/pathname/a3b4c5_123456.ext', '/pathname/cbeebie_645321_abcde.ext', ...]

listB = [{"id": "123456", "create_date": "23/05/2014"}, ...]

new_dict = {"/pathname/a3b4c5_123456.ext": "23/05/2014", ...}

I've got exactly what I want from a dict comp as follows:

{j:i['create_date'] for j in listA for i in listB  if i['id'] in j}

But, even for my very small files (~5500 items) this takes 12s on my (admittedly rather old) laptop.

Presumably this is because I have to iterate over the whole of listB ~5500 times using my method.

Is there a more efficient way to do this in Python?

(nb i'm not seeking advice on how to correct exif data with python; this is a generalised q about string lookups in lists)

CORRECTIONS & CLARIFICATIONS

  1. I neglected to place quotes around the value '123456' in my example, implying of course that it is an integer. In the real-world data it is a string, as are all of the equivalent values in the actual data I dealt with.
  2. The 'id' substring as it appears in a listA item is almost always delimited by underscores, but does not always appear in the same position in the whole string. So performing a split('_') on each item, for instance, won't always place the 'id' string at position [-1] or [-2] or [-3], although [-1] would take care of ~80% of cases.
  3. All 'id's are unique: they do not appear more than once in either list, each filename is unique in listA, and no 'id' appears in more than one dictionary.

Thanks for the interest from everyone so far btw.

I can see what the two comments are getting at. The big question is whether we need to use the in operator at all: it is only necessary if we don't know where the id appears in the path string. If the id is always in a particular place, we can extract it and use a constant-time dictionary lookup:

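def extract_id(path):
    # TODO: return the id portion of path (depends on where the id sits)
    raise NotImplementedError

# Build an id -> create_date index once, then do one lookup per path.
ids = {item['id']: item['create_date'] for item in listB}
new_dict = {path: ids[extract_id(path)] for path in listA}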

which is only O(N) as opposed to your current O(N**2).
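
Given the clarification that the id is almost always delimited by underscores but its position varies, one possible way to fill in the extraction step is to test every underscore-separated token of the file name against the id dictionary. This is only a sketch under that assumption: the function name build_lookup and its parameters are made up for illustration, and any path whose id is not delimited by underscores is simply left out of the result and would need a fallback such as the original substring test.

```python
import os

def build_lookup(paths, records, id_key="id", value_key="create_date"):
    """Map each path to the value of the record whose id appears as an
    underscore-delimited token in the file name (hypothetical helper)."""
    # One pass over the records: id -> value, for O(1) average lookups.
    by_id = {rec[id_key]: rec[value_key] for rec in records}
    lookup = {}
    for path in paths:
        # Drop the directory and the extension, then split into tokens.
        stem = os.path.splitext(os.path.basename(path))[0]
        for token in stem.split("_"):
            if token in by_id:
                lookup[path] = by_id[token]
                break  # ids are unique, so stop at the first hit
    return lookup

listA = ['/pathname/abdce_654321.ext',
         '/pathname/a3b4c5_123456.ext',
         '/pathname/cbeebie_645321_abcde.ext']
listB = [{"id": "123456", "create_date": "23/05/2014"}]

new_dict = build_lookup(listA, listB)
print(new_dict)
```

Each path is split into a handful of tokens and each token costs one dictionary lookup, so the total work grows with the number of paths rather than with the product of the two list lengths.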

First of all, here are generalised lists to help with testing:

listA = ['/pathname/abdce_%s.ext' % str(x) for x in range(10000)]

listB = [{'id': str(number), "create_date": "23/05/2014"} for number in range(10000)]

hello = {j: i['create_date'] for j in listA for i in listB if i['id'] in j}

Running that with 10,000 values took my machine 8.8 seconds on average (9.5 seconds if I print the dictionary afterwards).

Compiling that code with Cython (a superset of Python that compiles to C) brought the time down to 4.4 seconds for me.

See the code below:

cpdef dict main():
    cdef int x
    cdef int number
    cdef str j  # j holds a whole path string, not a single C char
    cdef dict i

    listA = ['/pathname/abdce_%s.ext' % str(x) for x in range(10000)]

    listB = [{'id': str(number), "create_date": "23/05/2014"} for number in range(10000)]

    hello = {j: i['create_date'] for j in listA for i in listB if i['id'] in j}

    return hello

Comments
  • Can the ids inside the dictionaries repeat in other dictionaries?
  • Your sample 'id' value 123456 is an int, so the i['id'] in j test would fail here. For the id portion in the filename, is the id always delimited, either by underscores or the _ portion?
  • Can there be more than one entry in listB to match a filename in listA? If not, you could pop found elements from (a copy of) listB every time you find a match for a given filename.
  • @MartijnPieters - well, yes, but you'll have to forgive my oversight in not stringifying 123456; in the real-life case, the 'id' value is a string, and the code works perfectly.
  • Your answer has the same bug @Martijn Pieters pointed out in the OP's code.
  • @martineau Actually it doesn't, since the generated id is a string.
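
The suggestion in the comments to remove found elements from a copy of listB can be sketched roughly as follows. The function name match_with_removal is hypothetical, and the approach relies on the stated guarantee that each id matches exactly one filename. It remains a nested scan, so it trims the work rather than changing the O(N**2) behaviour.

```python
def match_with_removal(paths, records):
    """Substring matching that drops each record once it has been used
    (hypothetical helper; assumes one record per matching path)."""
    remaining = list(records)  # shallow copy; the caller's list is untouched
    lookup = {}
    for path in paths:
        for idx, rec in enumerate(remaining):
            if rec["id"] in path:
                lookup[path] = rec["create_date"]
                del remaining[idx]  # this record cannot match another path
                break  # leave the inner loop right after mutating the list
    return lookup

paths = ['/pathname/abdce_654321.ext',
         '/pathname/a3b4c5_123456.ext',
         '/pathname/cbeebie_645321_abcde.ext']
records = [{"id": "123456", "create_date": "23/05/2014"},
           {"id": "654321", "create_date": "24/05/2014"}]

result = match_with_removal(paths, records)
print(result)
```

Because the inner loop stops at the first match and the list of candidates shrinks as matches are found, later paths are compared against fewer records than in the original dict comprehension.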