python regex: extract list elements, each of which matches multiple patterns

I am totally new to python. Suppose I have a list as follows.

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT']

I would like to extract the elements that match both '1234' at positions 7-10 and 'SD' at positions 14-15, counting from 1 (i.e. slices [6:10] and [13:15]). This is just an example; it could be any combination of positions, with anything in between. The result would be as follows.

['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']

What I am doing now is to nest a filter() function inside another.

x = filter(lambda x: re.match('1234', x[6:10]), filter(lambda r: re.match('SD', r[13:15]), somelist))

This works but looks rather clunky and dumb. Can someone help me find a solution that's more elegant and faster? The list could contain millions of elements (from lines in a file).

There are many discussions about searching/matching any of the patterns/regexes (match A OR B). This is to match A AND B, which must be as common a problem as the OR problem. Apparently it's gonna get messy if I want to match A and B and C and ... at different locations.

Update: Thank you all. My original question was probably not clear enough. It's basically an 'element must match ALL of several patterns at given positions' question.

Inspired by Kcorlidy's response particularly, I gave it a few quick shots and these worked (and . indeed means 'anything', except \n according to the manual):

To match '1234' and 'SD' at said positions:

filter(lambda x: re.search(r'.{6}1234.{3}SD', x), somelist)

To match 'AAAA' and 'SD' at 0:4 and 13:15, respectively:

filter(lambda x: re.search(r'.{0}AAAA.{9}SD', x), somelist)

The take-home message is that the number in the curly braces means the 'distance' (number of characters) from the end of the previous pattern, or from the beginning of the string for the first pattern. It is not the start position of the pattern concerned. Strictly speaking, re.search only counts from the beginning if the pattern starts with ^ (or if re.match is used instead); otherwise it may find a match at a later offset. That's the key point. Simple stuff, which is probably why more people are interested in the match-A-or-B problem than in this match-A-and-B problem.
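The gap arithmetic described above can be automated. The following is only a sketch: build_position_regex is a hypothetical helper (not part of the re module) that takes 0-based start positions, as used in slicing, and turns them into the "distance from the previous pattern" counts. It assumes the keys are literal strings and do not overlap.

```python
import re

def build_position_regex(keys):
    """Build one regex from (start_index, literal) pairs.

    Hypothetical helper for illustration; start_index is 0-based.
    """
    parts = []
    cursor = 0  # index just past the previous key
    for start, literal in sorted(keys):
        gap = start - cursor  # characters to skip since the previous key
        if gap < 0:
            raise ValueError("keys overlap")
        parts.append(".{%d}" % gap)
        parts.append(re.escape(literal))  # treat the key as plain text
        cursor = start + len(literal)
    return re.compile("".join(parts))

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

pattern = build_position_regex([(6, "1234"), (13, "SD")])
print(pattern.pattern)  # .{6}1234.{3}SD

# pattern.match() anchors at the start of the string, so the first gap
# really is counted from position 0.
matches = [s for s in somelist if pattern.match(s)]
print(matches)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
```

Calling build_position_regex([(0, "AAAA"), (13, "SD")]) yields the pattern .{0}AAAA.{9}SD in the same way.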

Why use two regexes? It can actually be done with a single one:

import re

somelist = [ 
     'AAAA  1234   SD OXD',
     'AAAB  2342   DF BDD',
     'ERTE  3454   RE DFD',
     'GWED  1234   SD TCD',
     'AAAA  2353   SD MKX',
     'VERD  1234   IO ERT',
     'AAAA 2353   SD MKX',
     'AAAA  2353  SD MKX']

print(list(filter(lambda x : re.search(r".{6}1234\s{3}SD",x) ,somelist)))
# ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
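Since the question mentions millions of lines read from a file, here is a sketch of the same idea with the pattern compiled once and the lines filtered lazily. The function name matching_lines and the in-memory io.StringIO stand-in for a real file are assumptions made for illustration. It uses pattern.match rather than re.search so that the leading .{6} is counted from the start of each line. No timing claim is made here.

```python
import io
import re

# Compile once and reuse for every line.
PATTERN = re.compile(r".{6}1234.{3}SD")

def matching_lines(lines, pattern=PATTERN):
    """Yield lines (without the trailing newline) that match at position 0."""
    for line in lines:
        line = line.rstrip("\n")
        if pattern.match(line):
            yield line

# Stand-in for an open file; a real file object can be passed the same way.
fake_file = io.StringIO(
    "AAAA  1234   SD OXD\n"
    "AAAB  2342   DF BDD\n"
    "GWED  1234   SD TCD\n"
    "VERD  1234   IO ERT\n"
)

result = list(matching_lines(fake_file))
print(result)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
```

Because matching_lines is a generator, only one line is held in memory at a time until the caller collects the results.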

Are you sure you need complex regex? You could also use:

[x for x in somelist if x[6:10] == '1234' and x[13:15] == 'SD']
# ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
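If more than two keys have to be checked, the slicing approach can be generalized without a chain of and conditions. This is a sketch only: matches_all is a hypothetical helper, and the offsets 6 and 13 are the 0-based positions of '1234' and 'SD' in the question's list. It relies on the optional start argument of str.startswith.

```python
def matches_all(line, keys):
    """Return True if every (start_index, literal) key is found at its position."""
    return all(line.startswith(literal, start) for start, literal in keys)

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

keys = [(6, "1234"), (13, "SD")]
selected = [s for s in somelist if matches_all(s, keys)]
print(selected)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
```

Adding a third key is then just another tuple in keys, for example (0, "AAAA").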

I'm also not sure RegEx is the best solution, but this works if you do want that:

>>> import re
>>> regex = re.compile('.{6}1234   SD.*')
>>> regex.findall("\n".join(somelist))
['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']

Comments
  • yes, one compiled regex - that's exactly what I was looking for. getting close. but rather than \s, I'd like to make it general, something like position 6:10 matches pattern1 + anything + position 13:15 matches pattern2 + position... (I only care about such specific positions). So my exact question here is how to do the anything (not just whitespace; dot seems to mean 'anything' but I did not make it work) and how to combine those positions with the anythings.
  • ok, these worked: filter(lambda x: re.search(r'.{6}1234.{3}SD', x), somelist) for matching '1234' and 'SD' at said positions; and filter(lambda x: re.search(r'.{0}AAAA.{9}SD', x), somelist) for matching 'AAAA' and 'SD' at 0:4 and 13:15 respectively. Basically, I misunderstood the number in the second pair of curly braces. I thought it meant the start position of the second pattern, while it actually means the 'distance' (number of characters) from the end of the first pattern. That's the key point.
  • this worked fine. why is it bad to use regex? is it slower or just because of readability? the reason I would like to use regex is to be concise and to avoid for loops (of course lambda also seems essentially a for loop). for example, if I want to match quite a few "keys" for each element, then I have to do a lot of and if... for regex, I can just define various of search combinations with re.compile() and use them when needed. but you know better and I'd just take advice from you experienced guys.
  • It's not necessarily bad to use regex, but in the specific example you gave in the question, this option seemed much easier to me. I don't know your specific use case, so it may well be that RegEx becomes more readable there. But in your question that did not seem to be the case. If you need and if logic, you might also consider writing a function that selects which patterns to look for. But again, it's extremely hard to judge what is the best option, since I don't know the entire case.
  • Suppose the result should be: ['AAAA 1234 SD OXD', 'AAAA 2353 SD MKX'], how would you combine two parts separated by anything (in this case, '^AAAA + anything + .{13,15}SD.*')? To make it more general, I'd just specify fixed positions for each "matching key" (something like anything + {position1}pattern1 + anything + {position2}pattern2 + anything + {position3}pattern3 + ...). I tried '.{0,4}AAAA.{13,15}SD.*' and that did not work.
  • sorry. my original question was not clear enough. but it's basically an 'each element must match several patterns at respective position' question. that's what I meant by comparing the 'match A and B and ...' as opposed to 'match A or B or ...' in my original question. And that's why I was more interested in using regex...
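On the attempt '.{0,4}AAAA.{13,15}SD.*' mentioned in the comments: in a regex, {m,n} means "repeat the preceding item between m and n times", not "positions m to n", so .{13,15} asks for 13 to 15 arbitrary characters after 'AAAA'. One way to write each key with its absolute start position, without computing gaps, is to use one lookahead per key. The sketch below is an alternative technique, not what the answers above used. It assumes 0-based positions and matching from the start of the string with match().

```python
import re

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

# Each (?=...) lookahead is checked from position 0 and consumes nothing,
# so every key is stated with its own absolute offset: A AND B.
both = re.compile(r"(?=.{6}1234)(?=.{13}SD)")
selected = [s for s in somelist if both.match(s)]
print(selected)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']

# 'AAAA' at position 0 and 'SD' at position 13.
aaaa_sd = re.compile(r"(?=AAAA)(?=.{13}SD)")
selected2 = [s for s in somelist if aaaa_sd.match(s)]
print(selected2)  # ['AAAA  1234   SD OXD', 'AAAA  2353   SD MKX']
```

Further keys can be added as additional lookaheads in any order, since none of them moves the match position.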