re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

I ran into a puzzling behaviour with Python regular expressions. Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

Actual Output is:

['ab', 'cd']
['cd']

I'm confused as to why the second result doesn't contain 'ab' as well.

+ is a repetition quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. A repeated capture group keeps only the text from its last iteration.

You can reason about this behaviour as follows:

Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches the group against ab (positions 0 to 1), stores ab in the group, and exits the group. It then sees the + quantifier and tries the group again, this time matching cd (positions 2 to 3). Each iteration overwrites what the group captured before, so when the match ends at la the group holds only cd.
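To check this walk-through, here is a minimal sketch that inspects the match object instead of relying on findall. It uses only re.search, group() and span() from the standard re module; the variable names are made up for illustration.

```python
import re

m = re.search(r'(ab|cd)+', 'abcdla')

whole = m.group(0)        # text matched by the entire pattern
last = m.group(1)         # text held by group 1 when the match ended
whole_span = m.span(0)    # (start, end) of the entire match, end exclusive
last_span = m.span(1)     # (start, end) of the final iteration of group 1

print(whole, whole_span)  # abcd (0, 4)
print(last, last_span)    # cd (2, 4)
```

The overall match covers abcd, but group 1 only remembers the last piece it matched.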


If you want to capture all iterations, capture the repeated group instead: ((ab|cd)+) captures abcd in the outer group and cd in the inner group. Since we don't care about the inner group's matches, you can make it non-capturing: ((?:ab|cd)+) captures just abcd.
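As a rough sketch of how that could be used in practice: capture the whole run with the outer group, then, if the individual iterations are needed, split the run with a second findall. The second step is an assumption about what you might want, not something the regex does for you.

```python
import re

string = 'abcdla'

# The outer group captures the whole run; the inner group is non-capturing.
run = re.search(r'((?:ab|cd)+)', string).group(1)

# Split the captured run back into its individual iterations.
pieces = re.findall(r'ab|cd', run)

print(run)     # abcd
print(pieces)  # ['ab', 'cd']
```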

https://www.regular-expressions.info/captureall.html

From the Docs,

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
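The tag example from the quote can be tried in Python along these lines. The first pattern is the one quoted above; the second applies the same "capture the repeated group" fix described earlier. re.fullmatch (Python 3.4+) is used here only so that the whole tag has to match.

```python
import re

tags = ['!abc123!', '!123abcabc!']
results = []

for tag in tags:
    # Repeated capture group: only the last iteration survives.
    last_only = re.fullmatch(r'!(abc|123)+!', tag).group(1)
    # Capture group around the repetition: the whole label survives.
    whole_label = re.fullmatch(r'!((?:abc|123)+)!', tag).group(1)
    results.append((tag, last_only, whole_label))
    print(tag, last_only, whole_label)
```

For !abc123! this gives 123 and abc123; for !123abcabc! it gives abc and 123abcabc.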


I don't know if this will make things clearer, but let's try to picture what happens under the hood in a simple way. We are going to simulate it using match:

   import re

   # group(0) returns the whole matched string. The captured groups are returned by groups(),
   # or you can access them one by one with group(1), group(2), ... In your case there is only
   # one group, and a group holds only one piece of text at a time, so when you do this:
   string = 'abcdla'
   print(re.match('(ab|cd)', string).group(0))   # 'ab': only 'ab' is matched, and the group captures 'ab'
   print(re.match('(ab|cd)+', string).group(0))  # 'abcd': the whole run is matched, but the group captures only 'cd', the last iteration

findall matches and consumes the string at the same time. Let's imagine what happens with the regex '(ab|cd)':

      'abcdabla' ---> 1:   match: 'ab' |  capture : ab  | left to process:  'cdabla'
      'cdabla'   ---> 2:   match: 'cd' |  capture : cd  | left to process:  'abla'
      'abla'     ---> 3:   match: 'ab' |  capture : ab  | left to process:  'la'
      'la'       ---> 4:   no match    |  capture : None  | left to process:  ''

      --- final : result captured ['ab', 'cd', 'ab']  

Now the same thing with '(ab|cd)+'

      'abcdabla' ---> 1:   match: 'abcdab' |  capture : 'ab'  | left to process:  'la'
      'la'       ---> 2:   no match |  capture : None  | left to process:  ''
      ---> final result :   ['ab']  
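The two traces above can be reproduced with re.finditer, which yields a match object for every match findall would report. The helper name trace is made up for this sketch.

```python
import re

def trace(pattern, string):
    """Return (full match, group 1) for every match findall would visit."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, string)]

single = trace(r'(ab|cd)', 'abcdabla')
repeated = trace(r'(ab|cd)+', 'abcdabla')

print(single)    # [('ab', 'ab'), ('cd', 'cd'), ('ab', 'ab')]
print(repeated)  # [('abcdab', 'ab')]
```

findall returns only the second element of each pair here, because the pattern contains exactly one group.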

I hope this clears things up a little.


So, for me the confusing part was the fact that

If one or more groups are present in the pattern, return a list of groups;

docs

so it returns not the full match but only what the capture group matched. If you make the group non-capturing, re.findall('(?:ab|cd)+', string), it returns ["abcd"], as I initially expected.
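A small sketch of how the shape of the findall result changes with the number of capture groups, using the string from the question (variable names are illustrative):

```python
import re

string = 'abcdla'

no_group = re.findall(r'(?:ab|cd)+', string)    # no groups: list of full matches
one_group = re.findall(r'(ab|cd)+', string)     # one group: list of that group's text
two_groups = re.findall(r'((ab|cd)+)', string)  # two groups: list of tuples

print(no_group)    # ['abcd']
print(one_group)   # ['cd']
print(two_groups)  # [('abcd', 'cd')]
```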


Comments
  • re.findall('(ab|cd)', string) gets ['ab', 'cd']; re.findall('(ab|cd)+', string) gets ['cd'].
  • can you link to some doc making clear the fact that + only captures the last iteration, and what is a capture group?
  • @Gulzar, updated the answer. You can read about capture groups here - regular-expressions.info/refcapture.html
  • @Shashank, thanks, your reply is exactly what I need. sincerely thanks
  • There's no need to surround the whole regex with brackets. Just '(?:ab|cd)+' will work.