Capturing repeating subpatterns in Python regex

While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a bit more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

The re module doesn't support repeated captures (the third-party regex module does):

>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I'd go with splitting the repeated subpatterns later. It leads to simple, readable code; see, for example, the code in @Li-aung Yip's answer.
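For readers limited to the standard library, the result of m.captures(4) can be approximated in two steps: capture the whole repeated run once, then scan that run. This is only a sketch; the names EMAIL and domain_parts are made up for illustration, and the pattern is the simplified one from the question, not a real email validator.

```python
import re

# The repetition sits inside a non-capturing group, so group 3 holds the
# entire run ('.something.edu.tr') instead of only its last repetition.
EMAIL = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def domain_parts(address):
    """Return every '.label' repetition after the first domain label, or None."""
    m = EMAIL.match(address)
    if m is None:
        return None
    # Second pass: split the captured run into its individual repetitions.
    return re.findall(r'\.\w+', m.group(3))

print(domain_parts('yasar@webmail.something.edu.tr'))
# ['.something', '.edu', '.tr']
```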

This will work:

>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But it's limited to a maximum of six subgroups. A better way to do this would be:

>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that regexps are fine as long as the email addresses are simple, but there are many kinds of valid addresses that this pattern will fail on. See this question for a detailed treatment of email address regexes.

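You can fix the problem of (\.\w+)+ only capturing the last match by using ((?:\.\w+)+) instead. The outer parentheses capture the whole repeated run as one group, while (?:...) groups without capturing.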
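A short comparison of the two patterns, using the address from the question. The variable names are illustrative only.

```python
import re

s = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: only the last repetition is kept.
last_only = re.search(r'@\w+(\.\w+)+', s)
print(last_only.group(1))   # .tr

# Capturing group around the repeat: the whole run is kept as one string.
whole_run = re.search(r'@\w+((?:\.\w+)+)', s)
print(whole_run.group(1))   # .something.edu.tr

# The run can then be split into its pieces.
parts = whole_run.group(1).lstrip('.').split('.')
print(parts)                # ['something', 'edu', 'tr']
```

Note that the second pattern still yields a single string, not a list, so a splitting step is needed afterwards.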

This is what you are looking for:

>>> import re

>>> s="yasar@webmail.something.edu.tr"
>>> r=re.compile(r"\.\w+")
>>> m=r.findall(s)

>>> m
['.something', '.edu', '.tr']
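One caveat with a bare findall: it scans the whole string, so dots in the local part are picked up as well. A possible workaround, sketched here with a hypothetical helper named domain_suffixes, is to scan only the text after the '@'.

```python
import re

DOT_LABEL = re.compile(r"\.\w+")

def domain_suffixes(address):
    """Return the '.label' pieces that follow the '@', or [] if there is no '@'."""
    _, at, domain = address.partition("@")
    if not at:
        return []
    return DOT_LABEL.findall(domain)

address = "william.adama@galactica.caprica.fleet.mil"

# Scanning the whole string also returns '.adama' from the local part.
print(DOT_LABEL.findall(address))
# ['.adama', '.caprica', '.fleet', '.mil']

# Scanning only the domain avoids that.
print(domain_suffixes(address))
# ['.caprica', '.fleet', '.mil']
```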

Comments
  • Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is however supported by the third-party regex module.
  • @ToddOwen But, isn't this now possible in 2.7? I don't know when it became possible. But, the answer from stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module.
  • @MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses).
  • @ToddOwen Got it, thank you, that is a helpful clarification!
  • Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of \1, \2, \3 etc. change depending on how many times you matched (\.\w+)?
  • @Li-aung Yip: \1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as a replacement pattern and call m.captures() in it.
  • In your example, the meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr.
  • @Li-aung Yip: m.groups() above explicitly shows what \4 is.
  • The meaning hasn't changed: \4 is m.group(4) whatever it is.
  • For abbreviations (if you've lower-cased): re.sub(r'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)
  • Thanks. Adding parentheses allowed me to match a repeated subpattern, but then there was a group in the match containing only the last repetition of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes the problem.
  • Thank you @TimSwast, this was exactly the comment and reference I needed!
  • This doesn't check for the yasar@webmail part. As such, it could easily pick up false positives wherever something other than an email address contains multiple period-separated words.
  • OP has clearly written that this is just an example and what he is trying to do is more complicated. Hence, my answer.
  • Yes, but the problem is that your solution won't work even on the simplified version of the problem OP gave. Your solution is trivially simple for anyone with even the most basic understanding of RegEx. All other answers are more complicated because this is a genuinely non-trivial problem to solve.
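
On the replacement-pattern question raised in the comments, here is a small sketch using only the standard re module. The helper reverse_labels is hypothetical and exists just to show a function used as the replacement; it re-scans group 2 rather than relying on a captures() method, which re does not have.

```python
import re

PATTERN = re.compile(r'([.\w]+)@((\w+)(\.\w+)+)')
s = 'yasar@webmail.something.edu.tr'

# In a template string, \4 is simply m.group(4): the last repetition.
print(PATTERN.sub(r'\4', s))
# .tr

def reverse_labels(m):
    # Group 2 holds the full domain, so every label can be recovered from it.
    labels = m.group(2).split('.')
    return m.group(1) + '@' + '.'.join(reversed(labels))

# A function replacement can work with all repetitions, not just the last.
print(PATTERN.sub(reverse_labels, s))
# yasar@tr.edu.something.webmail
```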