Regex for text files with partially incomplete nested structure

I want to parse a file with this nested format:

/begin FUNCTION

    foo
    "1.2.12 foo_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "1.2.0"

/end FUNCTION

/begin FUNCTION

    bar
    ""

/end FUNCTION

/begin FUNCTION

    urbi
    "10.15.23 urbi_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "10.15.3"

/end FUNCTION

From this I want to extract the function names, the longnames and the version number.

I do this with the following regex:

sSearch = r'/begin FUNCTION\s+(\w*)\s+"[\d\._\s]*([^"]+)*"(.*?)FUNCTION_VERSION\s+"([^"]+)"\s+/end FUNCTION'
lMatches = re.findall(sSearch, sFileContent, re.S)
dMatches = {args[0]: [args[3], args[1]] for args in lMatches if args}
print(dMatches)

This leads to:

{'foo': ['1.2.0', 'foo_long'], 'bar': ['10.15.3', '']}

The function version from urbi is wrongly assigned to bar. I do not want bar returned at all as it does not contain a function version.

How can I adapt the regex so that it releases the /begin FUNCTION occurrence before bar when /end FUNCTION is found without a preceding function version?

I would want the output to be:

{'foo': ['1.2.0', 'foo_long'], 'urbi': ['10.15.3', 'urbi_long']}

P.S. What I also find confusing is why I need to add an unnecessary (.*?) capturing group in the middle. Should it not also work with a simple .*?

You can do this using a negative lookahead as follows:

import re

with open('filename.txt') as fd:
    data = fd.read()

regex = re.compile(
    r'begin\s+FUNCTION\s+([a-zA-Z_]+)\s+'
    r'(?:"[\d.]+\d\s+([a-zA-Z_]+)")?'
    r'(?:(?:(?!/end\s+FUNCTION).)+FUNCTION_VERSION\s+"([\d.]+\d)")?',
    re.MULTILINE | re.DOTALL
)
result = {i[0]: [i[2], i[1]] for i in regex.findall(data)}
print(result)

# output
{'urbi': ['10.15.3', 'urbi_long'], 'foo': ['1.2.0', 'foo_long'], 'bar': ['', '']}

# refine result: drop empty fields
result = {k: [i for i in v if i] for k, v in result.items()}
print(result)

# output
{'urbi': ['10.15.3', 'urbi_long'], 'foo': ['1.2.0', 'foo_long'], 'bar': []}
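If entries without a version should be dropped entirely, as the question asks, one option is to make the FUNCTION_VERSION part mandatory while keeping the negative lookahead so the match can never run past /end FUNCTION. The following is only a sketch under some assumptions: the long-name string is always present (possibly empty) right after the function name, and a leading run of digits and dots in it is a version prefix to discard. The names `pattern` and `data` are just illustrative.

```python
import re

# Sample input copied from the question.
data = """/begin FUNCTION

    foo
    "1.2.12 foo_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "1.2.0"

/end FUNCTION

/begin FUNCTION

    bar
    ""

/end FUNCTION

/begin FUNCTION

    urbi
    "10.15.23 urbi_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "10.15.3"

/end FUNCTION
"""

pattern = re.compile(
    r'/begin\s+FUNCTION\s+(\w+)\s+'    # group 1: function name
    r'"(?:[\d.]+\s+)?([^"]*)"'          # group 2: long name, version prefix skipped
    r'(?:(?!/end\s+FUNCTION).)*?'       # anything, but never across /end FUNCTION
    r'FUNCTION_VERSION\s+"([^"]+)"',    # group 3: required version field
    re.DOTALL,
)

# findall returns one (name, longname, version) tuple per match.
result = {name: [version, longname]
          for name, longname, version in pattern.findall(data)}
print(result)
# {'foo': ['1.2.0', 'foo_long'], 'urbi': ['10.15.3', 'urbi_long']}
```

Because the version is required and the lookahead stops the scan at /end FUNCTION, the attempt that starts at bar fails and the search moves on to urbi. Regarding the P.S.: the middle part here is a non-capturing group, so findall yields three-element tuples. A capturing (.*?) in that position is not needed for the match itself; it only adds one more element to each tuple and shifts the indices of the later groups.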

This is one approach using Lookbehind & Lookahead.

Demo:

import re

s = """/begin FUNCTION

    foo
    "1.2.0 foo_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "1.2.0"

/end FUNCTION

/begin FUNCTION

    bar
    ""

/end FUNCTION

/begin FUNCTION

    urbi
    "10.15.3 urbi_long"

    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "10.15.3"

/end FUNCTION"""

result = {}
for i in re.findall(r"(?<=/begin FUNCTION)(.*?)(?=/end FUNCTION)", s, flags=re.DOTALL):
    val = i.strip().splitlines()
    if val:
        try:
            result[val[0]] = val[1].replace('"', "").split()
        except IndexError:  # block has a name but no second line
            result[val[0]] = []
print(result)

Output:

{'urbi': ['10.15.3', 'urbi_long'], 'foo': ['1.2.0', 'foo_long'], 'bar': []}
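The question asks for the version from the FUNCTION_VERSION field rather than from the long-name string. A variation of the same split-then-inspect idea is to cut the text into blocks first and then search each block for the fields by name, which does not depend on line positions. This is a rough sketch, not a definitive parser: `parse_functions` and the `EXTRA_TAG` line in the sample are made up for illustration, and it assumes the long name is the first quoted string in a block.

```python
import re

BLOCK_RE = re.compile(r"/begin FUNCTION(.*?)/end FUNCTION", re.DOTALL)
NAME_RE = re.compile(r"\s*(\w+)")                      # first word in the block
LONGNAME_RE = re.compile(r'"(?:[\d.]+\s+)?([^"]*)"')   # first quoted string
VERSION_RE = re.compile(r'FUNCTION_VERSION\s+"([^"]+)"')


def parse_functions(text):
    """Return {name: [version, longname]} for blocks that carry a version."""
    result = {}
    for block in BLOCK_RE.findall(text):
        name = NAME_RE.match(block)
        version = VERSION_RE.search(block)
        if not name or not version:
            continue  # skip blocks without a FUNCTION_VERSION field
        longname = LONGNAME_RE.search(block)
        result[name.group(1)] = [version.group(1),
                                 longname.group(1) if longname else ""]
    return result


sample = """/begin FUNCTION
    foo
    "1.2.12 foo_long"
    /begin DEF1
    /end DEF1
    FUNCTION_VERSION "1.2.0"
/end FUNCTION

/begin FUNCTION
    bar
    ""
/end FUNCTION

/begin FUNCTION
    urbi
    "10.15.23 urbi_long"
    FUNCTION_VERSION "10.15.3"
    EXTRA_TAG 42
/end FUNCTION
"""

parsed = parse_functions(sample)
print(parsed)
# {'foo': ['1.2.0', 'foo_long'], 'urbi': ['10.15.3', 'urbi_long']}
```

Since each field is looked up inside its own block, an additional tag after FUNCTION_VERSION does not change the result, and a version can never be picked up from a neighbouring block.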

Comments
  • It would be better if you post the desired outputs too
  • @Gurman: You are right, I have added the desired output.
  • Is THIS what you wanted?
  • @Gurman: No. That is exactly what I had. bar is returned instead of urbi.
  • @Gurman: Yes that looks good. Post an answer and I will accept it.
  • Strangely this breaks when the longname is in brackets: "1.2.0 (foo_long)". Then it does not find the occurrence any more and exhibits the same problem as in my question but in another place...
  • That is because the 2nd capture group only captures \w+, which does not include parentheses. If you change (\w+) to ([\w()]+), it should work.
  • Since I also want to enable German special characters I seem to have to change it to ?([^"]+). However, this does not strip the (maybe wrong) version number from the beginning of the text.
  • No. This still does not allow ().
  • Never mind. My example contains multiple words for the longname, which makes things even more complicated. However, I would ask a new question for this if I can't figure it out. Thanks for your help!
  • That is also an interesting solution. However, it also returns the first part of the longname string instead of the FUNCTION_VERSION field for the function version.
  • The Lookahead and Lookbehind looks good, but the function version is not taken from the function_version field but from the longname text field. Moreover, I do not feel confident with the very rigid and not content-aware parsing after the regex. What happens when there is another tag after the function_version?
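
The longname issues raised in the comments (parentheses, several words, German special characters, and the leading version number) can possibly be handled together by making the version prefix an optional non-capturing group and capturing everything else up to the closing quote. The snippet below is a small sketch of just that sub-pattern; the sample strings other than the first two are invented for illustration.

```python
import re

# Optional leading version (digits and dots) plus whitespace is skipped;
# the long name is everything that follows, up to the closing quote.
LONGNAME_RE = re.compile(r'"(?:[\d.]+\s+)?([^"]*)"')

samples = [
    '"1.2.0 foo_long"',
    '"1.2.0 (foo_long)"',
    '"10.15.23 Größe der Übertragung"',  # invented multi-word example
    '""',
]
longnames = [LONGNAME_RE.search(s).group(1) for s in samples]
print(longnames)
# ['foo_long', '(foo_long)', 'Größe der Übertragung', '']
```

One caveat of this approach: a long name that itself begins with digits followed by a space would lose that first token, because it is indistinguishable from a version prefix.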