How to tell if one regular expression matches a subset of another regular expression?

I'm just wondering if it's possible to use one regular expression to match another, that is some sort of:

['a-z'].match(['b-x'])
True

['m-n'].match(['0-9'])
False

Is this sort of thing possible with regex at all? I'm doing work in python, so any advice specific to the re module's implementation would help, but I'll take anything I can get concerning regex.

Edit: Ok, some clarification is obviously in order! I definitely know that normal matching syntax would look something like this:

expr = re.compile(r'[a-z]*')
string = "some words"
expr.match(string)
<sRE object blah blah>

but I'm wondering whether a regular expression can be matched against other, narrower expressions. In the syntactically incorrect version I tried to explain with above, any letter from b-x would always be a subset (match) of any letter from a-z. I know just from trying that this isn't something you can do by calling the match method of one compiled expression on another compiled expression, but the question remains: is this at all possible?

Let me know if this still isn't clear.

I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:

  1. Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
  2. Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.

However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.
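For mathematically pure regexps that have already been turned into DFAs, the inclusion test itself is small. The sketch below is not the minimize-and-compare procedure described above; it swaps in an equivalent check that walks the product of the two DFAs and looks for a state pair where A accepts and B does not. The `(start, accepting, transitions)` triple and the name `is_subset` are assumptions made for illustration, and the two DFAs are hand-built for `55*` and `5*`.

```python
from collections import deque


def is_subset(dfa_a, dfa_b, alphabet):
    """Return True if every string accepted by dfa_a is accepted by dfa_b.

    Each DFA is a hypothetical (start, accepting_set, transitions) triple,
    where transitions maps (state, symbol) -> state. A missing transition
    means a move to a dead (rejecting) state, represented here by None.
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        # Counterexample: some string reaches a state where A accepts, B rejects.
        if a in accept_a and b not in accept_b:
            return False
        for sym in alphabet:
            next_a = delta_a.get((a, sym))
            if next_a is None:
                continue  # A is dead on this path, so no counterexample here
            nxt = (next_a, delta_b.get((b, sym)))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hand-built DFAs over the alphabet {"4", "5"}.
DFA_55_STAR = (0, {1}, {(0, "5"): 1, (1, "5"): 1})  # 55*
DFA_5_STAR = (0, {0}, {(0, "5"): 0})                # 5*

print(is_subset(DFA_55_STAR, DFA_5_STAR, "45"))  # True
print(is_subset(DFA_5_STAR, DFA_55_STAR, "45"))  # False
```

The hard part in practice remains what the paragraph above says: getting from a real-world regex to a DFA in the first place.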

But what are you trying to do? Perhaps there's an easier approach...?

In addition to antinome's answer:

Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?) to (x|), (x+) to (xx*), character classes like [a-d] to their corresponding union (a|b|c|d) etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.
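As a small illustration of this kind of rewriting, here is a sketch that expands the body of a simple character class into the corresponding union. The helper name `expand_class` is hypothetical, and it only handles plain characters and `x-y` ranges (no negation, no escapes); a real converter would work on a parsed syntax tree as described above.

```python
import re


def expand_class(spec):
    """Expand a simple class body such as 'a-d' into '(a|b|c|d)'.

    Only plain characters and 'x-y' ranges are handled; negated classes
    and escape sequences are outside the scope of this sketch.
    """
    chars = []
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            lo, hi = spec[i], spec[i + 2]
            chars.extend(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            chars.append(spec[i])
            i += 1
    return "(" + "|".join(re.escape(c) for c in chars) + ")"


print(expand_class("a-d"))  # (a|b|c|d)

# The rewritten forms accept the same single characters as the originals.
for ch in "abcdez5":
    assert bool(re.fullmatch("[a-d]", ch)) == bool(re.fullmatch(expand_class("a-d"), ch))
```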

Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.

Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).
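To make the non-regular case concrete, here is a short example using Python's `re` module. The pattern `(a+)b\1` matches strings of the form a^n b a^n, which requires remembering how many a's were seen; that language is a standard example of one no finite automaton can recognize. The helper name `matches` is just for illustration.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of a's
# have to be the same length.
pattern = re.compile(r"(a+)b\1")


def matches(s):
    return pattern.fullmatch(s) is not None


print(matches("aba"))     # True
print(matches("aabaa"))   # True
print(matches("aabaaa"))  # False
```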

Verifying the answer by "antinome" using two regexes, 55* and 5*:

REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

[Here we've assumed that each regex must match the whole string, i.e. 55* is not implicitly .*55*.* and 5* is not implicitly .*5*.* - This is why 5* does not match "4"]

REGEX_A can have an NFA as below:
  {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
           {B} -----------------epsilon --------> {E} 
                          {C} <--- epsilon ------ {E}
REGEX_B can have an NFA as below:
  {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
  {A} --------------epsilon -----------> {D} 
                 {B} <--- epsilon ------ {D}
Now we can derive the NFA and DFA of (REGEX_A|REGEX_B) as below:
  NFA (accepting states: D and F):
  {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                              {state C} ---epsilon --> {state D} 
                                              {state C} <---epsilon -- {state D}
  {state A}  ---epsilon --> {state E} ---5--> {state F}
                            {state E} ---epsilon --> {state F} 
                            {state E} <---epsilon -- {state F}

  NFA -> DFA (the "5" column is the epsilon closure, then a move on 5, then the epsilon closure again):

       |   5          |  epsilon*
   ----+--------------+--------
    A  |  C,D,E,F     |   A,B,E,F
    B  |  C,D         |   B
    C  |  C,D         |   C,D
    D  |  C,D         |   C,D
    E  |  E,F         |   E,F
    F  |  E,F         |   E,F

                    5(epsilon*)
    -------------+---------------------
        A,B,E,F  |  C,D,E,F 
        C,D,E,F  |  C,D,E,F 

    Finally the DFA for (REGEX_A|REGEX_B) is:
         {A,B,E,F}--5--->{C,D,E,F}
                         {C,D,E,F}---5--> {C,D,E,F}

         Note: {A,B,E,F} is the start state (the epsilon closure of A). Both states contain F, so both are accepting, and minimization merges them into a single accepting state with a loop on 5.
Similarly DFA for REGEX_A (55*), whose NFA accepts in state E, is:
       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,E  |   A
    B  | C,D,E  |   B,C,E
    C  | C,D,E  |   C
    D  | C,D,E  |   C,D,E
    E  | C,D,E  |   C,E


            5(epsilon*)
   -------+---------------------
       A  |  B,C,E  
   B,C,E  |  C,D,E
   C,D,E  |  C,D,E

    {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                    {C,D,E}--5--->{C,D,E}
Note: {A} is the start state. {B,C,E} and {C,D,E} both contain E, so both are accepting. Minimization merges them, leaving two states: a non-accepting start state and one accepting state with a loop on 5.
Similarly DFA for REGEX_B (5*), whose NFA accepts in state D, is:
       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,D  |   A,B,D
    B  | B,C,D  |   B
    C  | B,C,D  |   B,C,D
    D  | B,C,D  |   B,D


            5(epsilon*)
   -------+---------------------
   A,B,D  |  B,C,D  
   B,C,D  |  B,C,D

    {A,B,D} ---- 5 -----> {B,C,D}
                          {B,C,D} --- 5 ---> {B,C,D}
Note: {A,B,D} is the start state (the epsilon closure of A). Both states contain D, so both are accepting, and minimization merges them into a single accepting state with a loop on 5.
Conclusions:
Minimal DFA of REGEX_A|REGEX_B is identical to the minimal DFA of REGEX_B 
      -- implies REGEX_A matches a subset of what REGEX_B matches.
Minimal DFA of REGEX_A|REGEX_B is NOT identical to the minimal DFA of REGEX_A 
      -- implies REGEX_B does NOT match a subset of what REGEX_A matches (for example, REGEX_B matches "" but REGEX_A does not).
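
As a sanity check on the 55* / 5* example, the relationship can also be tested by brute force with Python's `re` module. This is only a bounded search over short strings, not a proof, and the function name `find_counterexample`, the alphabet "45", and the length limit are all assumptions chosen for illustration.

```python
import re
from itertools import product


def find_counterexample(regex_a, regex_b, alphabet, max_len):
    """Return a string matched by regex_a but not regex_b, or None.

    Only strings over `alphabet` up to `max_len` characters are tried,
    so a None result is evidence of inclusion, not a proof.
    """
    a, b = re.compile(regex_a), re.compile(regex_b)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if a.fullmatch(s) and not b.fullmatch(s):
                return s
    return None


print(find_counterexample("55*", "5*", "45", 6))        # None
print(repr(find_counterexample("5*", "55*", "45", 6)))  # ''
```

The second call returns the empty string, which 5* matches and 55* does not, so 5* is not a subset of 55*.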

You should do something along these lines:

re.match(r"\[[b-x]-[b-x]\]", "[a-z]")

The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.
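
A quick run shows what this pattern does to the text of a character class. Note that it inspects the regex source as an ordinary string, so it only recognizes this one literal shape; the variable names below are illustrative.

```python
import re

# Matches the literal text "[", a letter b-x, "-", a letter b-x, "]".
pattern = re.compile(r"\[[b-x]-[b-x]\]")

inside = pattern.match("[c-w]") is not None   # both endpoints lie in b-x
outside = pattern.match("[a-z]") is not None  # "a" and "z" lie outside b-x

print(inside)   # True
print(outside)  # False
```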

If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.
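
A minimal sketch of that compile test, assuming all you need is a yes/no answer (the helper name `is_valid_regex` is hypothetical):

```python
import re


def is_valid_regex(source):
    """Return True if `source` compiles as a Python regular expression."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True


print(is_valid_regex("[a-z]*"))  # True
print(is_valid_regex("[a-z"))    # False
```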

It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.
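
For the narrow case in the question, where each pattern matches exactly one character, the subset test can be brute-forced with `re` alone by comparing the sets of characters each pattern accepts. This is a sketch under that assumption only; the helper names and the choice of `string.printable` as the universe of characters are illustrative, and it does not generalize to patterns that match longer strings.

```python
import re
import string


def matched_chars(class_pattern, universe=string.printable):
    """Return the set of single characters that the pattern fully matches."""
    rx = re.compile(class_pattern)
    return {c for c in universe if rx.fullmatch(c)}


def class_is_subset(inner, outer):
    """True if every character matched by `inner` is matched by `outer`."""
    return matched_chars(inner) <= matched_chars(outer)


print(class_is_subset("[b-x]", "[a-z]"))  # True
print(class_is_subset("[0-9]", "[m-n]"))  # False
```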

Comments
  • You mean will one regex provide same or subset of matches as another?
  • Matching with re is written in 2 different forms: re.match(regex, string) or re.compile(regex).match(string). Could you please correct the code you provide because what you want to achieve is unclear.
  • Each regular expression matches a set of strings (an infinite set for some regexps). Do you want to know whether the two sets overlap? Or whether the second set is a subset of the first? (I'm not sure how to do either way but I think it needs to be clarified.)
  • Do you mean regular expressions equivalence?
  • this library claims support for mathematical regexes, which would mean that you can do union on them: leafstorm/lexington
  • The extensions are not just for convenience, and they make the problem undecidable.
  • that link doesn't work for me. "You don't have permission to access /~jdonalds/331/lecture05.html on this server." :(
  • Here's an archive of the link: web.archive.org/web/20120702185839/http://www.cs.oberlin.edu/…
  • Allowing back references probably makes the stated problem undecidable, but I have no proof of this.
  • Matching the actual expression strings is too hacky (every expression needs manual work), I'd rather use a library that implements real regular expressions, like in my answer.
  • I think the idea the OP is going for here is "pattern equality", though from the example given, what the OP considers True is not truly equality, as the second pattern matches a subset of the first.
  • If you restrict yourself to actual mathematical regular expressions, it does work in theory and practice! See my answer!