Fuzzy Regular Expressions

In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance with great results to make my code less vulnerable to spelling mistakes.

Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2.

This could be done by generating all strings the regex can match (in this case 100 × 12) and finding the best match, but that doesn't seem practical.

Do you have any ideas on how to do this efficiently?

I found the TRE library, which seems to be able to do exactly this: fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion, and substitution, though, not transposition. But I guess that works OK.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March
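
For reference, this is roughly how the Python bindings from the linked article are used. The sketch below is written from memory against that article; in particular, reading the edit cost off the match object as m.cost is an assumption that should be checked against the bindings' documentation.

import tre

# Allow up to 3 errors in total (insertions, deletions, substitutions).
fz = tre.Fuzzyness(maxerr=3)
pt = tre.compile(r"TV Schedule for \d\d (Jan|Feb|Mar)", tre.EXTENDED)

for line in ["TV Schedule for 10 Jan", "T Schedule for 10. Jan"]:
    m = pt.search(line, fz)
    if m:
        print(line, "->", m.cost)   # assumed attribute name for the total edit cost
    else:
        print(line, "-> no match within 3 errors")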

Thanks a lot for all your suggestions.



I just use the regex module: 'Alternative regular expression module, to replace re.' It provides the familiarity of re but includes options for fuzzy matching, along with several other improvements on re.

For Windows binaries, see this resource.
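
As a sketch of what this looks like for the pattern from the question: the {e<=3} suffix on a group permits up to three errors, the BESTMATCH flag asks for the best rather than the first acceptable match, and fuzzy_counts on the match object reports substitutions, insertions and deletions. The error budget of 3 is an arbitrary choice here.

import regex

pattern = regex.compile(r"(?:TV Schedule for \d\d (?:Jan|Feb|Mar)){e<=3}", regex.BESTMATCH)

for line in ["TV Schedule for 10 Jan", "T Schedule for 10. Jan"]:
    m = pattern.fullmatch(line)
    if m:
        subs, ins, dels = m.fuzzy_counts
        print(line, "->", subs + ins + dels)   # 0 for the first line, 2 for the second
    else:
        print(line, "-> no match within 3 errors")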


Here is a resource on the question you are asking; it is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do fuzzy search, tuned for a particular language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.
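
To illustrate the first of those two options, here is a rough Python sketch that rewrites the literal characters of a pattern as equivalence classes. The confusion sets are invented for the example; in practice they would come from your own data (OCR confusions, keyboard neighbours, diacritics, and so on).

import re

# Made-up confusion sets, purely for illustration.
EQUIV = {
    "0": "0Oo", "O": "0Oo", "o": "0Oo",
    "1": "1lI", "l": "1lI", "I": "1lI",
    "e": "eE3", "E": "eE3",
}

def fuzzify(literal):
    """Rewrite a literal string as a regex in which each character may be
    replaced by any member of its equivalence class."""
    parts = []
    for ch in literal:
        cls = EQUIV.get(ch)
        parts.append("[" + re.escape(cls) + "]" if cls else re.escape(ch))
    return "".join(parts)

print(fuzzify("Schedule"))                                             # Sch[eE3]du[1lI][eE3]
print(bool(re.search(fuzzify("Schedule"), "TV SchEdu1e for 10 Jan")))  # True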


Have you considered using a lexer?

I've never actually used one, so I can't be much help, but it sounds like it fits!



There is also FREJ, a Java tool and library for fuzzy (approximate) string matching and searching, with the addition of a simple regular expression mechanism.


Comments
  • The first one seems to be about standard approximate string matching? The second one seems to be about fuzzy lookups in a dictionary. That could probably be used by thinking of the regex as a 'fictionary dictionary'?
  • I think lexers are more for tokenizing than matching. If I start splitting my string, I won't be able to recognize characters moved from one token to another.
  • You may have to define your problem as a lexing/parsing problem rather than as a simple regular expression. Then you could use Levenshtein distance on the individual tokens (see the sketch after these comments).
  • I see. But the lexer link you sent seems quite deterministic. What if instead of TV Schedule for 10 Jan I get TV Schedule for Jan 10? That should have a distance of 2, since two tokens have been transposed. Maybe the lexer could identify substrings looking like numbers or months, but then TV Schedule forJan 10 or TV Schedule for 10 Jan 2010 would cause problems.
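
To make the per-token idea from the comments concrete, here is a rough Python sketch. The expected token layout and the penalty for a malformed day field are arbitrary assumptions, and, as the last comment points out, it copes poorly with transposed or merged tokens.

import re

def levenshtein(a, b):
    # Plain dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

MONTHS = ["Jan", "Feb", "Mar"]

def score(line):
    # Tokenise on whitespace and charge each token against its expected form.
    tokens = line.split()
    expected = ["TV", "Schedule", "for"]
    cost = sum(levenshtein(t, e) for t, e in zip(tokens, expected))
    rest = tokens[len(expected):]
    day = rest[0] if rest else ""
    month = rest[1] if len(rest) > 1 else ""
    cost += 0 if re.fullmatch(r"\d\d", day) else 2       # arbitrary penalty for a bad day field
    cost += min(levenshtein(month, m) for m in MONTHS)
    cost += sum(len(t) for t in rest[2:])                # trailing tokens count in full
    return cost

print(score("TV Schedule for 10 Jan"))   # 0
print(score("T Schedule for 10 Jan"))    # 1
print(score("TV Schedule for Jan 10"))   # 5 -- transposed tokens are punished hard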