Cancelling a long running regex match?

Say I'm running a service where users can submit a regex to search through lots of data. If the user submits a regex that is very slow (ie. takes minutes for Matcher.find() to return), I want a way to cancel that match. The only way I can think of doing this is to have another thread monitor how long a match is taking and use Thread.stop() to cancel it if necessary.

Member variables:

long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;

Matcher thread:

try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}

Monitor thread:

synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop();
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}

I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe since I'm controlling where ThreadDeath is thrown via synchronisation and handle it and the only damaged objects could be my Pattern and Matcher instances which will be discarded anyway. I think this breaks Thread.stop() because I'm not rethrowing the error, but I don't really want the thread to die, just abort the find() method.

I've managed to avoid using these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there any better way to do this?

From Heritrix: (crawler.archive.org)

/**
 * CharSequence that noticed thread interrupts -- as might be necessary 
 * to recover from a loose regex on unexpected challenging input. 
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

Wrap your CharSequence with this one and Thread interrupts will work ...

evaluate a regular expression and timeout after a specified time period if the regular expression hasn't completed yet. Or a way for a user to cancel a long running regular expression. I was thinking there might be a technique I could use to evaluate regular expressions in a thread or another process launched via

With a little variation it is possible to avoid using additional threads for this:

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

Thanks a lot to dawce for pointing me to this solution in answer to an unnecessary complicated question !

The.NET Framework offers CancellationToken class to cancel these long running tasks. You pass a CancellationToken to a Task, which then periodically monitors the token to see whether cancellation is requested. The CancellationToken is used in the asynchronous Task. The CancellationTokenSource is used to signal that the Task should cancel itself.

Another workaround would be to limit the region of the matcher, then call find(), repeating until the thread is interrupted or a match is found.

Cancellation can also be used in situations where we want to time out long-running tasks or cancel parallel tasks. Cancellation is controlled by the CancellationToken structure. You expose cancellation tokens in the signature of cancelable async methods, enabling them to be shared between the task and caller.

Maybe what you need is a new lib which implements the NFA algorithm.

The NFA algorithm is hundreds times faster than the algorithm which is used by Java standard library.

And the Java std lib is sensitive to the input regexp, which may make your problem happen -- some input make the CPU run for years.

And the timeout can be set by the NFA algorithm through the steps it uses. It is effective than the Thread solution. Trust me I use thread timeout to a relative problem, it is horrible for performance. I finally fix the problem by modify the main loop of my algorithm implement. I insert some check point to the main loop to test the time.

The detail can be found here: https://swtch.com/~rsc/regexp/regexp1.html .

As the novel coronavirus rapidly spread worldwide and concern continues to grow, a number of international sporting events have been impacted, including several professional running events.

I included a counter in order to check every n reads of charAt, in order to reduce the overhead.

Notes:

Some people stated that carAt may not be call frequently enough. I just added the foo variable in order to demostrate how much charAt is called, and that it is frequent enough. If you're going to use this in production, remove that counter, as it will decrease performance and end up overflowing long if ran in a server for long time. In this example, charAt is called 30 million times every 0.8 secs or so (not tested with proper microbenchmarking conditions, it is just a proof of concept). You can set a lower checkInterval if you want higher precission, at the cost of performance (System.currentTimeMillis() > timeoutTime is more expensive than the if clause on the long run.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows to create timeoutable regular expressions.
 *
 * Limitations: Can only throw RuntimeException. Decreases performance.
 *
 * Posted by Kris in stackoverflow.
 *
 * Modified by dgoiko to  ejecute timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 *
 */
public class RegularExpressionUtils {

    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
        System.out.print(foo);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            if (this.attemps == this.checkInterval) {
                foo++;
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

And the custom exception, so you can catch only THAT exception to avoid swalowing other RE Pattern / Matcher may throw.

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }

}

Based on Andreas' answer. Main credits should go for him and his source.

If this is happening only on one (or a few) website (s) there really is (by Internet Explorer standards) a long running script on the page and IE is doing you a favor by providing the "Stop script" option. Click on the Stop script button. Again, if only a site or two are affected

Task Cancellation. 03/30/2017; 3 minutes to read +6; In this article. The System.Threading.Tasks.Task and System.Threading.Tasks.Task<TResult> classes support cancellation through the use of cancellation tokens in the .NET Framework. For more information, see Cancellation in Managed Threads. In the Task classes, cancellation involves

The ASP.NET Core Worker Service template provides a starting point for writing long running service apps. An app created from the Worker Service template specifies the Worker SDK in its project file: <Project Sdk="Microsoft.NET.Sdk.Worker"> To use the template as a basis for a hosted services app:

Secondly, and more importantly, the best running headphones should give you a secure and comfortable fit, whether they're over-ear headphones, bone conduction, or wireless in-ear buds.

Comments
  • Personally, I think allowing users to submit a regex as a search criteria is a bad idea. Programmers maybe, but not end users...
  • Certainly you should expect to get DoSed if you accept arbitrary regexs.
  • Not all code is exposed to a public network where you have to worry about DoS.
  • It would be slightly faster if you moved the exception bit out of charAt, although the real problem is likely to be inefficient patterns rather than large target text.
  • All backtracking regex algorithms are well known to become pathalogical if certain evil patterns are specified. As Tom says, it's unlikely that a huge regex run time would be caused by reading the input string but much more likely because of excessive backtracking in the regex algorithm. It might be that the regex could get stuck for ages before reading another character (I've seen this before in perl) so the interupt might not work for some time.
  • @Kris charAt is never get called. What could be the problem? please, see my question here stackoverflow.com/questions/17674839/…
  • I agree, this can't work. Regexp matcher copies the charsequence to a string.
  • @Benj current Java implementation of Matcher does actually call charAt again even when backtracking, even in small strings carAt is hit millions of times. Just performed a test. In 2,3 seconds using the string xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx and the pattern (x+x+)+y it called charAt 90 million times (without heating first. I asumme that testing in proper environment following the guidelines for microbenchmarking would execute it even more times in the given period).
  • +1 Suggestion: currentTimeMillis() is a pretty expensive operation. Add a counter and only call it every Nth time charAt() is called.
  • Great answer. Anyone using this will want to throw a custom exception rather than RuntimeException though.
  • Yeah, indeed it was a great solution but limited to throw only RTE. If I want to send my custom exception I think we need to create one more wrapper on top of TimeoutRegexCharSequence class.
  • This solution doesn't really solve the problem, what if the charAt takes a long time? so you cannot guarantee the timout defined
  • @lssilva charAt is called millions of times per second, even in short strings. Do your own tests if you don't trust me.
  • @DGoiko I just did a test, create a method that generates a random string (geeksforgeeks.org/generate-random-string-of-given-size-in-java) and run the code:println(System.currentTimeMillis()) println(getAlphaNumericString(Int.MaxValue - 1000).charAt(900000)) println(System.currentTimeMillis()) it took almost 1min
  • @lssilva I meant reading chars foir regexp. Take code from stackoverflow.com/a/11348374/9465588 and set a counter in charAt to see how many times it is read in the desired amount of time. You can test any long-running regexp you can think of with it.