Javascript RegExp + Word boundaries + unicode characters

I am building a search feature and plan to use JavaScript autocomplete with it. I am from Finland, so I have to deal with Finnish special characters such as ä, ö and å.

When the user types text into the search input field, I try to match the text against the data.

Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

http://jsfiddle.net/7TsxB/

So how can I get those ä,ö and å characters to work with javascript regex?

I think I should use unicode codes but how should I do that? Codes for those characters are: [\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]

=> äÄåÅöÖ

There appears to be a problem with the word boundary \b in JavaScript regular expressions: it fails to match at the start of a word whose first character is outside the ASCII range.

Instead of using \b, try using (?:^|\\s):

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Did not work with \b, works with (?:^|\s)
var searchterm = "äl";

// Did not work with \b, works with (?:^|\s)
//var searchterm = "ää";

// Worked with \b and still works
//var searchterm = "wi";

if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

Breakdown:

(?: Parentheses () form a capture group in a regex. Parentheses that start with a question mark and colon (?: form a non-capturing group; they just group the terms together.

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

So instead of using \b, which matches word boundaries but does not treat non-ASCII characters as word characters in JavaScript, we use a non-capturing group which matches the beginning of a string OR whitespace.
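The approach above can be sketched as a small reusable function. This is a minimal sketch, not the answer's own code: the names escapeRegExp and wordStartsWith are hypothetical, and escaping the search term is an added assumption so that user input such as "c++" or "." is matched literally.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true if `term` appears at the start of the string
// or directly after whitespace (case-insensitive).
function wordStartsWith(title, term) {
  const pattern = new RegExp("(?:^|\\s)" + escapeRegExp(term), "i");
  return pattern.test(title);
}

const title =
  "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(wordStartsWith(title, "äl")); // true
console.log(wordStartsWith(title, "ää")); // true
console.log(wordStartsWith(title, "wi")); // true
console.log(wordStartsWith(title, "kk")); // false ("kk" only occurs mid-word)
```

The "g" flag is left out on purpose: test() only needs a yes/no answer, and a global regex carries lastIndex state between calls if it is ever reused.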

The \b assertion in JavaScript RegEx is really only useful with plain ASCII text. \b is a zero-width assertion that matches at the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.

This makes \b and the \w shorthand largely useless for any language that uses letters outside ASCII.

\s should work for what you want to do, provided that search terms are only delimited by whitespace.
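If whitespace really is the only delimiter, the words can also be compared without \b at all. The following is a rough sketch under that assumption; findWordsStartingWith is a hypothetical name, and punctuation attached to a word (for example a leading quote) is not handled.

```javascript
// Hypothetical helper: split on whitespace and compare word prefixes directly,
// which sidesteps \b and regex escaping of the search term.
function findWordsStartingWith(title, term) {
  const needle = term.toLowerCase();
  return title
    .split(/\s+/)
    .filter((word) => word.toLowerCase().startsWith(needle));
}

const title =
  "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(findWordsStartingWith(title, "äl")); // ["älkää"]
console.log(findWordsStartingWith(title, "t"));  // ["this", "tämä"]
console.log(findWordsStartingWith(title, "kk")); // []
```

Returning the matching words rather than a boolean can be convenient for autocomplete, since the caller can highlight or rank them.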

This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters. Using the XRegExp library you can implement a valid \b boundary by expanding this:

XRegExp('(?=^|$|[^\\p{L}])')

The expanded pattern is 4000+ characters long, but it seems to perform quite well.

Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string, or for a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.
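The same zero-width idea can be sketched without XRegExp in engines that support ES2018 Unicode property escapes (\p{...} with the "u" flag) and lookbehind. This is an alternative to the XRegExp pattern above, not its API; wholeWordRegExp and escapeRegExp are hypothetical names, and the choice of letters, marks, digits and underscore as "word characters" is an assumption.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: match `term` only as a whole word, treating any
// Unicode letter, combining mark, digit or underscore as a word character.
function wholeWordRegExp(term) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp(
    "(?<!" + wordChar + ")" + escapeRegExp(term) + "(?!" + wordChar + ")",
    "giu"
  );
}

const sentence = "¿dónde está la alcaldesa?";

console.log(sentence.match(wholeWordRegExp("está"))); // ["está"]
console.log(sentence.match(wholeWordRegExp("es")));   // null
```

Because both lookarounds are zero-width, the match contains only the term itself and no surrounding whitespace or punctuation.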

I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.

I noticed something really weird with \b when using Unicode:

/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)

/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)

It appears that meaning of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.

In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. (The hyphen goes last in the class so that it is a literal rather than the start of a range such as /-_. Make your list of symbols more comprehensive than mine, though.)
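The apparently reversed results follow from the definition of \b: a position is a boundary when exactly one of its two neighbours matches \w, and \w covers only [A-Za-z0-9_]. A small sketch makes this visible; wordBoundaryPositions is a hypothetical helper that recomputes the positions by hand, and ä is written as \u00E4 to avoid any ambiguity about its encoding.

```javascript
// \w only covers [A-Za-z0-9_], so "ä" counts as a non-word character.
const isAsciiWordChar = (ch) => ch !== undefined && /^\w$/.test(ch);

// Hypothetical helper: list the indexes where \b would match, following its
// definition: exactly one side of the position is a \w character.
function wordBoundaryPositions(str) {
  const positions = [];
  for (let i = 0; i <= str.length; i++) {
    if (isAsciiWordChar(str[i - 1]) !== isAsciiWordChar(str[i])) {
      positions.push(i);
    }
  }
  return positions;
}

console.log(wordBoundaryPositions("pop"));      // [0, 3]
console.log(wordBoundaryPositions("p\u00E4p")); // [0, 1, 2, 3]
console.log(/\b\u00E4/.test("p\u00E4p"));       // true
```

In "päp" the ä is surrounded by boundaries on both sides, which is why /\bä/ succeeds and /\Bä/ fails there.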

Comments
  • @Walkerneo: \b means "word boundary" in a regex; the backslash is escaped here because it's in a string.
  • @apsillers, Thanks, weird that I'd not seen that before :/
  • I use the \b because I want to match at the beginning of each word.
  • As you see, Javascript is stuck in the idiotic 1960’s-style ASCII-only mentality. It does not meet even the most basic conformance requirements needed for Level 1’s "Basic Unicode Support" per UTS#18 on Unicode Regular Expressions. Trying to do real Unicode text-processing work in Javascript is an awful joke, and a cruel one, too: it cannot be done. The XRegExp plugin mentioned below is necessary but not sufficient for these purposes.
  • Newcomers beware: This cannot be done in regexp. Not with \b, not with \s, not with XRegExp, not with lookaheads or lookarounds. Believe me, I've tried it all, and everything broke in some or other way. The only reliable way I've found that up until now works is encoding the unicode string back to ascii and perform an ascii only regexp search/replace with \b as originally intended. See here: stackoverflow.com/a/10590188/1329367
  • "try this" isn't a solution. Give some information about why the suggested regex works. What does (?:^|\\s) really do? You don't explain this solution at all.
  • This is NOT a correct solution. (?:^|\\s) is not a zero-width assertion like \b is, and will consume characters from the match. A positive lookahead would be a better idea ((?=^|\\s)) but would only work after the match, as lookbehind is still not supported. Also, word boundaries are not just spaces and string boundaries, but a ton of other characters.
  • Is there any reason not to include $ (end of string) in the regex? I.e. (?:^|\s|$)
  • The proposed regexp doesn't have the same behavior when the match is at the beginning of a string or after a whitespace. When it matches at the beginning the matched text is returned, however when it matches after a whitespace it also returns the whitespace as part of the match, even though the capture is done with the colon. Test code (executed in Firefox console): let str1 = "un ejemplo"; let str2 = "ejemplo uno"; let reg = /(?:^|\s)un/gi; str1.match(reg); // ["un"] str2.match(reg); // [" un"]
  • This also matches partial string matches. '¿dónde está la alcaldesa?': es and está are matched, which is bad. Only está should be matched. \\b is supposed to be helpful with full-word boundaries.
  • +1, but \b is not a character class shorthand like \w and \s, it's a zero-width assertion like \A, $, and lookarounds.
  • \b and \B aren't Unicode-aware in JavaScript, so they consider ä a non-alphanumeric character and therefore see a word boundary between p and ä.
  • This is a great idea, and the only thing that worked for me. Instead of QQ you can use a control string of ___ which is a bit safer and still ascii, and instead of encodeURI you can leverage javascript's native escape/unescape methods, but otherwise it does the job.