Concrete Javascript Regex for Accented Characters (Diacritics)

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question:

How can JavaScript match for accented characters (those with diacritical marks)?

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.

This was my original version, until I wanted to add diacritic support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:

Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):

var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
var regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
  • This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

My other approach was to use the . character class, to have a simpler expression:
var regex = /^.+,\s.+$/;
  • This would match for just about anything, at least in the form of: something, something. That's alright I suppose...

The last approach, which I just found might be simpler...
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
  • It matches a range of unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:

  1. The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
  2. The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
  3. The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.

    • Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.) so I don't have to worry about out-of-Latin-character-set characters
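
Two gotchas with the third pattern can be checked directly. The sketch below is a minimal illustration, not a definitive test suite: the sample names are made up, and it assumes a runtime that provides String.prototype.normalize (ES2015), which may not hold in older environments.

```javascript
// The third pattern from the question, unchanged.
const nameRe = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Precomposed accented letters inside the range match as expected.
const composed = nameRe.test("M\u00FCller, Zo\u00EB"); // "Müller, Zoë"

// Gotcha 1: U+00D7 (×) and U+00F7 (÷) sit inside \u00C0-\u017F, so they match too.
const symbols = nameRe.test("\u00D7, \u00F7");

// Gotcha 2: a decomposed accent (base letter + combining mark, e.g. e + U+0301)
// is outside the range, so the same visible name fails.
const decomposedName = "Rene\u0301, Zoe\u0308";
const decomposed = nameRe.test(decomposedName);

// Normalizing to NFC first recombines the marks into single code points.
const normalized = nameRe.test(decomposedName.normalize("NFC"));

console.log({ composed, symbols, decomposed, normalized });
// { composed: true, symbols: true, decomposed: false, normalized: true }
```

Whether the two symbols matter depends on the input you expect; the decomposed case is more likely to appear when names are pasted from other systems.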

So the real question(s): Which of these three approaches is most suited for the task? Or are there better solutions?

The easier way to accept all accents is this:

[A-zÀ-ú] // accepts lowercase and uppercase letters, most accented ones (also matches [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, plus û ü ý þ ÿ (still matches [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above but not including × ÷

See https://unicode-table.com/en/ for characters listed in numeric order.
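
To see exactly which extra characters the shorter ranges let through, you can enumerate them. This is a small sketch; asciiMatches is a helper name invented for the illustration, not part of any library.

```javascript
// Collect every ASCII character (U+0000 to U+007F) that a single-character regex matches.
function asciiMatches(re) {
  let out = "";
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch)) out += ch;
  }
  return out;
}

// Everything [A-z] matches beyond the letters themselves.
const extras = asciiMatches(/^[A-z]$/).replace(/[A-Za-z]/g, "");
console.log(extras); // [\]^_`

// The two non-letters inside À-ÿ: U+00D7 (×) and U+00F7 (÷).
const wide = /^[\u00C0-\u00FF]$/;
const strict = /^[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]$/;
const wideHits = ["\u00D7", "\u00F7"].filter((ch) => wide.test(ch)).length;
const strictHits = ["\u00D7", "\u00F7"].filter((ch) => strict.test(ch)).length;
console.log(wideHits, strictHits); // 2 0
```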

Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I'm seeing here is not diacritics, but whitespace. Some names consist of multiple words, e.g. when they include titles. So you should go with the most generic pattern, that is, allowing everything except the comma that separates the last name from the first:

/[^,]+,\s[^,]+/

But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
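
A quick check of how the comma-based pattern behaves, with and without the ^ and $ anchors. The anchored variant and the sample inputs are additions for illustration, not something this answer spells out.

```javascript
const loose = /[^,]+,\s[^,]+/;    // as written above (unanchored)
const anchored = /^[^,]+,\s[^,]+$/; // same pattern, anchored to the whole string

// Multi-word and punctuated names pass either way.
const multiWord = anchored.test("van der Berg, Anne-Marie");
const apostrophe = anchored.test("O'Donnell, Chris");

// Without anchors, the pattern only needs to match somewhere in the input,
// so a string with a second comma is still accepted.
const looseExtraComma = loose.test("a, b, c");
const anchoredExtraComma = anchored.test("a, b, c");

console.log({ multiWord, apostrophe, looseExtraComma, anchoredExtraComma });
// { multiWord: true, apostrophe: true, looseExtraComma: true, anchoredExtraComma: false }
```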

The XRegExp library has a plugin named Unicode that helps solve tasks like this.

<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
  var unicodeWord = XRegExp("^\\p{L}+$");

  unicodeWord.test("Русский"); // true
  unicodeWord.test("日本語"); // true
  unicodeWord.test("العربية"); // true
</script>

It's mentioned in the comments to the question, but it's easy to miss. I noticed it only after I submitted this answer.
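
If a library is not an option, engines that implement ES2018 Unicode property escapes accept \p{L} natively when the regex carries the u flag. The sketch below assumes such an engine; whether your target runtime qualifies is something to verify. The name-format pattern at the end is a hypothetical adaptation, not part of XRegExp or the original question.

```javascript
// Native equivalent of the XRegExp pattern: \p{L} is any Unicode letter.
// The "u" flag is required for \p{...} to be recognized.
const unicodeWord = /^\p{L}+$/u;

const results = ["Русский", "日本語", "العربية", "abc123"].map((s) => unicodeWord.test(s));
console.log(results); // [ true, true, true, false ]

// Hypothetical adaptation to the "last, first" format from the question.
// \p{M} additionally allows combining marks, so decomposed accents pass too.
const nameRe = /^[\p{L}\p{M}]+,\s[\p{L}\p{M}]+$/u;
const decomposedOk = nameRe.test("Rene\u0301, Zoe\u0308");
console.log(decomposedOk); // true
```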

How about this?

/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/

Comments
  • There seems to be no particular reason to use the more complicated regexps. Only thing about the most simple solution is, it will also match "something, something, something". You could use something like regex = /^[^,]+,\s[^,]+$/; to prevent that.
  • At a glance, the first one won't match the common name "O'Donnell, Chris" nor compound last names with a hyphen, nor multiple last names (etc.). See Falsehoods Programmers Believe About Names for just about every possible pitfall.
  • "the . atom matches anything except newlines" actually is quite exact :-)
  • If it is possible for you to use an additional library you can have a look at my answer here
  • Jongware, I actually just read that article while I was browsing SO for an answer to my question - I also completely forgot about hyphens and apostrophes and the like, I was more concerned with making it international first :P I'm glad you brought it up though! And Stema, I actually looked at that library and I avoid incorporating libraries because this is all on Google Apps Script - incorporating external libraries would be a nightmare, and I would only be using it (in this case) for one particular field... kind of overkill :P
  • It works nicely, +1, but could you elaborate on why it works?
  • @PierreHenry the - defines a range, and this technique exploits the ordering of characters in the charset to define a continuous range, making for a super concise solution to the problem
  • won't this match underscores (and the other non-word characters between Z and a)?
  • This matches at least the characters [, ], ^, and \, none of which should be included.
  • Not quite working: a few characters in this range are not accented letters (U+00D7 is the multiplication sign, for example). See unicode-table.com/en
  • Having a look at the unicode table latin block, I think you should also include \u1e00-\u1eff, so I'm doing [a-zA-Z\u00c0-\u024f\u1e00-\u1eff]
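
The suggestions in these comments (wider Latin ranges, apostrophes, hyphens, multi-word names, a single comma) can be combined into one pattern. This is only a sketch under those assumptions: the variable names are invented for the illustration, the separator set (apostrophe, hyphen, space) is a guess at what a name field might need, and the range from the last comment still includes × and ÷.

```javascript
// One letter: the Latin ranges suggested in the last comment.
// Note: \u00c0-\u024f still contains U+00D7 (×) and U+00F7 (÷).
const letter = "[a-zA-Z\\u00c0-\\u024f\\u1e00-\\u1eff]";

// One name: runs of letters, optionally joined by a single apostrophe, hyphen, or space.
const namePart = `${letter}+(?:['\\- ]${letter}+)*`;

// Full field: last name, comma, one whitespace character, first name.
const fullName = new RegExp(`^${namePart},\\s${namePart}$`);

const checks = [
  fullName.test("O'Donnell, Chris"),           // apostrophe
  fullName.test("van der Berg, Anne-Marie"),   // spaces and hyphen
  fullName.test("Nguy\u1EC5n, V\u0103n"),      // "Nguyễn, Văn" (Latin Extended)
  fullName.test("a, b, c"),                    // second comma is rejected
  fullName.test("Smith-, John"),               // dangling hyphen is rejected
];
console.log(checks); // [ true, true, true, false, false ]
```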