How to detect Chinese characters with punctuation in regex?

I have noticed that there are questions on how to detect Chinese characters with regex. These are some of the questions I read on Stack Overflow:

Php check if the string has Chinese chars

detecting chinese characters in php string

And an article outside Stack Overflow:

http://www.regular-expressions.info/unicode.html - unicode script

Basically, they suggest using either \p{Han}+ or [\x{4e00}-\x{9fa5}]+.* to detect Chinese characters. Is there a way to detect Chinese punctuation marks as well?

Some examples (but not all) of Chinese punctuation marks: :?。,《》「」『』-()【】


I suggest taking a look at Zhon, a Python library that provides constants commonly used in Chinese text processing.

Luckily, hanzi.py contains a definition of a regex that should pretty much suit your needs:

#: A regular expression pattern for a Chinese sentence. A sentence is defined
#: as a series of characters and non-stop punctuation marks followed by a stop
#: and zero or more container-closing punctuation marks (e.g. apostrophe or brackets).

sent = sentence = '[{characters}{radicals}{non_stops}]*{sentence_end}'.format(
    characters=characters, radicals=radicals, non_stops=non_stops,
    sentence_end=_sentence_end)

The definition above results in the following regex*:

[〇一-鿿㐀-䶿豈-﫿𠀀-𪛟𪜀-𫜿𫝀-𫠟丽-𯨟⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*

Code Example:

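// The pattern is written with \x{...} code point escapes so that it does not depend on the file encoding.
// The /u modifier makes PCRE treat both the pattern and the subject as UTF-8.
$pattern = '/[\x{3007}\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}\x{F900}-\x{FAFF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}\x{2F00}-\x{2FD5}\x{2E80}-\x{2EF3}\x{FF02}-\x{FF0D}\x{FF0F}\x{FF1A}-\x{FF1E}\x{FF20}\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF60}\x{FF62}-\x{FF64}\x{3000}\x{3001}\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303E}\x{303F}\x{2013}\x{2014}\x{2018}\x{2019}\x{201B}-\x{201F}\x{2026}\x{2027}\x{FE4F}\x{FE51}\x{FE54}\x{00B7}]*[\x{FF01}\x{FF1F}\x{FF61}\x{3002}][\x{300D}\x{FE42}\x{201D}\x{300F}\x{2019}\x{300B}\x{FF09}\x{FF3D}\x{FF5D}\x{3015}\x{3017}\x{3019}\x{301B}\x{3009}\x{3011}]*/u';
preg_match_all($pattern, "我的中文不好。我是意大利人。你知道吗？", $matches, PREG_SET_ORDER, 0);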
var_dump($matches);
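
To make the structure of that pattern easier to see, here is a minimal Python sketch that composes a sentence regex the same way the Zhon template does. The constants CHARACTERS, NON_STOPS, STOPS and CLOSERS are hypothetical, heavily reduced stand-ins chosen for illustration; they are not the library's actual definitions, and split_sentences is an illustrative helper, not part of Zhon.

```python
import re

# Assumption: reduced stand-ins for the Zhon constants, for illustration only.
CHARACTERS = "\u3007\u4e00-\u9fff\u3400-\u4dbf"  # ideographic zero + two CJK blocks
NON_STOPS = "\uff0c\u3001\uff1a\uff1b\u300a\u300b\u300c\u300d\u201c\u201d"  # e.g. fullwidth comma, book-title marks
STOPS = "\uff01\uff1f\u3002"  # fullwidth ! and ?, ideographic full stop
CLOSERS = "\u300d\u300f\u201d\u2019\u300b\uff09\u3011"  # closing quotes and brackets

# Same shape as the template: [characters + non-stops]* [stop] [closers]*
SENTENCE = re.compile(
    "[{chars}{non_stops}]*[{stops}][{closers}]*".format(
        chars=CHARACTERS, non_stops=NON_STOPS, stops=STOPS, closers=CLOSERS
    )
)


def split_sentences(text):
    """Return every substring that looks like a Chinese sentence."""
    return SENTENCE.findall(text)


if __name__ == "__main__":
    for sentence in split_sentences("我的中文不好。我是意大利人。你知道吗？"):
        print(sentence)
```

Because each sentence must end in one of the stop characters, text that ends with an ASCII "?" instead of the fullwidth "？" is not matched by this sketch.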

If you prefer character code ranges for the relevant CJK ideograph Unicode blocks, see the Python source linked above, or take them from the JavaScript sample below:

// The `u` flag is required: code points above U+FFFF must be written as \u{...}, which only works in Unicode mode.
const regex = /[\u3007\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u{20000}-\u{2A6DF}\u{2A700}-\u{2B73F}\u{2B740}-\u{2B81F}\u{2F800}-\u{2FA1F}\u2F00-\u2FD5\u2E80-\u2EF3\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7]*[\uFF01\uFF1F\uFF61\u3002][\u300D\uFE42\u201D\u300F\u2019\u300B\uFF09\uFF3D\uFF5D\u3015\u3017\u3019\u301B\u3009\u3011]*/gmu;
const str = `我的中文不好。我是意大利人。你知道吗？`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
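
Hand-writing escapes for code points above U+FFFF is easy to get wrong, so one option is to build the character class from numeric ranges instead. The following Python sketch does that for the CJK ideograph and radical ranges quoted in this answer. The names CJK_RANGES, build_class and cjk_runs are hypothetical helpers introduced for illustration, not part of any library.

```python
import re

# Code point ranges for the CJK ideograph and radical blocks quoted above.
CJK_RANGES = [
    (0x3007, 0x3007),    # ideographic number zero
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0xF900, 0xFAFF),
    (0x20000, 0x2A6DF),  # supplementary planes: above U+FFFF
    (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F),
    (0x2F800, 0x2FA1F),
    (0x2F00, 0x2FD5),
    (0x2E80, 0x2EF3),
]


def build_class(ranges):
    """Build a regex character class from (start, end) code point pairs."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(chr(start))
        else:
            # None of these characters is special inside a class, so no escaping is needed.
            parts.append(chr(start) + "-" + chr(end))
    return "[" + "".join(parts) + "]"


CJK_RE = re.compile(build_class(CJK_RANGES) + "+")


def cjk_runs(text):
    """Return each maximal run of characters that fall in CJK_RANGES."""
    return CJK_RE.findall(text)


if __name__ == "__main__":
    print(cjk_runs("abc中文, \U00020000\U00020001 def"))
```

Python 3 strings are sequences of code points, so the supplementary-plane ranges need no special flag here, unlike the JavaScript version.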


Is there a way to detect Chinese punctuation mark as well?

This is a basic Unicode property regex to get just the punctuation. It says get all \p{Han} script characters, only if they are in the CJK block for symbols and punctuation.

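This effectively filters to the Han-script characters inside that block. Note that common marks such as 。 and 「」 carry the script property Common rather than Han, so this pattern does not match them.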

As of Unicode 10, this turns out to be these 15 characters: 々〇〡〢〣〤〥〦〧〨〩〸〹〺〻

\p{Han}(?<=\p{Block=CJK_Symbols_And_Punctuation})

If the engine you're using doesn't support Unicode properties or block names, you can use a character class covering the resulting 15 characters, in whichever escape syntax your engine accepts:

[\x{3005}\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}]

[\u3005\u3007\u3021-\u3029\u3038-\u303B]
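
Python's built-in re module is one engine without \p{...} support, so the explicit class is the form that applies there. The sketch below compiles that class and, as a separate check, uses the standard unicodedata module to test whether a character has a punctuation general category. The helper names han_symbols and is_punctuation are hypothetical and chosen for illustration.

```python
import re
import unicodedata

# The 15 Han-script characters from the CJK Symbols and Punctuation block.
HAN_SYMBOLS_RE = re.compile("[\u3005\u3007\u3021-\u3029\u3038-\u303B]")


def han_symbols(text):
    """Return every character of text that is one of the 15 Han-script symbols."""
    return HAN_SYMBOLS_RE.findall(text)


def is_punctuation(ch):
    """Return True if the general category of ch is a punctuation class (P*)."""
    return unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    print(han_symbols("人々。"))      # the iteration mark is matched, the full stop is not
    print(is_punctuation("。"))       # ideographic full stop: category Po
    print(is_punctuation("々"))       # iteration mark: category Lm, not punctuation
```

The general-category check is a different technique from the script-and-block filter above: it answers "is this character punctuation?" for any script, not only for Chinese text.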

Source: http://www.regexformat.com UCD Interface.


Most of the characters you want to match can be matched with:

[\x{FF1F}-\x{FF2D}\x{FF01}-\x{FF1E}\x{3001}-\x{30AD}]+
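
A quick way to see what these ranges cover is to run them against sample text. This Python sketch uses the same three ranges, rewritten with \u escapes; PUNCT_RANGE_RE and find_runs are hypothetical names used for illustration. It also shows that the ranges are broader than punctuation alone: fullwidth digits (U+FF10 to U+FF19) and hiragana (from U+3041) fall inside them.

```python
import re

# The three ranges from the pattern above, written with \u escapes.
PUNCT_RANGE_RE = re.compile("[\uFF1F-\uFF2D\uFF01-\uFF1E\u3001-\u30AD]+")


def find_runs(text):
    """Return each maximal run of characters covered by the ranges."""
    return PUNCT_RANGE_RE.findall(text)


if __name__ == "__main__":
    print(find_runs("你好，世界！"))  # fullwidth comma and exclamation mark
    print(find_runs("あ"))            # hiragana is matched as well
    print(find_runs("１"))            # so is a fullwidth digit
```

If only punctuation is wanted, the ranges would need to be narrowed or combined with a category check.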


[\u4e00-\u9fa5] works for me in VSCode; the other suggested solutions don't. I stumbled on the solution here, which is an online regex interpreter: http://tool.chinaz.com/regex/
