Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?

I logged into MariaDB/MySQL and entered:

SHOW COLLATION;

I see utf8mb4_unicode_ci and utf8mb4_unicode_520_ci among the available collations. What is the difference between these two collations and which should we be using?

You will need to read the documentation. I can't tell you which one you should use, because every project is different.

10.1.3 Collation Naming Conventions

MySQL collation names follow these conventions:

A collation name starts with the name of the character set with which it is associated, followed by one or more suffixes indicating other collation characteristics. For example, utf8_general_ci and latin1_swedish_ci are collations for the utf8 and latin1 character sets, respectively.

A language-specific collation includes a language name. For example, utf8_turkish_ci and utf8_hungarian_ci sort characters for the utf8 character set using the rules of Turkish and Hungarian, respectively.

Case sensitivity for sorting is indicated by _ci (case insensitive), _cs (case sensitive), or _bin (binary; character comparisons are based on character binary code values). For example, latin1_general_ci is case insensitive, latin1_general_cs is case sensitive, and latin1_bin uses binary code values.

For Unicode, collation names may include a version number to indicate the version of the Unicode Collation Algorithm (UCA) on which the collation is based. UCA-based collations without a version number in the name use the version-4.0.0 UCA weight keys. For example:

utf8_unicode_ci (with no version named) is based on UCA 4.0.0 weight keys (http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt).

utf8_unicode_520_ci is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt).

For Unicode, the xxx_general_mysql500_ci collations preserve the pre-5.1.24 ordering of the original xxx_general_ci collations and permit upgrades for tables created before MySQL 5.1.24. For more information, see Section 2.11.3, "Checking Whether Tables or Indexes Must Be Rebuilt", and Section 2.11.4, "Rebuilding or Repairing Tables or Indexes".

Source : https://dev.mysql.com/doc/refman/5.6/en/charset-collation-names.html
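
As a rough illustration of these naming conventions, the statement below lists the utf8mb4 collations whose names start with utf8mb4_unicode. It is only a sketch and assumes the server has the utf8mb4 character set; the exact rows returned depend on the MySQL or MariaDB version.

```sql
-- List utf8mb4 collations whose names start with "utf8mb4_unicode".
-- A name without a version (utf8mb4_unicode_ci) is based on UCA 4.0.0;
-- a name containing _520_ (utf8mb4_unicode_520_ci) is based on UCA 5.2.0.
SHOW COLLATION LIKE 'utf8mb4_unicode%';
```

The Collation column of the result can then be read using the rules quoted above: character set first, then an optional UCA version, then the sensitivity suffix.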

utf8 uses a maximum of three bytes per character, while utf8mb4 uses up to four bytes per character. The utf8 charset can store Chinese, Japanese, and Korean characters that are in the Basic Multilingual Plane (BMP), but it cannot store characters outside the BMP (for example, emoji), so it may still not be able to store all the characters that you want.
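
The byte counts can be checked on a server with a small query. This is a sketch that assumes utf8mb4 is available; the hex literals are the UTF-8 bytes of one BMP character (U+4E2D) and one character outside the BMP (U+1F600), chosen here purely for illustration.

```sql
-- CHAR_LENGTH counts characters, LENGTH counts bytes.
SELECT
  CHAR_LENGTH(CONVERT(0xE4B8AD   USING utf8mb4)) AS bmp_chars,    -- 1
  LENGTH(CONVERT(0xE4B8AD        USING utf8mb4)) AS bmp_bytes,    -- 3
  CHAR_LENGTH(CONVERT(0xF09F9880 USING utf8mb4)) AS emoji_chars,  -- 1
  LENGTH(CONVERT(0xF09F9880      USING utf8mb4)) AS emoji_bytes;  -- 4
```

The second character needs four bytes, which is why it fits in utf8mb4 but not in the three-byte utf8 character set.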

I will expand on @StuiterSlurf's answer and focus on the details of utf8mb4_unicode_ci and utf8mb4_unicode_520_ci:

As you can read here (Peter Gulutzan), there is a problem with sorting and comparing the Polish letter "Ł" (L with stroke; lower case: "ł"; HTML escapes: &#322; and &#321;). The collations behave as follows (the same applies to the utf8mb4 variants):

utf8_polish_ci      Ł greater than L and less than M
utf8_unicode_ci     Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci     Ł greater than Z

In Polish, the letter Ł comes after L and before M. Different collations will give you different sorting results. None of these collations is better or worse in itself; it depends on your needs.
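
The comparison results listed above can be checked directly. The query below is a minimal sketch, assuming the client sends its statements as UTF-8 and the server offers all three collations; the column aliases are made up for this example.

```sql
-- Make sure the string literals are interpreted as utf8mb4.
SET NAMES utf8mb4;

-- Each column is 1 if the collation treats the two letters as equal,
-- 0 otherwise. COLLATE binds to the literal and decides the comparison.
SELECT
  'Ł' = 'L' COLLATE utf8mb4_unicode_ci     AS equal_unicode,
  'Ł' = 'L' COLLATE utf8mb4_unicode_520_ci AS equal_unicode_520,
  'Ł' = 'L' COLLATE utf8mb4_general_ci     AS equal_general;
```

Going by the list above, only the utf8mb4_unicode_520_ci column should report the letters as equal, which matters for things like UNIQUE indexes and WHERE comparisons, not just ORDER BY.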

What's the difference between utf8_general_ci and utf8_unicode_ci? utf8_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many shortcuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters. Ideally you want a collation with the exact rules of the language you'll be using but, if you're mixing languages, you'll have to opt for a generic collation.

To see a bit more discussion of the actual differences, you can go to https://dev.mysql.com/worklog/task/?id=2673 and click "High Level Architecture".

What is the best collation to use for MySQL with PHP? utf8mb4_unicode_ci. The utf8 character set only supports a small subset of UTF-8 code points, about 6% of possible characters, because it only covers the Basic Multilingual Plane (BMP). If WordPress runs on MySQL version 5.6 or later, it assumes the newer collation utf8mb4_unicode_520_ci, which is based on a more recent version of the Unicode Collation Algorithm (UCA). This is fine, unless you end up moving your WordPress site from MySQL 5.6 or newer to an older, pre-5.6 version of MySQL, where that collation is not available.
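
Before moving a database to another server, it may be worth checking whether the target knows the newer collation. The statements below are a sketch of one possible approach; wp_posts is used only as an example table name, and converting a collation can change sort order and uniqueness checks, so test on a copy first.

```sql
-- On the target server: check the version and whether the collation exists.
SELECT VERSION();
SHOW COLLATION LIKE 'utf8mb4_unicode_520_ci';

-- If SHOW COLLATION returns no rows, one possible fallback is to convert
-- the table on the source server to the older UCA 4.0.0 collation before
-- exporting it. wp_posts is only an example table name.
ALTER TABLE wp_posts
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```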

utf8mb4 means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme. (The Unicode Collation Algorithm is the method used to compare two Unicode strings in a way that conforms to the requirements of the Unicode Standard.) When MySQL 5.5.3 was released, it introduced a new encoding called utf8mb4, which is the real 4-byte UTF-8 encoding that you know and love. Recommendation: if you're using MySQL (or MariaDB or Percona Server), make sure you know your encodings. I would recommend that anyone set the MySQL encoding to utf8mb4.
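
Setting the encoding can be done at the database, table, and connection level. The following is a minimal sketch; example_db and notes are hypothetical names, and the choice between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci is left to the needs of the project.

```sql
-- example_db and notes are hypothetical names used for illustration.
CREATE DATABASE example_db
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- A table can override the database default with its own collation.
CREATE TABLE example_db.notes (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;

-- Make the current connection use utf8mb4 as well.
SET NAMES utf8mb4;
```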

MariaDB supports the following character sets and collations. Note that the Mroonga Storage Engine only supports a limited number of character sets. An excerpt of the SHOW COLLATION output:

Collation               Charset  Id   Default  Compiled  Sortlen
utf8mb4_bin             utf8mb4  46            Yes       1
utf8mb4_unicode_ci      utf8mb4  224           Yes       8
utf8mb4_unicode_520_ci  utf8mb4  246           Yes       8

When you run SHOW COLLATION in MySQL or MariaDB, you will see a long list that includes both utf8mb4_unicode_ci and utf8mb4_unicode_520_ci, which is confusing. Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci: utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions.

Comments
  • I found answers on SO here and here and I got an easy to understand explanation here.
  • It definitely depends on the application you want to build. That's why you should research this at the start of your application rather than later. There are many languages with unusual letters, and each language may need a different collation.