JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?
I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the resulting utf-8 valid ascii.
So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encode that and return it, or should it escape all those non-ascii characters and return pure ascii?
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape…
In addition to UTF-8 being very common in general, as already pointed out, it is the default encoding that the JSON specification explicitly defines, unless one of the other Unicode encodings (UTF-16, UTF-32) is used. In fact, the first specification did not allow the use of non-Unicode encodings such as Latin-1 (ISO-8859-1).
I had a problem there. When I JSON encode a string with a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".
Then PHP's json_decode() fails if it finds "é", so for Firefox, Opera, Safari and Chrome, I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox use their native JSON object; other browsers use json2.js.
If you really can't use a Unicode encoding, check that there is wide browser support… A Unicode-based encoding such as UTF-8 can support many languages and…
This is not entirely necessary, since a JSON document can validly contain UTF-8 sequences of characters. The library should allow replacement of this encoding function if the user knows the decoding side can handle UTF-8 encoded JSON.
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
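As a rough illustration of that rule, here is a minimal Python sketch (the sample string is made up) showing that a serializer can leave non-ASCII characters alone and still apply the escapes the RFC requires for the quotation mark, the reverse solidus, and control characters:

```python
import json

# Hypothetical sample: a quote, a backslash, a control character (U+0001), and a non-ASCII é.
text = 'say "hi" \\ \x01 caf\u00e9'

# ensure_ascii=False keeps non-ASCII characters as they are;
# the escapes that JSON requires are still applied.
encoded = json.dumps(text, ensure_ascii=False)
print(encoded)

# The encoded form round-trips back to the original string.
assert json.loads(encoded) == text
```

Only the quote, the backslash, and the control character end up escaped; the é passes through untouched.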
HTML: When serving pages to desktop browsers, use UTF-8. JSON: Outgoing JSON data should always be encoded in UTF-8…
unicode-escape is not necessary: you could use json.dumps(d, ensure_ascii=False).encode('utf8') instead. And it is not guaranteed that json uses exactly the same rules as the unicode-escape codec in Python in all cases, i.e., the result might or might not be the same in some corner case.
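For comparison, a small sketch of the two output styles with Python's standard json module (the payload is a made-up example). Both are valid JSON and decode to the same value; the escaped form is simply pure ASCII, which is also valid UTF-8:

```python
import json

data = {"name": "\u00e9"}  # hypothetical payload containing é

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: characters are kept, then encoded as UTF-8 bytes for the wire.
raw = json.dumps(data, ensure_ascii=False).encode("utf-8")

assert escaped == '{"name": "\\u00e9"}'
assert raw == b'{"name": "\xc3\xa9"}'

# Both forms decode to the same data.
assert json.loads(escaped.encode("utf-8")) == json.loads(raw) == data
```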
Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
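One possible way to follow the byte order mark rules in Python, as a sketch only (the helper names decode_json_bytes and encode_json_bytes are made up for illustration): decode with the 'utf-8-sig' codec so an incoming BOM is tolerated, and encode with plain 'utf-8' so none is ever written.

```python
import codecs
import json

def decode_json_bytes(payload: bytes):
    """Parse UTF-8 JSON bytes, tolerating an optional leading byte order mark."""
    # 'utf-8-sig' strips a leading BOM if present and otherwise behaves like 'utf-8'.
    return json.loads(payload.decode("utf-8-sig"))

def encode_json_bytes(value) -> bytes:
    """Serialize to UTF-8 bytes; the plain 'utf-8' codec never adds a BOM."""
    return json.dumps(value, ensure_ascii=False).encode("utf-8")

with_bom = codecs.BOM_UTF8 + b'{"ok": true}'
assert decode_json_bytes(with_bom) == {"ok": True}
assert decode_json_bytes(b'{"ok": true}') == {"ok": True}
assert not encode_json_bytes({"ok": True}).startswith(codecs.BOM_UTF8)
```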
Yet the UTF-8 encoding somehow squeezes Unicode code points into much smaller spaces by using something called "variable-width encoding". In fact, it manages to represent all 128 characters of US-ASCII in just one byte each, which looks exactly like real ASCII, so you can interpret lots of ASCII text as if it were UTF-8 without doing anything to it. Neat trick.
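A quick way to see the variable-width behaviour for yourself, sketched in Python with a few arbitrarily chosen characters:

```python
# Sample characters: ASCII letter, é (U+00E9), euro sign (U+20AC), emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]
assert widths == [1, 2, 3, 4]

# Pure ASCII text produces byte-for-byte identical output in both encodings.
ascii_text = "plain ascii"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```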
The pretty picture also lists all of the legitimate escape sequences within a JSON string: \" \\ \/ \b \f \n \r \t and \u followed by four hex digits. Note that, contrary to the nonsense in some other answers here, \' is never a valid escape sequence in a JSON string. It doesn't need to be, because JSON strings are always double-quoted.
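If you want to check this against a real parser, here is a small sketch using Python's json module (the is_valid_json helper is just for illustration). It feeds the parser each escape the JSON grammar defines and then tries \', which should be rejected:

```python
import json

# JSON source texts, each a string holding exactly one escape sequence,
# mapped to the character the escape stands for.
valid = {
    '"\\""': '"',
    '"\\\\"': '\\',
    '"\\/"': '/',
    '"\\b"': '\b',
    '"\\f"': '\f',
    '"\\n"': '\n',
    '"\\r"': '\r',
    '"\\t"': '\t',
    '"\\u00e9"': '\u00e9',
}
for source, expected in valid.items():
    assert json.loads(source) == expected

def is_valid_json(source: str) -> bool:
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(source)
        return True
    except json.JSONDecodeError:
        return False

# Backslash-apostrophe is not a JSON escape, so the parser rejects it.
bad = '"' + "\\'" + '"'
assert not is_valid_json(bad)
```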
Since 2009, UTF-8 has been the most common encoding for the World Wide Web. The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML, UTF-8 is the recommendation from the WHATWG for HTML and DOM specifications, and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8.
(Only ASCII characters are encoded with a single byte in UTF-8.) UTF-8 is the most widely used way to represent Unicode text in web pages, and you should always use UTF-8 when creating your web pages and databases. But, in principle, UTF-8 is only one of the possible ways of encoding Unicode characters.
- "all JSON decoders can handle UTF-8" While this is true of browsers, just because the standard requires it doesn't mean all software decoding JSON supports UTF-8.
- "All JSON decoders can handle UTF-8" is literally true. If something can't accept UTF-8, it's not a JSON decoder. It may be similar to a JSON decoder, but it definitely isn't one.
- I guess that depends on what definition of JSON decoder you're using, but fair point :)
- The reason RFC 8259 specifies UTF-8 support as mandatory is that it's what the world standardized on. Previous obsolete specs defined strings as Unicode but didn't specify which encoding; implementations standardised on UTF-8 anyway and the updated spec reflects that.
- UTF-8 support isn't specified as mandatory in that RFC for any particular software as far as I can tell. The only mention of UTF-8 is that it must be used as the encoding for JSON exchanged outside of a closed system. This does not imply that all JSON decoders (a language not used in the RFC) must support UTF-8.
- Probably you meant
- If IE is failing to decode that, it's a bug in whatever JSON decoder you're using. All JSON decoders must successfully decode the encoded form, or they're not a JSON decoder. As for your issue with json_decode() with the é unescaped, it's possible that the text you're feeding it isn't UTF-8. JSON decoders always assume UTF-8, even the PHP implementation, even though PHP doesn't normally assume UTF-8 in many other functions. There are other character encodings which can include an é unescaped and look identical on screen, but which aren't UTF-8. Encoding in \uXXXX form is a workaround to this.
- Just saying: JSON can legally come in any Unicode encoding (UTF-8, UTF-16 BE/LE, UTF-32 BE/LE, with or without byte order mark). And since ASCII is a subset of UTF-8, it can also come in ASCII. Whether parsers accept UTF-32 for example, I don't know.
- That is correct, and parsers aren't required to support anything other than UTF-8. From the spec: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). Implementations MUST NOT add a byte order mark to the beginning of a JSON text."
- @thomasrutter The spec you quoted is old. The current spec says: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability. Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text."
- If you read that quote you provided you'll see that you are not required to escape all unicode characters, only a few special characters. But you are required to encode the results (preferably with utf-8). So the question is: "Why bother escaping normal unicode characters if you're utf-8 encoding?"
- Also, an ascii encoded string is a pure subset of utf-8. If I use json's escaping for all non-ascii characters, the result is ascii -- and therefore utf-8. Various json libraries (like python simplejson) have modes to force ascii results. I presume for a reason, like perhaps execution in browsers.
- You only bother escaping normal unicode characters in contexts where they're metacharacters, like strings. (The RFC chunk I quoted is about strings; sorry, wasn't clear about that.) You don't need to do ASCII output all the time; I'd think that's more for debugging with broken browsers.
- It should be noted that the above is PHP, since the question is in no way PHP-specific and only talks about web service which also may not use PHP (as the older ones of our readers may still remember…)
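The json_decode() comment above (text that looks identical on screen but isn't UTF-8) can be reproduced in Python. This is only a sketch, assuming the problematic input was Latin-1 (ISO-8859-1), which is the encoding PHP's utf8_encode() converts from; parse_as_utf8 is a made-up helper name:

```python
import json

document = '{"name": "\u00e9"}'  # JSON text with an unescaped é

utf8_bytes = document.encode("utf-8")
latin1_bytes = document.encode("latin-1")  # same text on screen, different bytes

def parse_as_utf8(payload: bytes):
    """Decode the bytes strictly as UTF-8, then parse the JSON text."""
    return json.loads(payload.decode("utf-8"))

assert parse_as_utf8(utf8_bytes) == {"name": "\u00e9"}

# The Latin-1 bytes are not valid UTF-8, so decoding fails before parsing starts.
try:
    parse_as_utf8(latin1_bytes)
    latin1_parsed = True
except UnicodeDecodeError:
    latin1_parsed = False
assert not latin1_parsed

# Fix 1: transcode Latin-1 to UTF-8 first.
fixed = latin1_bytes.decode("latin-1").encode("utf-8")
assert parse_as_utf8(fixed) == {"name": "\u00e9"}

# Fix 2 (the workaround): \uXXXX escapes are pure ASCII, so the bytes
# are the same whichever of the two encodings produced them.
escaped = json.dumps({"name": "\u00e9"})
assert escaped.encode("latin-1") == escaped.encode("utf-8")
```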