Fastest way to detect which character is not valid in a specific encoding

At the moment I have an exception that tells me when the full line contains an invalid ISO-8859-1 character, but I would like to detect exactly which one it is.

I could check each and every character in the string but that would be quite inefficient.

The purpose of this is to report to the user of the tool that they wrote an invalid character such as €.

Input:

Hello fri€nd

Output:

Error in € (index 9)

Is there any fast and efficient way to achieve that?

Snippet of the actual method:

public void writeLine(String line) throws EncodingException {
    try {
        if (!Charset.forName("ISO-8859-1").newEncoder().canEncode(line)) {
            throw new EncodingException();
        }
        bufferedWriter.write(line);
        bufferedWriter.newLine();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I could check each and every char in the string but that would be quite inefficient

What do you think canEncode is doing? There's no way to check all of the characters without checking all of the characters.

If your String is really long you may see some benefit from using parallel streams:

final OptionalInt firstInvalidChar = line.chars()
    .parallel()
    .filter(ch -> !Charset.forName("ISO-8859-1").newEncoder().canEncode((char) ch))
    .findFirst();

if (firstInvalidChar.isPresent()) {
    throw new EncodingException(
        "The first invalid char is: " + (char) firstInvalidChar.getAsInt()
    );
}

If the CharsetEncoder were thread-safe you could see some performance improvement by creating a single instance rather than one per character, but since its documentation does not guarantee thread safety, we have to assume that it's not.
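
If you scan the line sequentially instead, a single encoder instance is enough, and you can also report the index the question asks for. A minimal sketch, assuming ISO-8859-1 as the target charset and the same message-taking EncodingException constructor the snippet above already assumes (checkLine is just an illustrative name):

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public void checkLine(String line) throws EncodingException {
    // Single-threaded loop, so one encoder instance can be reused safely.
    CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
    for (int i = 0; i < line.length(); i++) {
        char ch = line.charAt(i);
        if (!encoder.canEncode(ch)) {
            throw new EncodingException("Error in " + ch + " (index " + i + ")");
        }
    }
}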

You can try to use Apache Tika to detect the encoding of a String.

Example:

CharsetDetector detector = new CharsetDetector();
detector.setText(string.getBytes());
CharsetMatch match = detector.detect();

Then you can decode the bytes into a Java String using the detected charset (getString's second argument is a hint for a declared encoding, if one is known, not a target encoding):

String decoded = detector.getString(yourStr.getBytes(), null);
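
If you also want to report the detected charset, the CharsetMatch returned by detect() above can be inspected. A short sketch (these accessors come from the ICU4J detector that Tika bundles, and match may be null if nothing plausible was found); note that detection is most meaningful on the raw bytes read from the file, because string.getBytes() re-encodes the text with the platform default charset:

if (match != null) {
    String charsetName = match.getName();    // e.g. "ISO-8859-1" or "UTF-8"
    int confidence = match.getConfidence();  // heuristic score from 0 to 100
    String decoded = match.getString();      // the input bytes decoded with the detected charset
}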

Comments
  • You could split the line into two equal parts and check which part contains the error, then recursively do the same on the part with the error (sketched below, after these comments).
  • That would be divide and conquer, not a bad option indeed! @RobertKock
  • There's no need to add "Thanks" at the end of posts. See meta for more info
  • Sorry about that @Zoe I will check it for sure!
  • Seems like a really fast way of doing it and does exactly what I was asking.
  • It is more a detection and reporting of the error. Something like: Input: Hello€ Output: Error in € value
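
For reference, a hypothetical sketch of the divide-and-conquer idea from the first comment (firstInvalidIndex is an illustrative name, not code from the post): when canEncode rejects the whole line, split it and recurse into the half that still fails until a single offending character remains.

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

static int firstInvalidIndex(CharsetEncoder encoder, String s, int offset) {
    if (s.length() <= 1) {
        return offset;                                      // the single char that cannot be encoded
    }
    int mid = s.length() / 2;
    String left = s.substring(0, mid);
    if (!encoder.canEncode(left)) {
        return firstInvalidIndex(encoder, left, offset);    // error is somewhere in the left half
    }
    return firstInvalidIndex(encoder, s.substring(mid), offset + mid);  // otherwise it is in the right half
}

// Usage: only recurse once the whole line has already failed the check.
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
if (!encoder.canEncode(line)) {
    int index = firstInvalidIndex(encoder, line, 0);
    throw new EncodingException("Error in " + line.charAt(index) + " (index " + index + ")");
}

Note that each canEncode call still scans its whole substring, so this touches roughly n log n characters in total; the simple sequential scan shown earlier remains the straightforward baseline.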