Sanitizing bad UTF-8 strings

My gRPC service failed to send a request due to malformed user data. It turns out the HR user data contains a bad UTF-8 string that gRPC could not encode. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitize such inputs should they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, I came up with this first naïve attempt:

func naïveSanitizer(in string) (out string) {
    for _, r := range in {
        // appending string(r) re-encodes the decoded rune as UTF-8
        out += string(r)
    }
    return
}

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15

Playground version

Is there a better or more standard way to salvage as much valid data from a bad UTF-8 string?


The reason I paused here is that while iterating the string, when the bad (3rd) character is encountered, utf8.ValidRune(r) returns true: https://play.golang.org/p/_FZzeTRLVls
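
Here's a minimal, self-contained sketch of what I'm observing (variable names are mine): ranging over the string hands the bad byte back as utf8.RuneError (U+FFFD), and utf8.ValidRune happily accepts that rune:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "Gr\351gory Smith" // the bad LDAP value; \351 (0xE9) is a bare Latin-1 byte
    for i, r := range s {
        // ranging decodes the string as UTF-8; the invalid byte at index 2
        // comes back as utf8.RuneError (U+FFFD), which ValidRune accepts
        fmt.Printf("index %2d: %q  ValidRune=%v\n", i, r, utf8.ValidRune(r))
    }
    fmt.Println("ValidString:", utf8.ValidString(s)) // false
}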

So my follow-up question is: when iterating a string one rune at a time, will the rune value always be valid, even though the underlying source string's encoding was malformed?


EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records, only 15 (fifteen), i.e. ~0.003%, return utf8.ValidString(...) == false.

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación
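
For reference, here is a rough sketch of the kind of conversion behind the "Latin-1-Fix" column above (the function name is mine). It relies on the fact that in ISO 8859-1 every byte value equals its Unicode code point, so each byte can simply be widened to a rune and re-encoded as UTF-8:

package main

import (
    "fmt"
    "strings"
    "unicode/utf8"
)

// latin1ToUTF8 widens each byte to a rune and re-encodes it as UTF-8,
// which is correct only if the input bytes really are ISO 8859-1.
func latin1ToUTF8(s string) string {
    var b strings.Builder
    for _, c := range []byte(s) {
        b.WriteRune(rune(c))
    }
    return b.String()
}

func main() {
    bad := "Gr\xe9gory"
    fixed := latin1ToUTF8(bad)
    fmt.Println(utf8.ValidString(bad))          // false
    fmt.Println(utf8.ValidString(fixed), fixed) // true Grégory
}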

You could improve your "sanitiser" by dropping invalid runes:

package main

import (
    "fmt"
    "strings"
)

func notSoNaïveSanitizer(s string) string {
    var b strings.Builder
    for _, c := range s {
        // ranging over the string decodes it as UTF-8 and yields U+FFFD for
        // every invalid byte, so skipping U+FFFD drops the bad bytes (note: a
        // literal U+FFFD already present in the input would be dropped too)
        if c == '\uFFFD' {
            continue
        }
        b.WriteRune(c)
    }
    return b.String()
}

func main() {
    fmt.Println(notSoNaïveSanitizer("Gr\351gory Smith"))
}

Playground.

The problem though is that \351 is the character é in Latin-1.

@PeterSO pointed out that it also happens to sit at the same position in Unicode's BMP, and that is correct. But Unicode is not an encoding, and your data is supposedly encoded, so I think you simply have an incorrect assumption about the encoding of your data: it's not UTF-8 but rather Latin-1 (or something compatible with regard to Latin accented letters).

So I'd verify that you really are dealing with Latin-1 (or whatever it turns out to be), and if so, golang.org/x/text/encoding provides complete tooling for re-encoding from legacy encodings to UTF-8.
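
For instance, here is a minimal sketch of that route using golang.org/x/text/encoding/charmap, assuming the source really is ISO 8859-1 (the function name is mine):

package main

import (
    "fmt"

    "golang.org/x/text/encoding/charmap"
)

// decodeLatin1 re-decodes bytes produced by an ISO 8859-1 source
// into a valid UTF-8 string.
func decodeLatin1(s string) (string, error) {
    return charmap.ISO8859_1.NewDecoder().String(s)
}

func main() {
    fixed, err := decodeLatin1("Gr\xe9gory Smith")
    if err != nil {
        fmt.Println("decode error:", err)
        return
    }
    fmt.Println(fixed) // Grégory Smith
}

The charmap package also ships Windows-1252 and the other ISO 8859 variants, should the source turn out to use one of those instead.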

(On a side note, it may simply be that you never explicitly asked your data source to provide you with UTF-8-encoded data.)

Go 1.13 introduced strings.ToValidUTF8(), so your sanitizer can simply be:

func sanitize(s string) string {
    return strings.ToValidUTF8(s, "")
}

Which I don't even think deserves its own function. Try it on the Go Playground.

If your input happens to be a byte slice, you may use the similar bytes.ToValidUTF8() function.
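
For example (raw being a hypothetical []byte read off the wire; note that the replacement is a []byte here):

clean := bytes.ToValidUTF8(raw, []byte("?"))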

Also note that if you don't want to silently discard data from your input, you may pass any replacement character (or string) to strings.ToValidUTF8(), for example:

return strings.ToValidUTF8(in, "❗")

Try this one on the Go Playground.

Fix your problem at its source: \351 is the octal value of the Unicode code point for é.

package main

import "fmt"

func main() {
    fmt.Println(string(rune(0351))) // octal 0351 == 0xE9 == U+00E9 ('é')
    fullname := "Grégory Smith" // "Gr\351gory Smith"
    fmt.Println(fullname)
}

Playground: https://play.golang.org/p/WigFZk3iSK1

Output:

é
Grégory Smith

Comments
  • I don't have a solution for you, but the problem is that using range on the string already makes an assumption that it should be valid, and so it replaces any invalid byte sequence with a U+FFFD "replacement character" rune, which is valid for the purposes of ValidRune because it can be encoded. So you need an approach which doesn't try to range over the string.
  • Based on what @hobbs said, I'd just extend your sanitizer with if c == '\uFFFD' { continue } and call it a day.
  • I don't think it's bad UTF-8; it just isn't UTF-8. I think the person's name is Grégory Smith and that is what is being sent to you. Whoever is reading it as UTF-8 is making the error. Fix that and you don't have a problem, as well as not corrupting the person's name.
  • @TomBlodget I've edited the Q with more details: it is bad UTF-8 - but thankfully a tiny percentage.
  • The data source is LDAP, and I can see valid UTF-8 strings like "Jean-Fran\303\247ois Smith" (i.e. "Jean-François Smith" playground). As I commented on @PeterSO's answer, there are 15 out of 500K records like this. I'll investigate your Latin-1 encoding theory - but with a < 0.03% error rate, I suspect bad input software.
  • @colminator, BaÑon looks weird to me. Is this a typo (an uppercased letter entered by mistake)?
  • Clearly there's bad input, given that this data isn't UTF-8 to begin with. Yes, that N looks odd. I tried a few other encoding conversions on this outlier, with no success. I think this is as close as I can get given a bad (data) hand.
  • In the real world it is rarely a good idea to silently discard data.
  • Agreed. In a worst-case scenario, I'd replace bit-rotted bytes with a '?' to leave some breadcrumb trail that something was not quite right. I think @kostix is on the right trail with his suggestion that my rogue characters are Latin-1 encoded. Will update the question shortly...
  • Yes, I agree. You could pass any character (or string) to appear in place of invalid sequences, e.g. strings.ToValidUTF8(in, "❗").
  • @icza - I wish I could have marked both answers as "accepted" - but @kostix's Latin-1 encoding suggestion allowed for even cleaner data remediation.
  • @colminator That's fine. If his answer was the most helpful to solve your problem, then his answer should be the accepted one (as the tooltip on accepting answers suggests).
  • Appreciate the insight into the code-point error, but this does not solve the problem in an automated fashion. The strings are coming in from HR LDAP data: 500K records in total. Scanning the entire LDAP tree, I've found 15 records with bad UTF-8 fields. I need a method to automatically sanitize these fields so they can be processed, i.e. I don't want to skip a record because HR screwed up.
  • And the LDAP data is being read in correctly, as I can see valid UTF-8 strings like "Jean-Fran\303\247ois Smith" (i.e. "Jean-François Smith")