Convert non-ASCII chars from ASCII-8BIT to UTF-8
I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.
Here is an example of some offending text:
Cancer Res; 71(3); 1-11. ©2011 AACR.\n
That copyright symbol, with its bytes expanded, looks like this:
Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n
Ruby tells me that string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:
incompatible character encodings: ASCII-8BIT and UTF-8
I can strip the copyright code out with a regex to produce this:
Cancer Res; 71(3); 1-11. ??2011 AACR.\n
But how can I get a copyright symbol (and various other symbols such as greek letters) converted into the same symbols in UTF-8? Surely it is possible...
I see references to using force_encoding, but that does not work for me.
I realize there are many other people with similar issues but I've yet to see a solution that works.
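One way to check what is actually in the string is to relabel a copy and ask Ruby whether its bytes are valid UTF-8. A minimal sketch, using the sample bytes from above:

```ruby
# The scraped text arrives labeled ASCII-8BIT (raw bytes).
str = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n".b  # .b forces ASCII-8BIT

copy = str.dup.force_encoding('UTF-8')  # relabel a copy; the bytes are untouched
p copy.valid_encoding?  #=> true here, so the bytes were UTF-8 all along
p copy                  # the copyright symbol now prints as (c)
```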
This works for me:
```ruby
#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
```
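One caveat worth adding (my note, not part of the answer above): `force_encoding` mutates the receiver and only relabels the bytes; it never converts them. If the bytes are not valid UTF-8 to begin with, the relabeled string simply reports itself as invalid:

```ruby
good = "\xC2\xA9".b  # valid UTF-8 bytes for the copyright sign, labeled ASCII-8BIT
bad  = "\xA9".b      # a lone ISO-8859-1 copyright byte

p good.force_encoding('UTF-8').valid_encoding?  #=> true
p bad.force_encoding('UTF-8').valid_encoding?   #=> false -- needs transcoding, not relabeling
```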
There are two possibilities:
The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In which case you just need to tell Ruby that the data is already UTF-8 using force_encoding.
For example "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data. And "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.
The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary; it isn't a real encoding), then tell Ruby to transcode it.
For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data:

```ruby
"\xA9".force_encoding('ISO-8859-1')
```

And this would demonstrate that you can get Ruby to transcode that to UTF-8:

```ruby
"\xA9".force_encoding('ISO-8859-1').encode('UTF-8')
```
I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:
```ruby
doc = open(DATA_URL)
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))
```
I believe that was Ruby 1.8.7; I'm not sure how things stand with Ruby 1.9.
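For what it's worth, Iconv was deprecated in Ruby 1.9 and later removed from the standard library, so on current Rubies the equivalent is `String#encode` with an explicit source encoding. A rough sketch of the same Windows-1253 conversion (the byte values here are my own example, not from the original script):

```ruby
# Transcode Windows-1253 (Greek) bytes to UTF-8 without Iconv.
raw  = "\xCA\xE1\xEB\xE7\xEC\xDD\xF1\xE1".b   # "Καλημέρα" in Windows-1253
utf8 = raw.encode('UTF-8', 'Windows-1253')    # encode(destination, source)
p utf8  #=> "Καλημέρα"
```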
I've been having issues with character encoding, and the other answers have been helpful but didn't work for every case. Here's the solution I came up with: it forces the encoding when possible, and transcodes with '?' replacements when not:
```ruby
def encode(str)
  encoded = str.force_encoding('UTF-8')
  unless encoded.valid_encoding?
    encoded = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
  end
  encoded
end
```
force_encoding works most of the time, but I've encountered some strings where that fails. Strings like this will have invalid characters replaced:
```ruby
str = "don't panic: \xD3"
str.valid_encoding?
#=> false
str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
#=> "don't panic: ?"
str.valid_encoding?
#=> true
```
Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code works for you like you need it to. Once I come up with version 2 I'll update this answer.
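One possible direction for that revision (my suggestion, not the original author's): Ruby 2.1+ ships `String#scrub`, which replaces invalid bytes in one call and avoids mutating the input. A sketch of an equivalent helper (the name `encode_utf8` is mine):

```ruby
# Replace any bytes that are invalid in UTF-8 with '?' using String#scrub (Ruby 2.1+).
def encode_utf8(str)
  str.dup.force_encoding('UTF-8').scrub('?')
end

p encode_utf8("don't panic: \xD3".b)  #=> "don't panic: ?"
p encode_utf8("©2011 AACR")           #=> "©2011 AACR" (already valid, unchanged)
```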
- How are you pulling text from the remote sites? Scraping pages? Please show some sample code, including the HTTP client you are using, and whether you are parsing the pages with Nokogiri, Hpricot, or REXML. This problem could be a result of how you are retrieving the page, and/or how you are parsing it. Once we know you're pulling the content in a data-safe manner, we can help you convert the data between code sets.
- Really simple code: open-uri and Nokogiri, e.g. `doc = Nokogiri::XML(open(url))`, then `doc.css(...).text` to pull out the relevant blocks of text.
- Please show sample code. Is the file you are retrieving HTML or XML? Nokogiri does care about the difference when parsing. Also, provide some URLs, because every site on the internet is different.
- "I see references to using force_encoding but this does not work." What does "does not work" mean? Does it raise an error? Does Ruby segfault? Does your computer catch on fire? Does it replace the contents of the string with the lyrics to Yankee Doodle Dandy? Details, please! :)
- This might result in an `invalid byte sequence in UTF-8` error. I would suggest you use …
- That works for me too, but other strings don't. For example: str = "Diario El d\xEDa Bolivia" will not convert to "Diario El día Bolivia".
- That's weird; the `"\xC2\xA92011 AACR"` snippet returns UTF-8 for me: `"©2011 AACR" #<Encoding:UTF-8>`
- @MikeR Do you have an encoding magic comment at the top of your file?
- @Phrogz nope, I just opened an irb session (I'm using ruby-2.2.1 on ubuntu) and copy pasted those 2 lines.
- I was getting this kind of error with Rails + SQL Server. Solved it by setting `encoding: ISO-8859-1` in database.yml and using `"latin string".encode("UTF-8")`.
- Perfect. #2 solved my issue, also pulling via Ruby/DBI from SQL Server. @Lucas Renan: Thanks for the heads up on rails/database.yml. I may add Rails to the site later.
- Thank you! None of the above were handling "\x96" for me; it would still blow up. `Iconv.conv('utf-8', "WINDOWS-1253", str)` worked perfectly.
- You may also need to set Iconv's `transliterate` value to true. ruby-doc.org/stdlib-1.9.2/libdoc/iconv/rdoc/…