Converting special characters such as Ã¼ and Ãƒ back to their original Latin alphabet counterparts in C#
I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time. It contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as Ãƒ. It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ü.
An example of the sort of string I am dealing with is
DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen
Which should equate to
Desinfektionslösungstücher für Flächen
Is there a method available in C#/.NET 4.5 that would successfully re-encode strings like the one above?
If not, what approach would be advisable?
Also, is the paragraph character ¶ in the above example string an actual paragraph character or part of some other character combination?
I have created a lookup table, shown below, in case I need to do a find and replace; however, I am unsure how complete it is.
Ã‰ -> É
â€œ -> "
â€ -> "
Ã‡ -> Ç
Ãƒ -> Ã
Ã©, 'é
Ã -> À
Ãº -> ú
â€¢ -> -
Ã˜ -> Ø
Ãµ -> õ
Ã -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€" -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã -> à
Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like UTF-8 data that was incorrectly decoded using an 8-bit encoding.
There is no built-in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.
What you can try is to encode the data and decode it using the wrong encoding again, just the other way around:
    byte[] data = Encoding.Default.GetBytes(input);
    string output = Encoding.UTF8.GetString(data);
Encoding.Default uses the current ANSI encoding for your system. You can try some different encodings there and see which one gives the best result.
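Trying several encodings can be scripted rather than done by hand. The following is a minimal sketch, assuming .NET Framework (on .NET Core and later, code page 1252 is only available after registering CodePagesEncodingProvider). The sample string and the list of candidate code pages are illustrative assumptions, not taken from the export.

```csharp
using System;
using System.Text;

class TryEncodings
{
    static void Main()
    {
        // Illustrative sample with a single layer of damage ("für" mis-decoded once).
        string input = "fÃ¼r";

        // Candidate "wrong" encodings to test; this list is an assumption, adjust as needed.
        int[] codePages = { 1252, 28591, 850 };

        foreach (int codePage in codePages)
        {
            Encoding wrong = Encoding.GetEncoding(codePage);

            // Re-encode with the suspected wrong encoding, then decode as UTF-8.
            byte[] data = wrong.GetBytes(input);
            string output = Encoding.UTF8.GetString(data);

            Console.WriteLine("{0} ({1}): {2}", codePage, wrong.WebName, output);
        }
    }
}
```

Candidates that produce replacement characters or question marks are probably not the encoding that caused the damage.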
The data is only partly unrecoverable, because the Windows-1252 encoding has 5 unassigned slots. Some modifications of Windows-1252 fill these with control characters, but those don't survive in posts on Stack Overflow. If modified Windows-1252 has been used, you can fully recover the data as long as you don't lose the hidden control characters when copying and pasting.
There is also the non-breaking space character, which is usually ignored or turned into a regular space by copy-pasting, but that's not an issue when you deal with bytes directly.
The misencoding abuse this string has gone through is:
UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252
To recover, here is an example:
    String a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
    Encoding utf8 = Encoding.GetEncoding(65001);
    Encoding win1252 = Encoding.GetEncoding(1252);
    string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));
    Console.WriteLine(result); // Desinfektionslösungstücher für Flächen
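Since the export reportedly mixes differently damaged strings, the number of round trips may vary per string. The sketch below is one possible generalization, not a definitive implementation: `MojibakeRepair.Repair` and `maxPasses` are hypothetical names, and the approach assumes that every layer of damage was UTF-8 read as Windows-1252. It uses strict encodings so that a pass which would produce invalid data throws and ends the loop instead of silently substituting characters.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Hypothetical helper: undoes up to maxPasses layers of
    // "UTF-8 bytes decoded as Windows-1252" damage.
    public static string Repair(string text, int maxPasses = 3)
    {
        // Strict encodings throw instead of substituting '?' or U+FFFD.
        Encoding win1252 = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        for (int pass = 0; pass < maxPasses; pass++)
        {
            try
            {
                string candidate = strictUtf8.GetString(win1252.GetBytes(text));
                if (candidate == text) break; // e.g. pure ASCII: nothing left to undo
                text = candidate;
            }
            catch (ArgumentException)
            {
                // EncoderFallbackException and DecoderFallbackException both derive
                // from ArgumentException: another pass is not possible, so stop here.
                break;
            }
        }
        return text;
    }

    static void Main()
    {
        Console.WriteLine(Repair("DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen"));
    }
}
```

For the sample string, two passes succeed and the third throws, because a lone ö encoded as Windows-1252 is not valid UTF-8. One caveat: a string that legitimately contains a sequence such as Ã¶ would also be "repaired", so the results are worth spot-checking.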
It's probably a UTF-8 encoded string that was read as windows-1252.
As Guffa mentioned, the data has been corrupted.
Let's take a look at the bytes: ö -> C3B6 in UTF-8
In windows-1252: C3 -> Ã, B6 -> ¶
so ö -> Ã¶
What about all these "ƒÂ"?
ƒ -> 83, Â -> C2
Honestly, I don't know why they appear, but you can try erasing them and doing some conversions as Guffa mentioned. Good luck.
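One possible explanation for the extra ƒÂ is that the text went through the same mix-up twice: the already-garbled Ã¶ was saved as UTF-8 again and read as windows-1252 again. The sketch below, which assumes .NET Framework and code page 1252, applies that mistake twice to a single ö and prints the bytes at each step.

```csharp
using System;
using System.Text;

class ShowLayers
{
    static void Main()
    {
        Encoding utf8 = Encoding.UTF8;
        Encoding win1252 = Encoding.GetEncoding(1252);

        string text = "ö";
        for (int layer = 1; layer <= 2; layer++)
        {
            byte[] bytes = utf8.GetBytes(text);   // encode correctly as UTF-8...
            text = win1252.GetString(bytes);      // ...but decode as windows-1252

            Console.WriteLine("layer {0}: {1} -> {2}",
                layer, BitConverter.ToString(bytes), text);
        }
    }
}
```

The first layer yields the bytes C3-B6 (read as Ã¶), and the second yields C3-83-C2-B6, which windows-1252 reads as Ã, ƒ, Â, ¶. How the characters display depends on the console's output encoding.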
Here you can find a more complete list:
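A lookup table of this kind can also be generated instead of typed by hand. The following is a rough sketch under the assumption that the damage is UTF-8 read as Windows-1252; `MojibakeTable.Build` is a hypothetical helper, and it only covers a single layer of damage for the characters U+00A0 to U+00FF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class MojibakeTable
{
    // Hypothetical helper: maps each mis-decoded form to the character it came from,
    // for U+00A0..U+00FF (the Latin-1 Supplement block).
    public static Dictionary<string, string> Build()
    {
        Encoding win1252 = Encoding.GetEncoding(1252);
        var table = new Dictionary<string, string>();

        for (int codePoint = 0xA0; codePoint <= 0xFF; codePoint++)
        {
            string original = char.ConvertFromUtf32(codePoint);

            // Reproduce the mistake: UTF-8 bytes decoded as Windows-1252.
            string garbled = win1252.GetString(Encoding.UTF8.GetBytes(original));
            table[garbled] = original;
        }
        return table;
    }

    static void Main()
    {
        foreach (KeyValuePair<string, string> pair in Build())
            Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
    }
}
```

Some generated keys contain invisible characters (control characters or a non-breaking space), which is one reason a byte-level round trip tends to be more robust than find and replace on pasted text.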
I've been troubled by this char problem before. Solution:
My .(cs)html file was UTF-8; I converted to UTF-8Y (UTF-8 with a BOM).