Removing non-ASCII characters from data files

I've got a bunch of csv files that I'm reading into R and including in a package's data/ folder in .rdata format. Unfortunately, the non-ASCII characters in the data cause the package check to fail. The tools package has two functions for detecting non-ASCII characters (showNonASCII and showNonASCIIfile), but I can't seem to locate one to remove/clean them.

Before I turn to external UNIX tools, it would be great to do this entirely in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages or functions to help me get rid of the non-ASCII characters?

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find whether there are any at all in your files, you could adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
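
Applied to a whole file, the same trick gives a rough analogue of tools::showNonASCIIfile(); the file name below is hypothetical and the source encoding is assumed to be latin1:

lines <- readLines("raw-data/mydata.csv")
bad   <- grep("I_WAS_NOT_ASCII",
              iconv(lines, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
lines[bad]  # the offending lines, ready for inspection or cleaning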

These days, a slightly better approach is to use the stringi package, whose stri_trans_general() performs general Unicode transliteration. This preserves as much of the original text as possible:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"

To drop every element that contains non-ASCII characters (borrowing code from @Hadley), you can use the xfun package together with filter() from dplyr:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")
x

x %>% 
  tibble(name = .) %>%
  filter(xfun::is_ascii(name)== T)
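
If you just want the cleaned character vector rather than a tibble, the same test works in base R; for the example above this keeps only "alex":

x[xfun::is_ascii(x)]  # is_ascii() returns one logical value per element
#> [1] "alex"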

Comments
  • Try with regular expressions, for instance the function gsub. Check ?regexp
  • You are aware that read.csv() takes an encoding argument, so you can handle these in R, at least? What specific check do the non-ASCII characters fail? Is it in R (if so, post it here) or external?
  • Any thoughts on how I can make this work with stringi? iconv("Klinik. der Univ. zu K_ln (AA\u0090R)","latin1","ASCII",sub="") => [1] "Klinik. der Univ. zu K_ln (AAR)", but stringi::stri_trans_general("Klinik. der Univ. zu K_ln (AA\u0090R)", "latin-ascii") => [1] "Klinik. der Univ. zu K_ln (AA\u0090R)"
  • stringi::stri_trans_general(x, "latin-ascii") removes some of the non-ASCII characters in my text, but not others. tools::showNonASCII reveals the non-removed characters are: zero width space, trademark sign, Euro sign, narrow no-break space. Does this mean "latin-ascii" is the wrong transform identifier for my string? Is there a straightforward way to figure out the correct transform identifier? Thanks