R: Incorrect encoding of narrow space in data frame and resulting .csv

I scraped data and received some character variables containing a narrow no-break space (Unicode U+202F). The resulting character variable shows up fine in R if it is in a vector. For example, the value of test is printed with a narrow space in the console:

test <- "variable1\u202Fvariable2"
test

The offender is format.default:

test <- "variable1\u202Fvariable2"
print(test)
[1] "variable1 variable2"
format(test)
#[1] "variable1<U+202F>variable2"

format gets called by format.data.frame which in turn is called by print.data.frame.
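To see that only the printed representation is affected, not the stored string, one can inspect the code points of the column directly. This is a minimal sketch; the names `code_points`, `has_nnbsp` and `n_chars` are made up for illustration.

```r
# stringsAsFactors = FALSE keeps the column as character on older R versions
test <- "variable1\u202Fvariable2"
DF <- data.frame(x = 1:2, test = test, stringsAsFactors = FALSE)

# Inspect the stored code points instead of relying on print()
code_points <- utf8ToInt(DF$test[1])
has_nnbsp <- 0x202F %in% code_points   # is U+202F still in the data?
n_chars <- nchar(DF$test[1])           # the narrow space counts as one character
```

If `has_nnbsp` is TRUE, the data frame still holds the original character and the `<U+202F>` seen on screen comes from formatting alone.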

A solution might be to define a character method:

format.character <- function(x, ...) x

DF <- data.frame(x = 1:5) #beware of stringsAsFactors
DF$test <- test
DF #spaces are actually thin spaces in R console
#  x                test
#1 1 variable1 variable2
#2 2 variable1 variable2
#3 3 variable1 variable2
#4 4 variable1 variable2
#5 5 variable1 variable2

Obviously, such a simple method will break functions relying on other format arguments.

OTOH, why do you care how thin spaces are printed?


For anybody having the same problem: there is a package called textclean that replaces or removes non-ASCII characters with replace_non_ascii().


One method is to replace every character outside the printable ASCII range with a space using gsub:

text <- "variable1\u202Fvariable2"
new_text <- gsub("[^\x20-\x7E]", " ", text)

Here the pattern matches any character outside the printable ASCII range, which runs from hex code 20 (SPACE) to 7E (~). The disadvantage of this method is that you might unintentionally replace more than you intend (accented letters, for example), but you can always add the characters you want to keep to the character class.

Output:

> format(text)
[1] "variable1<U+202F>variable2"

> format(new_text)
[1] "variable1 variable2"
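If the catch-all pattern is too aggressive, a narrower variant is to replace only the narrow no-break space and leave every other non-ASCII character alone. A minimal sketch, where `replace_narrow_nbsp` is a hypothetical helper name and the sample string is invented:

```r
# Hypothetical helper: swap only U+202F for a regular space.
# fixed = TRUE treats the pattern as a literal string, not a regex.
replace_narrow_nbsp <- function(x) gsub("\u202F", " ", x, fixed = TRUE)

text <- "variable1\u202Fvariable2 caf\u00E9"   # invented sample with an accented letter
cleaned <- replace_narrow_nbsp(text)

# The accented letter survives; only the narrow space was replaced
still_has_nnbsp <- grepl("\u202F", cleaned, fixed = TRUE)
```

Note that this still changes the scraped text, so it is only suitable when a regular space is an acceptable substitute.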





Comments
  • What's wrong with using gsub, or regex for that matter?
  • It is error prone and a workaround rather than a solution.
  • I do not care how they are printed, but the observations in the resulting csv file should be exactly the scraped text, i.e. "variable1 variable2", not "variable1<U+202F>variable2".
  • Well, you are obviously in encoding hell. I don't know how to fix the file export.
  • Would it be possible to post reproducible code, including the scraping?
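On the file export raised in the comments: one commonly suggested workaround, not verified against the asker's setup, is to write the UTF-8 bytes verbatim so that R does not translate the strings to the native encoding first. The sketch below writes a single column by hand and then checks the raw bytes of the file; a real data frame would also need proper quoting and separators, which are left out here.

```r
# Sketch only: one header line plus one value, no CSV quoting handled
x <- c("test", "variable1\u202Fvariable2")
path <- tempfile(fileext = ".csv")

con <- file(path, open = "wb")                  # binary mode, no newline translation
writeLines(enc2utf8(x), con, useBytes = TRUE)   # useBytes = TRUE skips re-encoding
close(con)

# Read the file back as raw bytes and look for U+202F encoded as UTF-8 (E2 80 AF)
bytes <- readBin(path, what = "raw", n = file.size(path))
nnbsp <- as.raw(c(0xE2, 0x80, 0xAF))
has_escape <- any(bytes == charToRaw("<"))      # TRUE would indicate a "<U+202F>" escape
```

Checking the bytes rather than printing the file avoids being misled by the same console formatting that caused the confusion in the first place.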