## Chopping a string into a vector of fixed width character elements

I have an object containing a text string:

x <- "xxyyxyxy"

and I want to split that into a vector with each element containing two letters:

[1] "xx" "yy" "xy" "xy"

It seems like `strsplit` should be my ticket, but since I have no regular expression foo, I can't figure out how to make this function chop the string up into chunks the way I want. How should I do this?

Using `substring` is the best approach:

substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))

But here's a solution with plyr:

    library("plyr")
    laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))
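The `substring` call above pairs up start and end positions, so it relies on the string length being a multiple of the chunk width. A minimal sketch of a more general version follows, assuming a single non-empty input string; `chop` and its `n` argument are hypothetical names introduced here for illustration, not part of the original answer.

```r
# Hypothetical helper: split one non-empty string into chunks of `n`
# characters. The last chunk may be shorter than `n`.
chop <- function(x, n = 2) {
  starts <- seq(1, nchar(x), by = n)
  # substring() stops at the end of the string, so an end position
  # past nchar(x) is harmless.
  substring(x, starts, starts + n - 1)
}

chop("xxyyxyxy")     # "xx" "yy" "xy" "xy"
chop("xxyyxyx", 3)   # "xxy" "yxy" "x"
```

Computing the end positions from the start positions avoids building two sequences of different lengths when the string does not divide evenly.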

Here is a fast solution that splits the string into characters, then pastes together the even elements and the odd elements.

    x <- "xxyyxyxy"
    sst <- strsplit(x, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
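The same split-then-paste idea can be extended to chunks of any width by grouping the single characters before pasting. This is only a sketch, assuming a single input string; `chop_chars` and `n` are hypothetical names, and this variant has not been benchmarked.

```r
# Hypothetical helper: split into single characters, then paste each
# run of `n` characters back together.
chop_chars <- function(x, n = 2) {
  chars <- strsplit(x, "")[[1]]
  # Group index 1,1,...,2,2,...: one group per chunk of n characters.
  grp <- ceiling(seq_along(chars) / n)
  unname(vapply(split(chars, grp), paste0, character(1), collapse = ""))
}

chop_chars("xxyyxyxy")     # "xx" "yy" "xy" "xy"
chop_chars("xxyyxyx", 3)   # "xxy" "yxy" "x"
```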

**Benchmark Setup:**

    library(microbenchmark)

    GSee <- function(x) {
      sst <- strsplit(x, "")[[1]]
      paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
    }

    Shane1 <- function(x) {
      substring(x, seq(1,nchar(x),2), seq(2,nchar(x),2))
    }

    library("plyr")
    Shane2 <- function(x) {
      laply(seq(1,nchar(x),2), function(i) substr(x, i, i+1))
    }

    seth <- function(x) {
      strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
    }

    geoffjentry <- function(x) {
      idx <- 1:nchar(x)
      odds <- idx[(idx %% 2) == 1]
      evens <- idx[(idx %% 2) == 0]
      substring(x, odds, evens)
    }

    drewconway <- function(x) {
      c<-strsplit(x,"")[[1]]
      sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
    }

    KenWilliams <- function(x) {
      n <- 2
      sapply(seq(1,nchar(x),by=n), function(xx) substr(x, xx, xx+n-1))
    }

    RichardScriven <- function(x) {
      regmatches(x, gregexpr("(.{2})", x))[[1]]
    }

**Benchmark 1:**

    x <- "xxyyxyxy"
    microbenchmark(
      GSee(x),
      Shane1(x),
      Shane2(x),
      seth(x),
      geoffjentry(x),
      drewconway(x),
      KenWilliams(x),
      RichardScriven(x)
    )
    # Unit: microseconds
    #              expr      min        lq    median        uq      max neval
    #           GSee(x)    8.032   12.7460   13.4800   14.1430   17.600   100
    #         Shane1(x)   74.520   80.0025   84.8210   88.1385  102.246   100
    #         Shane2(x) 1271.156 1288.7185 1316.6205 1358.5220 3839.300   100
    #           seth(x)   36.318   43.3710   45.3270   47.5960   67.536   100
    #    geoffjentry(x)    9.150   13.5500   15.3655   16.3080   41.066   100
    #     drewconway(x)   92.329   98.1255  102.2115  105.6335  115.027   100
    #    KenWilliams(x)   77.802   83.0395   87.4400   92.1540  163.705   100
    # RichardScriven(x)   55.034   63.1360   65.7545   68.4785  108.043   100

**Benchmark 2:**

Now, with bigger data.

    x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace=TRUE), collapse="")
    microbenchmark(
      GSee(x),
      Shane1(x),
      Shane2(x),
      seth(x),
      geoffjentry(x),
      drewconway(x),
      KenWilliams(x),
      RichardScriven(x),
      times=3
    )
    # Unit: milliseconds
    #              expr          min            lq       median            uq          max neval
    #           GSee(x)    29.029226    31.3162690    33.603312    35.7046155    37.805919     3
    #         Shane1(x) 11754.522290 11866.0042600 11977.486230 12065.3277955 12153.169361     3
    #         Shane2(x) 13246.723591 13279.2927180 13311.861845 13371.2202695 13430.578694     3
    #           seth(x)    86.668439    89.6322615    92.596084    92.8162885    93.036493     3
    #    geoffjentry(x) 11670.845728 11681.3830375 11691.920347 11965.3890110 12238.857675     3
    #     drewconway(x)   384.863713   438.7293075   492.594902   515.5538020   538.512702     3
    #    KenWilliams(x) 12213.514508 12277.5285215 12341.542535 12403.2315015 12464.920468     3
    # RichardScriven(x) 11549.934241 11730.5723030 11911.210365 11989.4930080 12067.775651     3

How about

strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]

Basically, add a separator (here `" "`) and *then* use `strsplit`.

`strsplit` is going to be problematic. Look at a regexp like this:

strsplit(z, '[[:alnum:]]{2}')

It will split at the right points, but nothing is left, because `strsplit` discards whatever the pattern matches.

You could use substring & friends

    z <- 'xxyyxyxy'
    idx <- 1:nchar(z)
    odds <- idx[(idx %% 2) == 1]
    evens <- idx[(idx %% 2) == 0]
    substring(z, odds, evens)
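To make the `strsplit` problem above concrete, the following sketch contrasts splitting on the pattern with extracting the pattern's matches. It reuses the same `z` and the same regular expression; the `pieces` variable is introduced here only for illustration.

```r
z <- "xxyyxyxy"

# strsplit() treats every match as a delimiter and throws it away, so a
# string made up entirely of matches leaves only empty pieces.
pieces <- strsplit(z, "[[:alnum:]]{2}")[[1]]
all(pieces == "")   # TRUE

# regmatches() with gregexpr() returns the matched text itself instead.
regmatches(z, gregexpr("[[:alnum:]]{2}", z))[[1]]
# [1] "xx" "yy" "xy" "xy"
```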

Here's one way, but not using regexen:

    a <- "xxyyxyxy"
    n <- 2
    sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))

##### Comments

- so you want to split the string at intervals based on a known count, strsplit() works on fixed strings or reg exps, but it sounds like you want it done by length?
- that's exactly right. I want to do it based on length. strsplit wants to match a regex expression for delimiter and I don't have a delimiter.
- There is a much faster answer in stackoverflow.com two years later. http://stackoverflow.com/a/11619681/168976.
- @wind you should make that an answer, I think. It would be a good addition to the answers.
`str_match_all(x, ".{2}")`

- Just adding for generality that if we wanted every `n` characters instead of every 2, it'd be: `substring(x,seq(1,nchar(x),n),seq(n,nchar(x),n))`

- try Ralf Stubner's c++ function from his answer stackoverflow.com/a/50999966/2371031
- that's a sweet way of doing it as well. I think I let myself get mentally hooked on strsplit() because of how close strsplit(x,"") is to what I want.
- how would the substring work if you have to chop the string after 3 characters? looks like it will only work for 2 character chops.
- That's exactly the hack I was coding up. of course I was going to do a loop instead of sapply ;)
- You man, are a genius! I used `x <- paste0(x, strrep(" ", n - (nchar(x) %% n)))`, but this is far more convenient!