Chopping a string into a vector of fixed width character elements

I have an object containing a text string:

x <- "xxyyxyxy"

and I want to split that into a vector with each element containing two letters:

[1] "xx" "yy" "xy" "xy"

It seems like strsplit should be my ticket, but since I have no regular-expression-fu, I can't figure out how to make this function chop the string into chunks the way I want. How should I do this?

Using substring is the best approach:

substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))

But here's a solution with plyr:

library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))
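
Both snippets above hard-code a width of 2, and the substring call also assumes that nchar(x) is even. As a sketch only (chunk_fixed is a hypothetical helper name, not part of the original answer), the same substring idea can be written for any width n by deriving the end positions from the start positions. It relies on substring() clipping an end position that runs past the end of the string, and it assumes x is a single non-empty string:

```r
# Hypothetical helper, for illustration only: split one non-empty string
# into chunks of width n using vectorised substring().
chunk_fixed <- function(x, n = 2) {
  starts <- seq(1, nchar(x), by = n)
  # An end position beyond nchar(x) is clipped, so a short final
  # chunk is kept rather than dropped.
  substring(x, starts, starts + n - 1)
}

chunk_fixed("xxyyxyxy")        # "xx" "yy" "xy" "xy"
chunk_fixed("xxyyxyxyz", 3)    # "xxy" "yxy" "xyz"
chunk_fixed("abcde", 2)        # "ab" "cd" "e"
```

An empty string would make seq() fail here, so a caller that may see "" would need to handle that case separately.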

Here is a fast solution that splits the string into single characters, then pastes each odd-positioned character together with the even-positioned character that follows it.

x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
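
The trick above depends on logical-index recycling: c(TRUE, FALSE) is repeated along sst and so selects positions 1, 3, 5, ..., while c(FALSE, TRUE) selects positions 2, 4, 6, .... The sketch below spells that out and adds a guard for odd-length input, where paste0() would otherwise recycle the shorter vector. The name pairs_safe is hypothetical and the guard is an assumption about the desired behaviour (keep the last lone character), not something the original answer specifies:

```r
x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
odd_chars  <- sst[c(TRUE, FALSE)]   # characters at positions 1, 3, 5, 7
even_chars <- sst[c(FALSE, TRUE)]   # characters at positions 2, 4, 6, 8
paste0(odd_chars, even_chars)       # "xx" "yy" "xy" "xy"

# Without a guard, an odd-length string recycles the even characters:
# paste0(c("a", "c", "e"), c("b", "d")) gives "ab" "cd" "eb".
pairs_safe <- function(x) {
  sst <- strsplit(x, "")[[1]]
  first  <- sst[c(TRUE, FALSE)]
  second <- sst[c(FALSE, TRUE)]
  length(second) <- length(first)   # pads with NA when x has odd length
  second[is.na(second)] <- ""       # the lone last character stays alone
  paste0(first, second)
}

pairs_safe("abcde")                 # "ab" "cd" "e"
```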

Benchmark Setup:

library(microbenchmark)

GSee <- function(x) {
  sst <- strsplit(x, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

Shane1 <- function(x) {
  substring(x, seq(1,nchar(x),2), seq(2,nchar(x),2))
}

library("plyr")
Shane2 <- function(x) {
  laply(seq(1,nchar(x),2), function(i) substr(x, i, i+1))
}

seth <- function(x) {
  strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
}

geoffjentry <- function(x) {
  idx <- 1:nchar(x)  
  odds <- idx[(idx %% 2) == 1]  
  evens <- idx[(idx %% 2) == 0]  
  substring(x, odds, evens)  
}

drewconway <- function(x) {
  c<-strsplit(x,"")[[1]]
  sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
}

KenWilliams <- function(x) {
  n <- 2
  sapply(seq(1,nchar(x),by=n), function(xx) substr(x, xx, xx+n-1))
}

RichardScriven <- function(x) {
  regmatches(x, gregexpr("(.{2})", x))[[1]]
}

Benchmark 1:

x <- "xxyyxyxy"

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x)
)

# Unit: microseconds
#               expr      min        lq    median        uq      max neval
#            GSee(x)    8.032   12.7460   13.4800   14.1430   17.600   100
#          Shane1(x)   74.520   80.0025   84.8210   88.1385  102.246   100
#          Shane2(x) 1271.156 1288.7185 1316.6205 1358.5220 3839.300   100
#            seth(x)   36.318   43.3710   45.3270   47.5960   67.536   100
#     geoffjentry(x)    9.150   13.5500   15.3655   16.3080   41.066   100
#      drewconway(x)   92.329   98.1255  102.2115  105.6335  115.027   100
#     KenWilliams(x)   77.802   83.0395   87.4400   92.1540  163.705   100
#  RichardScriven(x)   55.034   63.1360   65.7545   68.4785  108.043   100

Benchmark 2:

Now, with bigger data.

x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace=TRUE), collapse="")

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x),
  times=3
)

# Unit: milliseconds
#               expr          min            lq       median            uq          max neval
#            GSee(x)    29.029226    31.3162690    33.603312    35.7046155    37.805919     3
#          Shane1(x) 11754.522290 11866.0042600 11977.486230 12065.3277955 12153.169361     3
#          Shane2(x) 13246.723591 13279.2927180 13311.861845 13371.2202695 13430.578694     3
#            seth(x)    86.668439    89.6322615    92.596084    92.8162885    93.036493     3
#     geoffjentry(x) 11670.845728 11681.3830375 11691.920347 11965.3890110 12238.857675     3
#      drewconway(x)   384.863713   438.7293075   492.594902   515.5538020   538.512702     3
#     KenWilliams(x) 12213.514508 12277.5285215 12341.542535 12403.2315015 12464.920468     3
#  RichardScriven(x) 11549.934241 11730.5723030 11911.210365 11989.4930080 12067.775651     3

How about

strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]

Basically, add a separator (here " ") and then use strsplit
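
To make the two steps visible, here is the intermediate string that gsub() produces before strsplit() sees it. The second half is a variation, not part of the original answer: it swaps [[:alnum:]] for "." so that punctuation is chunked too, and it still assumes the input contains no spaces, since the space is used as the separator:

```r
x <- "xxyyxyxy"

# Step 1: append a space after every pair of alphanumeric characters.
with_sep <- gsub("([[:alnum:]]{2})", "\\1 ", x)
with_sep                              # "xx yy xy xy "

# Step 2: split on the inserted separator; the trailing space adds no element.
strsplit(with_sep, " ")[[1]]          # "xx" "yy" "xy" "xy"

# Variation (assumption: x contains no spaces): match any two characters.
y <- "a-b-c-d-"
strsplit(gsub("(.{2})", "\\1 ", y), " ")[[1]]             # "a-" "b-" "c-" "d-"
# The [[:alnum:]] pattern finds no pair in y, so y comes back unsplit:
strsplit(gsub("([[:alnum:]]{2})", "\\1 ", y), " ")[[1]]   # "a-b-c-d-"
```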

strsplit is going to be problematic. Look at a regexp like this:

strsplit(z, '[[:alnum:]]{2}')

It will split at the right points, but nothing is left, because the matched characters are consumed as delimiters.

You could use substring & friends:

z <- 'xxyyxyxy'  
idx <- 1:nchar(z)  
odds <- idx[(idx %% 2) == 1]  
evens <- idx[(idx %% 2) == 0]  
substring(z, odds, evens)  
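
To see why the strsplit pattern above leaves nothing behind, the sketch below runs it and then contrasts it with extracting the matches instead of splitting on them, using regmatches() and gregexpr() from base R. The ".{1,2}" pattern at the end is an added assumption for odd-length input, not something taken from this answer:

```r
z <- "xxyyxyxy"

# Every pair is treated as a delimiter, so only the empty strings
# between consecutive delimiters are returned.
parts <- strsplit(z, "[[:alnum:]]{2}")[[1]]
parts                                        # "" "" "" ""

# Extracting the matches keeps the pairs themselves.
pairs <- regmatches(z, gregexpr(".{2}", z))[[1]]
pairs                                        # "xx" "yy" "xy" "xy"

# Assumption: allow a shorter final match so an odd-length tail survives.
regmatches("abcde", gregexpr(".{1,2}", "abcde"))[[1]]   # "ab" "cd" "e"
```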

Here's one way, but not using regexen:

a <- "xxyyxyxy"
n <- 2
sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))
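
Because the width lives in n, the same loop should handle other chunk sizes unchanged. The sketch below is a variation on the code above rather than a quote from it: the nine-character input is made up for illustration, and vapply() is swapped in for sapply() so that the result is guaranteed to be a character vector. substr() clips an end position past the end of the string, so a short last chunk is kept:

```r
a <- "xxyyxyxyz"   # illustrative input, nine characters
n <- 3
starts <- seq(1, nchar(a), by = n)

# character(1) declares that each call returns exactly one string.
chunks <- vapply(starts, function(i) substr(a, i, i + n - 1), character(1))
chunks             # "xxy" "yxy" "xyz"

# With n <- 4 the final chunk is simply shorter.
n <- 4
vapply(seq(1, nchar(a), by = n), function(i) substr(a, i, i + n - 1), character(1))
# "xxyy" "xyxy" "z"
```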

Comments
  • So you want to split the string at intervals based on a known count? strsplit() works on fixed strings or regexps, but it sounds like you want it done by length.
  • That's exactly right. I want to do it based on length. strsplit wants to match a regular expression for the delimiter, and I don't have a delimiter.
  • There is a much faster answer on Stack Overflow from two years later: http://stackoverflow.com/a/11619681/168976.
  • @wind you should make that an answer, I think. It would be a good addition to the answers.
  • str_match_all(x, ".{2}")
  • Just adding for generality that if we wanted every n characters instead of every 2, it'd be: substring(x,seq(1,nchar(x),n),seq(n,nchar(x),n))
  • Try Ralf Stubner's C++ function from his answer: stackoverflow.com/a/50999966/2371031
  • That's a sweet way of doing it as well. I think I let myself get mentally hooked on strsplit() because of how close strsplit(x,"") is to what I want.
  • How would the substring approach work if you have to chop the string after 3 characters? It looks like it will only work for 2-character chops.
  • That's exactly the hack I was coding up. Of course, I was going to do a loop instead of sapply ;)
  • You, man, are a genius! I used x <- paste0(x, strrep(" ", n - (nchar(x) %% n))), but this is far more convenient!