How can I extract these multiple regex groups in R

I have string inputs in the following format:

my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20", "FACT1text with spaces:FACT20", "FACT14:FACT20", "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

I would like to extract all the "FACT"s and the first number following FACT. So the result from this example would be:

c("FACT1", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2 FACT3")

Alternatively, the result could be a list, where each element of the list is a vector with 1 up to 3 items.

What I got so far is:

gsub("(FACT[1-3]).*?:(FACT[1-3]).*", '\\1 \\2', my.strings)
# [1] "FACT11"      "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2"
# [7] "FACT1 FACT2"

It kinda looks good, except for the "FACT11" for the first element instead of "FACT1" (dropping the second "1"), and missing the "FACT3" for the last element of my.strings. But adding another group to gsub somehow messes the whole thing up.

gsub("(FACT[1-3]).*?:(FACT[1-3]).*?:(FACT[1-3]).*?", '\\1 \\2 \\3', my.strings)
# [1] "FACT11"                       "FACT11:FACT20"                "FACT1sometext:FACT20"
# [4] "FACT1text with spaces:FACT20" "FACT14:FACT20"                "FACT1textAnd1312:FACT2etc"
# [7] "FACT1 FACT2 FACT31"

So how can I properly extract the groups?
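
As a rough check of why the three-group attempt leaves most inputs untouched: gsub() returns a string unchanged when the pattern does not match it, and only the last input contains two colons. The sketch below reuses the patterns from above; the names three.groups, matched and out are only illustrative.

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

three.groups <- "(FACT[1-3]).*?:(FACT[1-3]).*?:(FACT[1-3]).*?"

# grepl() shows which strings the three-group pattern matches at all.
matched <- grepl(three.groups, my.strings)
which(matched)

# gsub() only rewrites the matched part of matching strings;
# everything else is returned as-is.
out <- gsub(three.groups, "\\1 \\2 \\3", my.strings)
identical(out[!matched], my.strings[!matched])

# The trailing lazy ".*?" matches nothing, so the final "1" of "FACT31" survives.
out[matched]
```

So a single fixed-group pattern cannot cover inputs with one, two, and three FACTs at the same time.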

You may use a base R approach, too:

> m <- regmatches(my.strings, gregexpr("FACT[1-3]", my.strings))
> sapply(m, paste, collapse=" ")
[1] "FACT1"            
[2] "FACT1 FACT2"      
[3] "FACT1 FACT2"      
[4] "FACT1 FACT2"      
[5] "FACT1 FACT2"      
[6] "FACT1 FACT2"      
[7] "FACT1 FACT2 FACT3"

Extract all matches with your FACT[1-3] (or FACT[0-9], or FACT\\d) pattern, and then "join" them with a space.
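
If the list form mentioned in the question is preferred, the intermediate result of regmatches() can be used directly, without the paste step. A minimal sketch, where fact.list is just an illustrative name:

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

# gregexpr() locates every match in each string; regmatches() extracts them
# as a list with one character vector per input string.
fact.list <- regmatches(my.strings, gregexpr("FACT[1-3]", my.strings))

fact.list[[1]]      # the single match from "FACT11"
fact.list[[7]]      # the three matches from the last string
lengths(fact.list)  # number of matches per input string
```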

One option is str_extract_all from stringr, which extracts every 'FACT' substring followed by one digit from 1 to 3 ([1-3]) into a list of vectors. Then map over the list elements and paste each vector into a single string:

library(tidyverse)
str_extract_all(my.strings, "FACT[1-3]") %>%
            map_chr(paste, collapse= ' ')
#[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"      
#[4] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
#[7] "FACT1 FACT2 FACT3"

Or using gsub from base R

gsub("\\s{2,}", " ", trimws(gsub("(FACT[1-3])(*SKIP)(*FAIL)|.",
                       " ", my.strings, perl = TRUE)))
#[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"      
#[4] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
#[7] "FACT1 FACT2 FACT3"
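
To see what the nested call does, it can be unrolled into steps. This is only a sketch on two of the inputs; x, step1, step2 and step3 are illustrative names. The (*SKIP)(*FAIL) verbs are PCRE backtracking controls (hence perl = TRUE): once FACT[1-3] has matched, the match is discarded and the search resumes after it, so only the remaining characters are matched by "." and replaced.

```r
x <- c("FACT11:FACT20", "FACT12:FACT22:FACT31")

# Step 1: every character that is not part of a "FACT[1-3]" token
# becomes a single space; the tokens themselves are skipped over.
step1 <- gsub("(FACT[1-3])(*SKIP)(*FAIL)|.", " ", x, perl = TRUE)

# Step 2: drop leading and trailing whitespace.
step2 <- trimws(step1)

# Step 3: collapse runs of two or more spaces into one.
step3 <- gsub("\\s{2,}", " ", step2)
step3
```

Note that two tokens with nothing between them (for example a hypothetical "FACT1FACT2") would not get a separating space with this approach.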

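Another alternative, using base R functions chained with the magrittr pipe (%>%):

This solution relies on the fact that only the first digit after each FACT is wanted: it keeps the first digit of every run of digits, replaces the rest of the run with ":", splits on ":", and keeps the pieces that contain "FACT".

library(magrittr)  # provides %>%

my.strings %>%  
  gsub("(\\d)\\d*", "\\1:", ., perl = TRUE) %>% 
  strsplit(":") %>%
  sapply(function(x) paste(x[grepl("FACT", x)], collapse = " "))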

[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
[5] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2 FACT3"

Comments
  • Thanks for the quick reply. Would you also have a solution in case I don't want to leave base R?
  • @bobbel Thanks. I added a base R option