Break string into several columns using tidyr::extract regex

tidyr separate multiple columns
r split column into multiple columns by separator
r split column into multiple rows
tidyr extract
r separate function
str_split_fixed
tidyr::separate unknown number of columns
split string into two columns

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:

sasdic <- data.frame(a=c(
  '@1   ANO_CENSO   5.  /*Ano do Censo*/',
  '@71  TP_SEXO $Char1. /*Sexo*/',
  '@72  TP_COR_RACA $Char1. /*Cor/raça*/',
  '@74  FK_COD_PAIS_ORIGEM  4.  /*Código País de origem*/' )) 

I would like for the:

  • first number ([0-9]+) to go to variable "int_pos"
  • the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
  • The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
  • Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", "" should be disconsidered)

This would be the result

d <- data.frame(int_pos=c(1,72,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','Chart1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )

I tryed to construct a regular expression for this. This is what I got so far:

sasdic %>% extract(a, c('int_pos','var_name','x','label'), 
                   "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") 
             -> d

above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.


In the regex used, we are matchng one more more punctuation characters ([[:punct:]]+) i.e. @ followed by capturing the numeric part ((\\d+) - this will be our first column of interest), followed by one or more white-space (\\s+), followed by the second capture group (\\S+ - one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (\\s+), then we capture the third group (([[:alum:]$]+) - i.e. one or more characters that include the alpha numeric along with $ so as to match $Char1), next we match one or more characters that are not a letter ([^A-Za-z]+- this should get rid of the space and *) and the last part we capture one or more characters that are not * (([^*]+).

sasdic %>% 
      extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
   "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")

#  int_pos           var_name      x                 label
#1       1          ANO_CENSO      5          Ano do Censo
#2      71            TP_SEXO $Char1                  Sexo
#3      72        TP_COR_RACA $Char1              Cor/raça
#4      74 FK_COD_PAIS_ORIGEM      4 Código País de origem

Separate a character column into multiple columns with a regular , Separate a character column into multiple columns with a regular expression or numeric If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the unite​() , the complement, extract() which uses regular expression capturing groups. Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA. If the groups don't match, or the input is NA, the output will be NA.


This is another option, though it uses the data.table package instead of tidyr:

library(data.table)
setDT(sasdic)

# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]                   
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]                                            

sasdic

#                    label int_pos           var_name       x
# 1:          Ano do Censo       1          ANO_CENSO      5.
# 2:                  Sexo      71            TP_SEXO $Char1.
# 3:              Cor/raça      72        TP_COR_RACA $Char1.
# 4: Código País de origem      74 FK_COD_PAIS_ORIGEM      4.

This assumes that the "remaining parts" (aside from the label) are space-separated.


This could also be done in one block (which is what I would do):

sasdic[, c("a","label","int_pos","var_name","x") := {
  x   = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^@","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]

Extract a character column into multiple columns using regular , Extract a character column into multiple columns using regular expression groups Given a regular expression with capturing groups, extract() turns each group into a new column. This is passed to tidyselect::vars_pull() . NB: this will cause string "NA" s to be converted to NA s. separate() to split up by a separator. Breaking up a string into columns using regex in pandas. raw female date score state; 0: Arizona 1 2014-12-23 3242.0: 1: 2014-12-23: 3242.0


You could use the package unglue :

library(unglue)
unglue_unnest(sasdic, a, "@{int_pos}{=\\s+}{varname}{=\\s+}{x}.{=\\s+}/*{label}*/")
#>   int_pos            varname      x                       label
#> 1       1          ANO_CENSO      5                Ano do Censo
#> 2      71            TP_SEXO $Char1                        Sexo
#> 3      72        TP_COR_RACA $Char1                 Cor/ra<e7>a
#> 4      74 FK_COD_PAIS_ORIGEM      4 C<f3>digo Pa<ed>s de origem

separate: Separate a character column into multiple columns using , Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns. In tidyr: Tidy Messy Data​. Description If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string​. If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values. If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.


extract: Extract a character column into multiple columns using in , Given a regular expression with capturing groups, extract() turns each group into a new extract: Extract a character column into multiple columns using In tidyr: Tidy Messy Data This is passed to tidyselect::vars_pull() . NB: this will cause string "NA" s to be converted to NA s. separate() to split up by a separator. If you want to split a string that matches a regular expression instead of perfect match, use the split() of the re module. re.split() — Regular expression operations — Python 3.7.3 documentation; In re.split(), specify the regular expression pattern in the first parameter and the target character string in the second parameter.


How to Split Text in a Column in Data Frame in R?, How to Split a Single Column into Multiple Columns with tidyr' separate()? By default, separate uses regular expression that matches any sequence of non-​alphanumeric values as delimiter to split. df %>% tidyr:: separate (file_name, How to Get Unique Values from a Column in Pandas Data Frame? MS Access: Split Function. This MSAccess tutorial explains how to use the Access Split function with syntax and examples. Description. The Microsoft Access Split function will split a string into substrings based on a delimiter. The result is returned as an array of substrings. Syntax. The syntax for the Split function in MS Access is:


Split uneven length vectors to columns with tidyr, library(stringr) library(tidyr) data<-tibble::tribble( ~value, "RATINGS: 4 MEAN: you can use tidyverse tools to get a fixed length column to separate. CALORIES: 213 ABV: 7.1%" ) # you can split the string in several, here is a It is what regular expressions call a character class and matches any 's' or '/'  string: Input vector. Either a character vector, or something coercible to one. pattern: Pattern to look for. The default interpretation is a regular expression, as described in stringi::stringi-search-regex. Control options with regex(). Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate.