Split a character column and get the new column names from fields in the string

I need to split a column that packs several pieces of information into separate columns. I'd use tstrsplit, but the fields do not appear in the same order in every row, and the name of each new column has to be extracted from the string itself. Important to know: there can be many fields (each to become a new variable) and I don't know all of them in advance, so I don't want a "field by field" solution.

Below is an example of what I have:

library(data.table)

myDT <- structure(list(chr = c("chr1", "chr2", "chr4"), pos = c(123L,
                  435L, 120L), info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2"
                  )), class = c("data.table", "data.frame"), row.names = c(NA,-3L))

#    chr pos                  info
#1: chr1 123          type=3;end=4
#2: chr2 435                 end=6
#3: chr4 120 end=5;pos=TRUE;type=2

And I'd like to get:

#    chr pos end  pos type
#1: chr1 123   4 <NA>    3
#2: chr2 435   6 <NA> <NA>
#3: chr4 120   5 TRUE    2

The most straightforward way to get that would be much appreciated! (Note: I'd rather not use a dplyr/tidyr approach.)


Using regex and the stringi package:

setDT(myDT) # After creating data.table from structure()

library(stringi)

fields <- unique(unlist(stri_extract_all(regex = "[a-z]+(?==)", myDT$info)))
patterns <- sprintf("(?<=%s=)[^;]+", fields)
myDT[, (fields) := lapply(patterns, function(x) stri_extract(regex = x, info))]
myDT[, !"info"]

#     chr  pos type end
# 1: chr1 <NA>    3   4
# 2: chr2 <NA> <NA>   6
# 3: chr4 TRUE    2   5

Edit: to get the correct column types, it seems type.convert() can be used:

myDT[, (fields) := lapply(patterns, function(x) type.convert(stri_extract(regex = x, info), as.is = TRUE))]
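
One thing to watch with the approach above: `info` contains a field called `pos`, so assigning to `(fields)` replaces the existing `pos` column (visible in the printed result, where `pos` holds NA/TRUE rather than 123/435/120). Below is a minimal sketch of one way around that, assuming it is acceptable to prefix the new columns. The `info_` prefix and the `new_names` variable are illustrative choices, not part of the original answer.

```r
library(data.table)
library(stringi)

myDT <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  pos  = c(123L, 435L, 120L),
  info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")
)

# Collect every field name that appears before an "=" in any row
fields <- unique(unlist(stri_extract_all_regex(myDT$info, "[a-z]+(?==)")))

# One look-behind pattern per field: the text after "<field>=" up to the next ";"
patterns <- sprintf("(?<=%s=)[^;]+", fields)

# Hypothetical prefix so the "pos" field cannot clash with the pos column
new_names <- paste0("info_", fields)

myDT[, (new_names) := lapply(patterns, function(p) {
  type.convert(stri_extract_first_regex(info, p), as.is = TRUE)
})]
myDT[, info := NULL]
print(myDT)
```

The original `pos` column keeps its integer positions, and the extracted field lands in `info_pos`.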


I am guessing your data comes from a VCF file; if so, there is a dedicated tool for such problems: bcftools.

Let's create example VCF file for testing:

# subset some data from 1000genomes data
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 > myFile.vcf
# zip it and index:
bgzip -c myFile.vcf > myFile.vcf.gz
tabix -p vcf myFile.vcf.gz

Now we can use bcftools. As an example, here we extract AF and DP from the INFO column:

# DP (read depth) is printed before AF (allele frequency), matching the output below
bcftools query -f '%CHROM %POS %INFO/DP %INFO/AF\n' myFile.vcf.gz
17  1471199  1916 0.088
17  1471538  2445 0.016
17  1471611  2733 0.239
17  1471623  2815 0.003
17  1471946  1608 0.007
17  1471959  1612 0.014
17  1471975  1610 0.179

See the manual for more query options.


We could split on ";", reshape wide-to-long, split again on "=", and then reshape back long-to-wide:

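# Work on a copy of the question's table; `:=` below modifies dt by reference
dt <- copy(myDT)

# Note: 1:3 hard-codes the maximum number of fields per row in this example
dcast(
  melt(dt[, paste0("col", 1:3) := tstrsplit(info, split = ";")],
       id.vars = c("chr", "pos", "info"))[, -c("info", "variable")][
         , c("x1", "x2") := tstrsplit(value, split = "=")][
           , value := NULL][!is.na(x1), ],
  chr + pos ~ x1, value.var = "x2")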

#     chr pos end  pos type
# 1: chr1 123   4 <NA>    3
# 2: chr2 435   6 <NA> <NA>
# 3: chr4 120   5 TRUE    2

An improved / more readable version:

dt[, paste0("col", 1:3) := tstrsplit(info, split = ";")
   ][, melt(.SD, id.vars = c("chr", "pos", "info"), na.rm = TRUE)
     ][, -c("info", "variable")
       ][, c("x1", "x2") := tstrsplit(value, split = "=")
         ][, dcast(.SD, chr + pos ~ x1, value.var = "x2")]
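
The `paste0("col", 1:3)` step above fixes the maximum number of fields per row at three. Here is a sketch that avoids that assumption by going straight to long format, assuming `chr` and `pos` together identify a row. The names `long`, `kv`, `key`, `value` and the `info_` prefix (used so the `pos` field does not clash with the `pos` column) are illustrative.

```r
library(data.table)

myDT <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  pos  = c(123L, 435L, 120L),
  info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")
)

# One row per "key=value" piece, however many pieces a row has
long <- myDT[, .(kv = unlist(strsplit(info, ";", fixed = TRUE))), by = .(chr, pos)]

# Split each piece into its key and its value
long[, c("key", "value") := tstrsplit(kv, "=", fixed = TRUE)]

# Prefix the keys so they cannot collide with the id columns
long[, key := paste0("info_", key)]

# Back to wide: one column per distinct key, NA where a row lacks the field
wide <- dcast(long, chr + pos ~ key, value.var = "value")
print(wide)
```

The values stay character in this sketch; `type.convert(..., as.is = TRUE)` could be applied to the new columns afterwards if typed columns are wanted.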


For now, I managed to get what I want with the following code:

newDT <- reshape(splitstackshape::cSplit(myDT, "info", sep=";", "long")[, 
                  c(.SD, tstrsplit(info, "="))], 
                 idvar=c("chr", "pos"), direction="wide", timevar="V4", drop="info")
setnames(newDT, sub("V5\\.", "", names(newDT)))

newDT
#    chr pos type end  pos
#1: chr1 123    3   4 <NA>
#2: chr2 435 <NA>   6 <NA>
#3: chr4 120    2   5 TRUE

Two ways to improve the lines above, thanks to @A5C1D2H2I1M1N2O1R2T1 (who suggested them in the comments):

. with a double cSplit prior to dcast:

library(splitstackshape) # provides cSplit()

cSplit(cSplit(myDT, "info", ";", "long"), "info", "=")[, dcast(.SD, chr + pos ~ info_1, value.var = "info_2")]

. with cSplit/tstrsplit and dcast instead of reshape:

cSplit(myDT, "info", ";", "long")[, c("t1", "t2") := tstrsplit(info, "=", fixed = TRUE)][, dcast(.SD, chr + pos ~ t1, value.var = "t2")]


Here's how I'd do it:

library(data.table)

myDT <- structure(list(chr = c("chr1", "chr2", "chr4"), pos = c(123L,
                  435L, 120L), info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2"
                  )), class = c("data.table", "data.frame"), row.names = c(NA,-3L))

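# Turn each info string into R source, e.g. "list(type=3,end=4)"
R_strings <- paste0("list(", chartr(";", ",", myDT$info),")")
# Parse and evaluate that source (only do this with trusted input)
lists <- lapply(parse(text=R_strings),eval)
myDT[,info:=NULL]
# fill = TRUE pads the fields missing from a row with NA
myDT <- cbind(myDT,rbindlist(lists, fill = TRUE))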
myDT
#>     chr pos type end  pos
#> 1: chr1 123    3   4   NA
#> 2: chr2 435   NA   6   NA
#> 3: chr4 120    2   5 TRUE

Created on 2019-11-29 by the reprex package (v0.3.0)
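
Because `parse()`/`eval()` runs whatever the `info` strings contain as R code, the approach above is best kept to trusted input. Here is a sketch of the same idea (one named list per row, combined with `rbindlist(fill = TRUE)`) without evaluating any text. The helper `to_row()` and the `info_` prefix are hypothetical names introduced here.

```r
library(data.table)

myDT <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  pos  = c(123L, 435L, 120L),
  info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")
)

# Hypothetical helper: "type=3;end=4" -> list(type = "3", end = "4")
to_row <- function(s) {
  kv <- tstrsplit(strsplit(s, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  as.list(setNames(kv[[2]], kv[[1]]))
}

# Stack the per-row lists; fill = TRUE pads fields a row lacks with NA
parsed <- rbindlist(lapply(myDT$info, to_row), fill = TRUE)

# Guess column types (integer, logical, ...) from the character values
parsed <- parsed[, lapply(.SD, type.convert, as.is = TRUE)]

# Prefix the new columns so the "pos" field does not duplicate the pos column
setnames(parsed, paste0("info_", names(parsed)))

result <- cbind(myDT[, !"info"], parsed)
print(result)
```

This keeps the typed columns that the `eval()` version produces, while treating the field values purely as data.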
