Split character columns and get the names of the fields from the string
I need to split a column that contains information into several columns.
I would use tstrsplit, but the same kind of information is not in the same order across rows, and I need to extract the name of each new column from within the variable itself. Important to know: there can be many pieces of information (fields to become new variables) and I don't know all of them, so I don't want a "field by field" solution.
Below is an example of what I have:
    library(data.table)
    myDT <- structure(list(chr = c("chr1", "chr2", "chr4"),
                           pos = c(123L, 435L, 120L),
                           info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")),
                      class = c("data.table", "data.frame"),
                      row.names = c(NA, -3L))
    #     chr pos                  info
    #1: chr1 123          type=3;end=4
    #2: chr2 435                 end=6
    #3: chr4 120 end=5;pos=TRUE;type=2
And I'd like to get:
    #     chr pos end  pos type
    #1: chr1 123   4 <NA>    3
    #2: chr2 435   6 <NA> <NA>
    #3: chr4 120   5 TRUE    2
The most straightforward way to get that would be much appreciated! (Note: I'm not willing to go the dplyr/tidyr way.)
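Using regex and the stringi package:

    setDT(myDT) # After creating data.table from structure()
    library(stringi)

    # Field names are the lowercase words directly followed by "="
    fields <- unique(unlist(stri_extract_all(regex = "[a-z]+(?==)", myDT$info)))
    # One lookbehind pattern per field, capturing everything up to the next ";"
    patterns <- sprintf("(?<=%s=)[^;]+", fields)
    # Note: "pos" is also an existing column name, so that column is overwritten here
    myDT[, (fields) := lapply(patterns, function(x) stri_extract(regex = x, info))]
    myDT[, !"info"]

    #     chr  pos type end
    # 1: chr1 <NA>    3   4
    # 2: chr2 <NA> <NA>   6
    # 3: chr4 TRUE    2   5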
Edit: To get the correct type, it seems (?) type.convert() can be used:

    myDT[, (fields) := lapply(patterns, function(x)
      type.convert(stri_extract(regex = x, info), as.is = TRUE))]
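One thing to watch with the approach above: the extracted field names include pos, which is also an existing column, so the assignment replaces the original positions (this is visible in the printed result). The following is a minimal sketch of one way around that, not part of the original answer. It uses only base R regex functions plus data.table, prefixes the new columns with info_ (an arbitrary choice), and anchors each pattern on the start of the string or a ";" so that a name such as end cannot match inside a longer field name. The helper name extract_field is hypothetical.

```r
library(data.table)

myDT <- data.table(chr  = c("chr1", "chr2", "chr4"),
                   pos  = c(123L, 435L, 120L),
                   info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2"))

# Field names: lowercase words directly followed by "="
fields <- unique(unlist(
  regmatches(myDT$info, gregexpr("[a-z]+(?==)", myDT$info, perl = TRUE))
))

# Hypothetical helper: value of one field per row, NA where the field is absent
extract_field <- function(field, x) {
  pattern <- sprintf("(?:^|;)%s=([^;]*)", field)
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, captured value), or character(0) when no match
  vapply(hits, function(h) if (length(h) == 2L) h[2L] else NA_character_,
         character(1))
}

# Prefix the new columns so they cannot collide with existing ones
new_cols <- paste0("info_", fields)
myDT[, (new_cols) := lapply(fields, function(f)
  type.convert(extract_field(f, info), as.is = TRUE))]
myDT[, !"info"]
```

The prefix can be removed afterwards with setnames() for any field whose name does not clash with an existing column.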
Let's create example VCF file for testing:
    # subset some data from 1000genomes data
    tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 > myFile.vcf
    # zip it and index:
    bgzip -c myFile.vcf > myFile.vcf.gz
    tabix -p vcf myFile.vcf.gz
Now we can use bcftools. Here, as an example, we extract AF and DP from the INFO column:

    bcftools query -f '%CHROM %POS %INFO/AF %INFO/DP \n' myFile.vcf.gz

    17 1471199 1916 0.088
    17 1471538 2445 0.016
    17 1471611 2733 0.239
    17 1471623 2815 0.003
    17 1471946 1608 0.007
    17 1471959 1612 0.014
    17 1471975 1610 0.179
See the manual for more query options.
We could split on ";", then reshape wide-to-long, then split again on "=", then reshape back long-to-wide:
    # dt is assumed to hold the question's data (e.g. dt <- copy(myDT))
    # paste0("col", 1:3) assumes at most three fields per row
    dcast(
      melt(dt[, paste0("col", 1:3) := tstrsplit(info, split = ";")],
           id.vars = c("chr", "pos", "info"))[
        , -c("info", "variable")][
        , c("x1", "x2") := tstrsplit(value, split = "=")][
        , value := NULL][
        !is.na(x1), ],
      chr + pos ~ x1, value.var = "x2")

    #     chr pos end  pos type
    # 1: chr1 123   4 <NA>    3
    # 2: chr2 435   6 <NA> <NA>
    # 3: chr4 120   5 TRUE    2
An improved, more readable version:
    dt[, paste0("col", 1:3) := tstrsplit(info, split = ";")
       ][, melt(.SD, id.vars = c("chr", "pos", "info"), na.rm = TRUE)
         ][, -c("info", "variable")
           ][, c("x1", "x2") := tstrsplit(value, split = "=")
             ][, dcast(.SD, chr + pos ~ x1, value.var = "x2")]
For now, I managed to get what I want with the following code:
    newDT <- reshape(
      splitstackshape::cSplit(myDT, "info", sep = ";", "long")[
        , c(.SD, tstrsplit(info, "="))],
      idvar = c("chr", "pos"), direction = "wide", timevar = "V4", drop = "info")
    setnames(newDT, sub("V5\\.", "", names(newDT)))
    newDT
    #     chr pos type end  pos
    #1: chr1 123    3   4 <NA>
    #2: chr2 435 <NA>   6 <NA>
    #3: chr4 120    2   5 TRUE
Two options to improve the lines above, thanks to @A5C1D2H2I1M1N2O1R2T1 (who gave them in comments):

1. with a double cSplit prior to dcast:

    cSplit(cSplit(myDT, "info", ";", "long"), "info", "=")[
      , dcast(.SD, chr + pos ~ info_1, value.var = "info_2")]

2. with cSplit and tstrsplit, then dcast instead of reshape:

    cSplit(myDT, "info", ";", "long")[
      , c("t1", "t2") := tstrsplit(info, "=", fixed = TRUE)][
      , dcast(.SD, chr + pos ~ t1, value.var = "t2")]
Here's how I'd do it:

    library(data.table)
    myDT <- structure(list(chr = c("chr1", "chr2", "chr4"),
                           pos = c(123L, 435L, 120L),
                           info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")),
                      class = c("data.table", "data.frame"),
                      row.names = c(NA, -3L))

    # Turn each info string into the text of an R list() call, then evaluate it
    R_strings <- paste0("list(", chartr(";", ",", myDT$info), ")")
    lists <- lapply(parse(text = R_strings), eval)
    myDT[, info := NULL]
    myDT <- cbind(myDT, rbindlist(lists, fill = TRUE))
    myDT
    #>     chr pos type end  pos
    #> 1: chr1 123    3   4   NA
    #> 2: chr2 435   NA   6   NA
    #> 3: chr4 120    2   5 TRUE
Created on 2019-11-29 by the reprex package (v0.3.0)
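Because the approach above evaluates the content of info as R code, it depends on every value being a valid R literal (an unquoted text value would be looked up as a variable) and on the strings being trusted. As a hedged alternative, here is a minimal sketch that keeps the same rbindlist(fill = TRUE) idea but parses the key=value pairs as plain text. The helper name parse_info is hypothetical, and the sketch assumes that ";" and "=" never occur inside a value.

```r
library(data.table)

myDT <- data.table(chr  = c("chr1", "chr2", "chr4"),
                   pos  = c(123L, 435L, 120L),
                   info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2"))

# Hypothetical helper: one "key=value;key=value" string -> named list of strings
parse_info <- function(x) {
  pairs <- strsplit(x, ";", fixed = TRUE)[[1]]
  kv    <- strsplit(pairs, "=", fixed = TRUE)
  keys  <- vapply(kv, `[`, character(1), 1L)
  vals  <- vapply(kv, `[`, character(1), 2L)  # NA when a pair has no "="
  as.list(setNames(vals, keys))
}

# Stack the per-row lists; fill = TRUE inserts NA for fields a row lacks
parsed <- rbindlist(lapply(myDT$info, parse_info), fill = TRUE)
# Convert each character column to a more specific type where possible
parsed <- parsed[, lapply(.SD, type.convert, as.is = TRUE)]

result <- cbind(myDT[, .(chr, pos)], parsed)
result
```

As in the answer above, the result carries two columns named pos (the original position and the parsed field), which matches the layout asked for in the question.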