Regular expression to extract first word + first character of all following words

I am a newbie using R and regular expressions to manipulate strings in a data.frame column. My data look like this in R:

Peter Parker            
Hawk & Dove             
J Jonah Jameson         
3JPX spo                
Bruce Wayne              

What I am trying to get is a second column "c2" that consists of the following strings:

PeterP
Hawk&D
JJJ
3JPXs
BruceW

Basically I want the entire first word of the string (regardless of length) and the first alphanumeric element of every word after. I have not been able to find any function or logic for this. Is it possible to do so with regex?

Thanks in Advance

Here is a base R approach using gsub:

x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
output <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", x, perl=TRUE)
output

[1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"

The regex pattern \s+(\S)\S*(?!\S) matches one or more whitespace characters, then matches and captures the first character of the following word. It also consumes the remainder of that word, so the whole match is replaced with only the captured first character.

In case the above is still unclear, here is how the regex pattern works, step by step:

\s+    match one or more space characters
(\S)   then match AND capture the first character of the name-word
\S*    match the remainder of the name-word
(?!\S) assert that what follows the end of the name-word is either a space
       or the end of the string
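To see what the pattern actually consumes, one option is to list its matches before doing any replacement. This is only an illustrative sketch; the variable names pattern and matches are assumptions, not part of the original answer.

```r
x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson")
pattern <- "\\s+(\\S)\\S*(?!\\S)"

# List the substrings that the pattern matches in each input string.
# Each match starts with the whitespace that precedes a word.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
matches
```

Every word after the first shows up as one match, including its leading whitespace, which is why the first word is left untouched.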

The replacement in the call to gsub is just \1, which is the first and only capture group, corresponding to the first character of each word after the very first one.
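As a small sketch of how this could be applied to the data.frame column from the question (assuming the column is named c1 and the new column c2), and to a longer string such as the five-word name raised in the comments:

```r
df <- data.frame(c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson",
                        "3JPX spo", "Bruce Wayne"),
                 stringsAsFactors = FALSE)

# Same pattern as in the answer, stored in a new column c2
df$c2 <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", df$c1, perl = TRUE)
df

# gsub replaces every match, so strings with more words are handled too
long_name <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1",
                  "Albus Percival Wulfric Brian Dumbledore", perl = TRUE)
long_name
# [1] "AlbusPWBD"
```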

Though not strictly a regex solution, a different approach is to bring the data into long format by separating each word, keep the first word as it is, take only the first character of each remaining word, and paste them together.


library(dplyr)

df %>%
  group_by(row = row_number()) %>%
  tidyr::separate_rows(c1, sep = "\\s+") %>%
  summarise(c2 = paste0(first(c1) , paste0(substr(c1[-1], 1, 1), collapse = "")),
            c1 = paste(c1, collapse = " ")) %>%
  select(c1, c2, -row)

#   c1              c2    
#  <chr>           <chr> 
#1 Peter Parker    PeterP
#2 Hawk & Dove     Hawk&D
#3 J Jonah Jameson JJJ   
#4 3JPX spo        3JPXs 
#5 Bruce Wayne     BruceW


Data:

df <- structure(list(c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", 
"3JPX spo", "Bruce Wayne")), row.names = c(NA, -5L), class = "data.frame")
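The same split-and-paste idea can also be sketched in base R without dplyr or tidyr. This is a minimal sketch, not the answerer's code: abbreviate_words is a hypothetical helper name, and it assumes each string is non-empty and has no leading whitespace.

```r
x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")

# Hypothetical helper: keep the first word whole, then append the
# first character of every following word.
abbreviate_words <- function(s) {
  words <- strsplit(s, "\\s+")[[1]]
  paste0(words[1], paste0(substr(words[-1], 1, 1), collapse = ""))
}

c2 <- vapply(x, abbreviate_words, character(1), USE.NAMES = FALSE)
c2
```

Like the gsub version, this works for any number of words, since it loops over whatever strsplit returns.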

The development version of unglue features a multiple argument, which can be a function to apply to identically named matches (here we'd want to concatenate them with paste0()). In our case we want to match the full first word, then the first character of all sequences separated by space, and we have either 1 or 2 of such sequences following the first word:

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
patterns <- c(
  "{c2} {c2=\\S}{=\\S*} {c2=\\S}{=\\S*}",
  "{c2} {c2=\\S}{=\\S*}")

unglue_data(df$c1, patterns, multiple = paste0)
#>       c2
#> 1 PeterP
#> 2 Hawk&D
#> 3    JJJ
#> 4  3JPXs
#> 5 BruceW  


  • BTW, do you mean to have c2 from c1, or is that a typo?
  • Yeah. I want the values in column c2 to be derived from the values in column c1
  • Ahhh, column names. I did not assume that that was a data.frame or matrix. At times, it can be both useful (to us) and absolutely clear to provide data in a more unambiguous format, such as programmatically with data.frame(...) or with dput(x); while the latter does not look as awesome, it can give a completely identical object with the least effort (on our part).
  • I did actually say it was a data.frame column in the question title. But from next time onwards I'll use 'data.frame(...)' notation as well. :) Thanks
  • Bad on me, thanks. (I find multi-line titles to be a bit busy, so I must have skimmed too quickly. I'll try better next time :-)
  • Thanks, this works perfectly. Just out of curiosity: is there a limit to the number of words in a string it will work on?
  • I'm not sure I follow your comment. Can you show me one of the current inputs along with how you want it to look under your new requirements?
  • For example: Albus Percival Wulfric Brian Dumbledore -> AlbusPWBD. Would the above approach work on strings of 5 words or longer as well?
  • @JohnR Yes, it would work on names consisting of any number of name words. Try it out. The g in gsub means "global" replacement, so it covers all words.
  • JohnR, have you tried it? Especially given how short Tim's solution is, it would take you less time to try it on that name than to type the question ... and then you have to wait for the reply.