Extract number from a character string, if it is followed by certain characters in R

I have a dataframe with a variable that contains food quantities in different measurement units. The dataframe contains ~11000 observations.

Let me give you this example: "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup, 20 grapes, 1 gelbe Paprika"

I found a way to extract the numbers and sum them up, using this function:

library(stringr)

sum_numerics <- function(x) {
  # Grab every run of digits in each string
  matches <- str_match_all(x, "[0-9]+")

  # Convert each string's matches to numeric, then sum them
  sapply(matches, function(y) sum(as.numeric(y)))
}

What I'm looking for is a way to extract all food quantities that are given in grams and write them into a new variable, so that I can sum them up in the next step. I spent some time looking for ways to do this and experimenting with a regex demo, but I can't find a working solution and I really can't figure out how to write working regular expressions. Shame on me!

User "Max Teflon" provided a possible solution that looks, after some more investigation, like this:

get_gramms <- function(x) {
  # Grab every number followed by an optional space and a gram-like unit
  str_extract_all(x, "([0-9]+\\s?([gG]|[gGrRaAmM]{5,6}|[gGrRaAmM]{2}))") %>%
    unlist() %>%                          # a vector is what we want
    str_remove_all('[[:alpha:]]') %>%     # drop the unit letters
    str_trim() %>%                        # remove trailing whitespace
    as.numeric()                          # change to numbers
}

x %>%
  mutate(var = map(var, ~get_gramms(.))) %>%
  mutate(var = map_dbl(var, ~ifelse(length(.) > 0, sum(.), NA)))

I think his answer is close to solving my problem, but it still returns wrong values, for example for "1 gelbe Paprika".

Looking forward to new ideas, solutions!

Maybe you can try the code below, using gsub() + regmatches() + gregexpr() from base R

r <- sum(as.numeric(gsub("(\\d+).*",
                         "\\1",
                         unlist(regmatches(s, gregexpr("\\d+\\s?(g|gr|gram|gramm|grams)\\b", s, ignore.case = TRUE))))))

such that

> r
[1] 422

DATA

s <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

EDIT: If you want to apply the manipulation along a column, you can do it like this:

f <- Vectorize(function(s) {
  sum(as.numeric(gsub("(\\d+).*",
                      "\\1",
                      unlist(regmatches(s, gregexpr("\\d+\\s?(g|gr|gram|gramm|grams)\\b", s, ignore.case = TRUE))))))
})

df <- within(df, y <- f(x))
# sum() over no matches is 0, so turn 0 into NA and keep every other sum
df <- within(df, y <- ifelse(y == 0, NA, y))
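
One limitation of converting 0 to NA afterwards is that a genuine total of 0 grams would also become NA. A minimal base R sketch that decides per row whether anything matched is shown below; the helper name sum_grams, the example data frame, and the list of unit spellings are assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: sum all gram quantities per string, NA if none are found
sum_grams <- function(x) {
  # Digits, an optional space, an assumed gram spelling, then a word boundary
  pattern <- "\\d+\\s?(g|gr|gram|gramm|grams)\\b"
  hits <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))
  vapply(hits, function(h) {
    if (length(h) == 0) return(NA_real_)   # no gram quantity in this row
    sum(as.numeric(sub("\\D.*$", "", h)))  # keep only the leading digits
  }, numeric(1))
}

# Illustrative data; the column name x follows the answer above
df <- data.frame(x = c(
  "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon",
  "2 tbsp olive-oil, 1oz ketchup, 20 grapes, 1 gelbe Paprika",
  "0 g salt"
), stringsAsFactors = FALSE)

df$y <- sum_grams(df$x)
df$y
#> [1] 422  NA   0
```

The word boundary after the unit is what keeps "20 grapes" and "1 gelbe Paprika" from being counted.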

You could use a look-ahead assertion and remove the whitespaces afterwards:

library(tidyverse)
x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

sum_numerics <- function(x) {

  # Grab all numbers that appear 
  str_match_all(x, "[0-9]+\\s?(?=[gG])") %>% # any number followed by an optional space and a small/capital g
    unlist() %>% # a vector is what we want
    str_trim() %>% # remove all trailing whitespaces
    as.numeric() %>% # change to number
    sum() # sum it up

}
sum_numerics(x)
#> [1] 422

Or, if you just want to get all the numbers and use them afterwards:

library(tidyverse)
x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

get_gramms <- function(x) {

  # Grab all numbers that appear 
  str_match_all(x, "[0-9]+\\s?(?=[gG])") %>% # any number followed by an optional space and a small/capital g
    unlist() %>% # a vector is what we want
    str_trim() %>% # remove all trailing whitespaces
    as.numeric() # change to numbers
}
get_gramms(x)
#> [1]  10   7   5 400

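Note that the optional whitespace could also be moved into the look-ahead, e.g. "[0-9]+(?=\\s?[gG])", which would make the str_trim() step unnecessary. In the ICU regex engine used by stringr, only look-behind assertions need a bounded length; look-ahead assertions may contain optional parts.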
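
As a sketch of a stricter variant of the look-ahead approach, the optional whitespace and a list of unit spellings can both sit inside the assertion, followed by a word boundary. The unit list and the name gram_pattern are assumptions for illustration; the example string is the one from the question, including "20 grapes" and "1 gelbe Paprika".

```r
library(stringr)

x <- "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup, 20 grapes, 1 gelbe Paprika"

# Only the digits are matched; the look-ahead checks for optional whitespace,
# an assumed gram spelling, and a word boundary without consuming them
gram_pattern <- regex("[0-9]+(?=\\s?(?:g|gr|gram|gramm|grams)\\b)", ignore_case = TRUE)

grams <- as.numeric(str_extract_all(x, gram_pattern)[[1]])
grams
#> [1]  10   7   5 400
```

Because nothing after the digits is consumed, no trimming step is needed before as.numeric().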

This is somewhat ugly but we can use:

# Credit to ThomasisCoding (learnt something new)
sum(as.numeric(unlist(sapply(strsplit(my_string, ","),
        function(x) stringr::str_extract_all(gsub("\\s", "", x),
                "\\d+(?=[gG][rams]?)")))))
[1] 422

Data:

my_string<-"10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup"

Using str_extract_all

library(stringr)

str_extract_all(my_string,"[0-9]+(?=[ ]{0,2}[gG])")[[1]] %>% 
  as.numeric()%>%
  sum()

[1] 422

If you now have a vector of strings:

mystrings <- c("10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon, 2 tbsp olive-oil, 1oz ketchup",
               "but also 5g of something and 10 Gr of other stuffs")

str_extract_all(mystrings,"[0-9]+(?=[ ]{0,2}[gG])") %>%
  lapply(.,function(x) as.numeric(x) %>%
           sum()
         )

[[1]]
[1] 422

[[2]]
[1] 15
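
If one value per row is needed, with NA where no gram quantity is present, the list returned by str_extract_all() can be collapsed with vapply(). The sketch below is only an illustration: the unit spellings, the third example string, and the names unit, tokens, quantities and totals are assumptions.

```r
library(stringr)

mystrings <- c(
  "10gr peterselie, 7 Grams look, 5g kruiden en 400GRAMM bouillon",
  "but also 5g of something and 10 Gr of other stuffs",
  "20 grapes, 1 gelbe Paprika"
)

# Assumed unit spellings; extend the alternation as needed
unit <- "(?:g|gr|gram|gramm|grams)\\b"
tokens <- str_extract_all(mystrings, regex(paste0("[0-9]+\\s?", unit), ignore_case = TRUE))

# Matched quantities with their unit, one string per row (NA when nothing matched)
quantities <- vapply(tokens, function(tok) {
  if (length(tok)) paste(tok, collapse = " ") else NA_character_
}, character(1))

# Summed grams per row (NA when nothing matched)
totals <- vapply(tokens, function(tok) {
  if (length(tok)) sum(as.numeric(str_extract(tok, "[0-9]+"))) else NA_real_
}, numeric(1))

quantities
#> [1] "10gr 7 Grams 5g 400GRAMM" "5g 10 Gr"                 NA
totals
#> [1] 422  15  NA
```

Both results have the same length as the input, so they can be assigned directly to new columns of a data frame.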

Comments
  • You want 10,7,5,400?
  • Hey Nelson, yes, I want to extract each number that is followed by g, gr, grams, gramm, or however else a person might write grams.
  • Solves my problem, but I think it is not enough to look for small and capital g, because in some rows there could be something like "20 Grapes".
  • @SebasSchu I updated my answer so you can try if it works well
  • Because my dataset has ~11000 observations, I need something that returns NA if there is no information in grams within the character string.
  • @SebasSchu My code will return 0 when nothing is found. In that case, you can replace 0 by NA, e.g., r <- ifelse(r==0,NA,r)
  • @SebasSchu no, you cannot use the code that way, see my updates in "EDIT" for your updated information
  • Thank you Max, your solution helps a lot, even though it does not completely provide the result I'm looking for. If I apply this function to a variable I want all numerics and the following unit (g, gr, GRAMM, whatever) per row. In the case of my example I'm looking for this in y: "10gr 7 Grams 5g 400GRAMM"
  • Ah, I misinterpreted your function name. If you only want to get the numbers before summing, just remove the %>% sum() part in the code above. Is that the solution you are searching for? I edited the code appropriately.
  • Thank you Max, your answer solves the problem quite well, except that I need the function to return NA, if there are no values in x that match. Otherwise I get an error when using the function to mutate a new variable y, that the columns don't have the same length.
  • So you want to turn your character input into a vector of the appropriate length (in your example 6?). This is not really within the scope of your question. It would be helpful if you could add the result you expect from your example to your question. Am I right to think that you expect 10, 7, 5, 400, NA, NA as your output? Do you want to parse only Dutch recipes or other languages as well? If the latter, the inconsistent separation with ands (en) and commas could complicate things.
  • Sorry Max, to clarify: The string above is just an example. I have a dataset that consists of ~11000 observations and one variable contains information like the one I used as an example. I want to extract all values in grams, but if there are none, it should return NA for this observation. Is that understandable?