Subset string by counting specific characters

I have the following strings:


I want to cut off each string as soon as the combined number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:



I tried using stringi, stringr and regular expressions, but I can't figure it out.

You can accomplish your task with a simple call to str_extract from the stringr package:



library(stringr)
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, or N that you are looking for by changing the integer in the curly braces:

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"

There are also a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
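
The base R variants behave differently when a string has too few matches: regmatches drops such elements, and sub returns the input unchanged. A minimal sketch of a wrapper that keeps the vector length and returns NA instead, as str_extract does, might look like this. The helper name cut_after_n is hypothetical, and the example strings are an assumption chosen only to be consistent with the outputs quoted above, since the original input is not shown.

```r
# Assumed example data (not from the original question); picked so that the
# results agree with the outputs quoted in the answers above.
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# Hypothetical helper: cut each string after the n-th occurrence of any
# character in `chars`; strings with fewer than n occurrences become NA.
cut_after_n <- function(x, chars = "AGN", n = 3) {
  pattern <- sprintf("([^%s]*[%s]){%d}", chars, chars, n)
  m <- regexpr(pattern, x)
  hit <- as.integer(m) != -1L          # regexpr returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches only returns the matches
  out
}

cut_after_n(strings, n = 3)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_after_n(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```

Note that `chars` is pasted into a bracket expression, so characters that are special inside brackets (such as ^ or ]) would need extra care.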

Here is a base R option using strsplit

sapply(strsplit(strings, ""), function(x)
    paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse

library(purrr)
library(stringr)
map_chr(str_split(strings, ""), 
    ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
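
One caveat with the which.max idiom: if the running count never reaches the target, the comparison is FALSE everywhere and which.max returns 1, so the string is silently cut to its first character. The sketch below shows the running count on one string and a guarded variant that returns NA instead. The function name cut_at_count is hypothetical, and the example strings are assumed (the original input is not shown).

```r
# Assumed example data, consistent with the outputs quoted in the answers.
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# Running count of A/G/N along one string that contains only three of them
x <- strsplit("AABSDGDRY", "")[[1]]
running <- cumsum(x %in% c("A", "G", "N"))
running
# [1] 1 2 2 2 2 3 3 3 3
which.max(running == 4)
# [1] 1

# Hypothetical guarded variant: match() gives NA when the count is never reached
cut_at_count <- function(s, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(s, ""), function(x) {
    idx <- match(n, cumsum(x %in% chars))  # first position where the count is n
    if (is.na(idx)) NA_character_ else paste(x[seq_len(idx)], collapse = "")
  }, character(1))
}

cut_at_count(strings, n = 3)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_at_count(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```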

Identify the positions of the pattern using gregexpr, then take the n-th position (here the 3rd) and extract everything from position 1 up to that position using substr.

nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))


If a string has fewer than 3 matches, the result for it is NA, so you just need to use na.omit on the final result.
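
Because substr is vectorized over its arguments, the same idea can be written without the outer sapply over the strings. This is only a sketch in base R: the example strings are assumed (the original input is not shown), and pos is a name introduced here for illustration.

```r
# Assumed example data, consistent with the outputs quoted in the answers.
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

nChars <- 3
pattern <- "[AGN]"

# Position of the nChars-th match in each string; indexing past the last
# match yields NA, which substr() then propagates to the result.
pos <- vapply(gregexpr(pattern, strings), function(m) m[nChars], integer(1))
pos
# [1] 7 6 3 3

res <- substr(strings, 1, pos)
res
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
```

With nChars set to 4, the second assumed string has only three matches, so its entry would be NA and could be dropped with na.omit as described above.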

This is just a version of Maurits Evers' neat solution without strsplit.

sapply(strings,
       function(x) {
         raw <- rawToChar(charToRaw(x), multiple = TRUE)
         idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
         paste(raw[1:idx], collapse = "")
       })
##   "ABBSDGN"    "AABSDG"       "AGN"       "GGG"

Or, slightly different, without strsplit and paste:

test <- charToRaw("AGN")
sapply(strings,
       function(x) {
         raw <- charToRaw(x)
         idx <- which.max(cumsum(raw %in% test) == 3)
         rawToChar(raw[1:idx])
       })

Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.

 reduce_strings = function(str, chars, cnt){

  # Replacing chars in str with "!"
  chars = paste0(chars, collapse = "")
  replacement = paste0(rep("!", nchar(chars)), collapse = "")
  str_alias = chartr(chars, replacement, str) 

  # Obtain indices with ! for each string
  idx = stringr::str_locate_all(pattern = '!', str_alias)

  # Reduce each string in str
  reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
  result = vapply(seq_along(str), reduce, character(1))
  result
}

# Example call
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(strings, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

  • I don't think it can get much better than the one-liner str_extract(strings, '([^AGN]*[AGN]){3}'). Nice one!
  • Nice! substr is vectorized, so I would simplify your last line like this: substr(strings, 1, map_int(gregexpr(pattern, strings), nChars)), where map_int from purrr is used.