## R: Count all combinations in a list of strings (Specific Order)

I am trying to count all sequences in a large list of character strings delimited by ">", but only the combinations whose elements are directly next to each other.

e.g. given the character vector:

    [1] Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
    [2] Referral>Referral>Referral

I can run the following lines to retrieve all combinations of 2 elements:

    split_fn <- sapply(p, strsplit, split = ">", perl = TRUE)
    split_fn <- sapply(split_fn, function(x) paste(head(x, -1), tail(x, -1), sep = ">"))

Returns:

    [[1]]
     [1] "Social>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
     [6] "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
    [11] "PaidSearch>OrganicSearch" "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"

    [[2]]
    [1] "Referral>Referral" "Referral>Referral"

These are all possible 2-element sequences in my data (splits in order).

I now want all possible sequences of 3 elements.

e.g.

"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"

I tried to use:

    unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))

But it returns all combinations, including those whose elements do not directly follow each other.

I also don't want it to return combinations of the last value in row 1 with the first value in row 2, etc.

Let's start with creating some data:

    set.seed(1)
    data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), replace = TRUE))
    data <- sapply(data, paste, collapse = ">")
    data
    #> [1] "B>B>C>A"           "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"

Given the problem, it makes sense to think of these data as a list of vectors that we get after splitting the elements by the delimiter `>`:

    strsplit(data, ">")
    #> [[1]]
    #> [1] "B" "B" "C" "A"
    #>
    #> [[2]]
    #> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
    #>
    #> [[3]]
    #> [1] "C" "C" "B" "C" "C" "A"

Now, the core of the problem is to find all consecutive sequences of a given length from a single vector. Once we can do that, it's simple to apply over the list of data that we have; transforming back to the delimited format will also be simple.

With that goal in mind, we can then make a function for extracting the sequences; here we just loop over each element and extract all sequences of the given length to a list:

    seqs <- function(x, length = 2) {
      if (length(x) < length)
        return(NULL)
      k <- length - 1
      lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
    }

We can now just apply the function across the data, after splitting the delimited strings into vectors, to get the result. We also need an additional `sapply` with `paste` to transform the data back into the delimited format that we started with:

    lapply(strsplit(data, ">"), function(x) {
      sapply(seqs(x, 3), paste, collapse = ">")
    })
    #> [[1]]
    #> [1] "B>B>C" "B>C>A"
    #>
    #> [[2]]
    #> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
    #>
    #> [[3]]
    #> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Further, to get sequences of multiple lengths at the same time, we can add another layer of iteration:

    lapply(strsplit(data, ">"), function(x) {
      unlist(sapply(c(2, 3), function(n) {
        sapply(seqs(x, n), paste, collapse = ">")
      }))
    })
    #> [[1]]
    #> [1] "B>B"   "B>C"   "C>A"   "B>B>C" "B>C>A"
    #>
    #> [[2]]
    #>  [1] "C>B"   "B>B"   "B>A"   "A>A"   "A>A"   "A>C"   "C>B"   "B>C"
    #>  [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
    #>
    #> [[3]]
    #> [1] "C>C"   "C>B"   "B>C"   "C>C"   "C>A"   "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Created on 2018-05-21 by the reprex package (v0.2.0).
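Since the question asks for counts, one way to go from the extracted sequences to frequencies is to flatten the per-row results and tabulate them. This is only a minimal sketch: it hard-codes the three example strings printed above instead of re-sampling them, repeats the `seqs` helper so that it runs on its own, and uses `ngrams` and `counts` as illustrative names.

```r
# The three example strings from above, hard-coded so the result does not
# depend on the random number generator.
data <- c("B>B>C>A", "C>B>B>A>A>A>C>B>C", "C>C>B>C>C>A")

# Same helper as above: all consecutive runs of a given length in one vector.
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

# Extract the 3-element sequences row by row, so nothing spans two rows.
ngrams <- lapply(strsplit(data, ">", fixed = TRUE), function(x) {
  vapply(seqs(x, 3), paste, character(1), collapse = ">")
})

# Flatten and count; the most frequent sequences come first.
counts <- sort(table(unlist(ngrams)), decreasing = TRUE)
counts
# "C>B>C" occurs twice (once in row 2, once in row 3); every other
# sequence occurs once, for 13 sequences in total.
```

`vapply` is used instead of `sapply` here so that rows shorter than the requested length contribute an empty character vector rather than an empty list.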


Using the `stringr` package (or regex in general):

    library(stringr)
    str_extract_all(p, "(\\w+)>(\\w+)>(\\w+)")

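The call above returns only non-overlapping matches. Here is a version that also returns overlapping matches (the code could probably be simplified):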

    str_extract_all_overlap <- function(x) {
      extractions <- character()
      x_curr <- x
      extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
      i <- 1
      while (!is.na(extr)) {
        extractions[i] <- extr
        # drop the first word so the next match starts one element later
        x_curr <- str_replace(x_curr, "\\w+", replacement = "")
        extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
        i <- i + 1
      }
      return(extractions)
    }

    lapply(p, str_extract_all_overlap)
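One possible simplification, sketched here in base R rather than `stringr`, is to put the pattern inside a lookahead so that each match consumes no characters and the next search can start at the following word. The function name `overlap3` and the shortened example vector `p` are made up for illustration, and the pattern assumes that the elements consist only of word characters (`\w`).

```r
# Overlapping 3-element sequences via a zero-width lookahead (base R, PCRE).
# \\b anchors each attempt at the start of a word; the capture group inside
# the lookahead records the text without consuming it.
overlap3 <- function(x) {
  m <- gregexpr("\\b(?=(\\w+>\\w+>\\w+))", x, perl = TRUE)[[1]]
  if (m[1] == -1)
    return(character(0))                 # fewer than 3 elements: no match
  start <- attr(m, "capture.start")[, 1]
  len   <- attr(m, "capture.length")[, 1]
  substring(x, start, start + len - 1)
}

# Shortened, hypothetical input in the same shape as the question's data.
p <- c("Social>PaidSearch>PaidSearch>OrganicSearch", "Referral>Referral>Referral")
lapply(p, overlap3)
```

Because every string is processed separately, no sequence can span two rows.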


You could also adapt the `paste` command in your second `sapply` to:

    paste(head(x, -2), head(tail(x, -1), -1), tail(x, -2), sep = ">")

Your full code should now look like:

    split_fn <- sapply(p, strsplit, split = ">", USE.NAMES = FALSE)
    split_fn <- sapply(split_fn, function(x) paste(head(x, -2), head(tail(x, -1), -1), tail(x, -2), sep = ">"))

The result:

    > split_fn
    [[1]]
     [1] "Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
     [4] "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
     [7] "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
    [10] "PaidSearch>PaidSearch>OrganicSearch" "PaidSearch>OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch>OrganicSearch"

    [[2]]
    [1] "Referral>Referral>Referral"
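The same shift-and-paste idea can be written once for any sequence length instead of spelling out the `head`/`tail` calls by hand. The following is a sketch under that assumption; `ngram_paste` is a hypothetical helper name, and the input `p` is a shortened stand-in for the real data.

```r
# Paste together n shifted copies of x, so that element i of the result is
# x[i] > x[i + 1] > ... > x[i + n - 1].
ngram_paste <- function(x, n, sep = ">") {
  len <- length(x)
  if (len < n)
    return(character(0))                 # too short for even one sequence
  # the j-th shifted copy holds elements j .. (len - n + j)
  shifted <- lapply(seq_len(n), function(j) x[j:(len - n + j)])
  do.call(paste, c(shifted, sep = sep))
}

# Shortened, hypothetical input in the same shape as the question's data.
p <- c("Social>PaidSearch>PaidSearch>OrganicSearch", "Referral>Referral>Referral")
lapply(strsplit(p, ">", fixed = TRUE), ngram_paste, n = 3)
```

For `n = 3` the three shifted copies are exactly `head(x, -2)`, `head(tail(x, -1), -1)` and `tail(x, -2)`; the explicit length check is needed because `j:(len - n + j)` would count backwards for vectors shorter than `n`.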
