R: Count all combinations in a list of strings (Specific Order)

I am trying to count all sequences in a large list of character strings delimited by ">", but only the combinations whose elements are directly next to each other.

e.g. given the character vector:

[1]Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
[2]Referral>Referral>Referral

I can run the following lines to retrieve all adjacent combinations of 2 elements:

split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )

Returns:

[[1]]

 [1] "Social>PaidSearch"           "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
 [6] "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
[11] "PaidSearch>OrganicSearch"    "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"

[[2]]

[1] "Referral>Referral" "Referral>Referral"

These are all the possible 2-element sequences in my data (split in order).

I now want all possible sequences of 3 elements.

e.g.

"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"

I tried to use

unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))

But it returns all combinations including those that aren't directly following.

I also don't want it to return combinations of the last value in row one with the first value in row 2 etc.


Let's start with creating some data:

set.seed(1)

data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), replace = TRUE))
data <- sapply(data, paste, collapse = ">")

data
#> [1] "B>B>C>A"           "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"

Given the problem, it makes sense to think of these data as a list of vectors that we get after splitting the elements by the delimiter >:

strsplit(data, ">")
#> [[1]]
#> [1] "B" "B" "C" "A"
#> 
#> [[2]]
#> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
#> 
#> [[3]]
#> [1] "C" "C" "B" "C" "C" "A"

Now, the core of the problem is to find all consecutive sequences of a given length from a single vector. Once we can do that, it's simple to apply over the list of data that we have; transforming back to the delimited format will also be simple.

With that goal in mind, we can then make a function for extracting the sequences; here we just loop over each element and extract all sequences of the given length to a list:

seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}
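As a quick check of what seqs returns for a single vector, here is a minimal sketch; the input vectors are made up for illustration, and the function is repeated so the snippet runs on its own:

```r
# Same helper as above, repeated so this snippet is self-contained
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

# Illustrative input: every run of 3 adjacent elements is returned as its own vector
res <- seqs(c("B", "B", "C", "A"), 3)

# A vector shorter than the requested length yields NULL
short <- seqs(c("A", "B"), 3)
```

Returning NULL for short vectors means the later sapply/paste step simply produces nothing for those rows.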

We can now apply the function across the data after splitting the delimited strings into vectors. We also need an additional sapply with paste to transform the result back into the delimited format that we started with:

lapply(strsplit(data, ">"), function(x) {
  sapply(seqs(x, 3), paste, collapse = ">")
})
#> [[1]]
#> [1] "B>B>C" "B>C>A"
#> 
#> [[2]]
#> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Further, to get sequences of multiple lengths at the same time, we can add another layer of iteration:

lapply(strsplit(data, ">"), function(x) {
  unlist(sapply(c(2, 3), function(n) {
    sapply(seqs(x, n), paste, collapse = ">")
  }))
})
#> [[1]]
#> [1] "B>B"   "B>C"   "C>A"   "B>B>C" "B>C>A"
#> 
#> [[2]]
#>  [1] "C>B"   "B>B"   "B>A"   "A>A"   "A>A"   "A>C"   "C>B"   "B>C"  
#>  [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C"   "C>B"   "B>C"   "C>C"   "C>A"   "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Created on 2018-05-21 by the reprex package (v0.2.0).
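Since the question asks for counts, one possible way to get them is to flatten the per-row results and tabulate them. This is only a sketch: it assumes the seqs helper from above (repeated here), hard-codes the example data, and uses table(), which is my choice rather than something from the question:

```r
# Example data from above, hard-coded so the snippet is self-contained
data <- c("B>B>C>A", "C>B>B>A>A>A>C>B>C", "C>C>B>C>C>A")

seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

# Build the 3-element sequences per row, so nothing spans two rows
triples <- unlist(lapply(strsplit(data, ">", fixed = TRUE), function(x) {
  vapply(seqs(x, 3), paste, character(1), collapse = ">")
}))

# Tabulate how often each sequence occurs across all rows
counts <- sort(table(triples), decreasing = TRUE)
```

Use as.data.frame(counts) if a data frame of sequences and frequencies is easier to work with than a table.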



Using the stringr package (or regex in general).

library(stringr)
str_extract_all(p, "(\\w+)>(\\w+)>(\\w+)")

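str_extract_all only returns non-overlapping matches, so the call above skips sequences that share elements. The following version handles overlap, although the code could be simplified.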

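# Extract every (overlapping) run of three ">"-separated words from one string
str_extract_all_overlap <- function(x) {
  extractions <- character()
  x_curr <- x
  # First match in the full string
  extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
  i <- 1
  while (!is.na(extr)) {
    extractions[i] <- extr
    # Drop the first remaining word so the next match starts one element later
    x_curr <- str_replace(x_curr, "\\w+", replacement = "")
    extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
    i <- i + 1
  }
  return(extractions)
}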

lapply(p, str_extract_all_overlap)
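A possible simplification, sketched in base R rather than stringr: a zero-width lookahead with a capture group lets gregexpr find a match at every word start, so overlapping sequences are picked up in one pass. The function name extract_overlap_base and the sample strings are hypothetical, and this is a different technique from the loop above, not a rewrite of it:

```r
# Hypothetical helper: overlapping 3-element sequences via a lookahead (base R only)
extract_overlap_base <- function(x, pattern = "\\b\\w+>\\w+>\\w+") {
  # Wrap the pattern in (?=( ... )) so each match is zero-width and only captured
  m <- gregexpr(paste0("(?=(", pattern, "))"), x, perl = TRUE)[[1]]
  if (m[1] == -1)
    return(character())  # no sequence of that length in this string
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  substring(x, starts, starts + lens - 1)
}

# Made-up input strings for illustration
found <- extract_overlap_base("Social>PaidSearch>PaidSearch>OrganicSearch")
none <- extract_overlap_base("Referral>Referral")
```

The \\b anchor keeps matches from starting in the middle of a word. Like the stringr version, this handles one string at a time, so apply it with lapply over the vector.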



You could also adapt the paste-command in your second sapply to:

paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2) , sep = ">")

Your full code should now look like:

split_fn <- sapply(p , strsplit , split = ">", USE.NAMES = FALSE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2), sep = ">") )

The result:

> split_fn
[[1]]
 [1] "Social>PaidSearch>PaidSearch"              "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [4] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [7] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
[10] "PaidSearch>PaidSearch>OrganicSearch"       "PaidSearch>OrganicSearch>OrganicSearch"    "OrganicSearch>OrganicSearch>OrganicSearch"

[[2]]
[1] "Referral>Referral>Referral"
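The same shifting idea can be generalised to any sequence length instead of writing out one head/tail call per position. A minimal sketch, where the function name ngrams and its arguments are my own and not part of the code above:

```r
# Hypothetical helper: all runs of n adjacent elements, pasted with ">"
ngrams <- function(x, n) {
  if (length(x) < n)
    return(character())
  # Column j holds the j-th element of every run: x[j], x[j + 1], ...
  cols <- lapply(seq_len(n), function(j) x[j:(length(x) - n + j)])
  # Paste the shifted columns together element-wise
  do.call(paste, c(cols, sep = ">"))
}

# Illustrative inputs
three <- ngrams(c("Referral", "Referral", "Referral"), 3)
pairs <- ngrams(c("a", "b", "c", "d"), 2)
empty <- ngrams("a", 3)
```

With the split data from above, this could be applied as lapply(split_fn, ngrams, n = 3).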
