Fuzzy string matching of a list of character vectors to a character vector


I have a list of character vectors and a single character vector. I would like to fuzzy match each element of the list (itself a character vector) against each element of the character vector (a single string) in R, and return the maximum similarity score for each combination. Below is a toy example:

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a,b,c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")

Now, I would like to fuzzy match each string in the first vector of mylist (a) against the first string in charvec and return the maximum of the 7 similarity scores. Likewise, I want to obtain the maximum score for every other combination of a vector in mylist and a string in charvec.

My attempt so far:

Convert the strings in charvec to the column names of an empty data frame

df <- setNames(data.frame(matrix(ncol = 10, nrow = 3)), c(charvec))

Calculate the maximum similarity score for each combination using the jarowinkler function (Jaro-Winkler similarity) from the RecordLinkage package (or a better measure for matching phrases, if there is one!)

for (j in seq_along(mylist)) {
  for (i in length(ncol(df))) {
    df[[i,j]] <- max(jarowinkler(names(df)[i], mylist[[j]]))
  }
}

But unfortunately, I get only 3 scores in the first row with the rest of the values as NA.

Any help on this would be highly appreciated.
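
The loop in the question runs its inner body only once per list element: ncol(df) is a single number, so length(ncol(df)) is 1 and i never goes past 1. In addition, df[[i,j]] indexes row i and column j, while the data frame has one row per list element and one column per string in charvec. A minimal corrected sketch, assuming the RecordLinkage package is installed and keeping the question's 3 x 10 layout:

```r
library(RecordLinkage)  # provides jarowinkler()

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a, b, c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# One row per vector in mylist, one column per string in charvec
df <- setNames(
  data.frame(matrix(NA_real_, nrow = length(mylist), ncol = length(charvec))),
  charvec
)

for (j in seq_along(mylist)) {      # rows: list elements
  for (i in seq_along(charvec)) {   # columns: strings in charvec
    # best similarity between charvec[i] and any string in mylist[[j]]
    df[j, i] <- max(jarowinkler(charvec[i], mylist[[j]]))
  }
}
```

After the loop, every cell of df should hold a score rather than NA.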


Using the purrr package:

mylist <- setNames(mylist, c('a', 'b', 'c'))

library(purrr)
library(RecordLinkage)  # provides jarowinkler()

map_dfr(charvec,
    function(wrd, vec_list){
      setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
               names(vec_list)
      )

    },
    mylist)

# A tibble: 10 x 3
       a     b     c
   <dbl> <dbl> <dbl>
 1 0.911 0.580 0.603
 2 0.85  0.713 0.603
 3 0.842 0.557 0.515
 4 0.657 0.490 0.409
 5 0.912 0.489 0.659
 6 0.538 0.546 0.801
 7 0.716 0.547 0.740
 8 0.591 0.524 0.856
 9 0.675 0.509 0.821
10 0.619 0.587 0.630

If you'd like it wide:

map_dfc(charvec,
         function(wrd, vec_list) {
          set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
                    wrd)
         },
        mylist
)

# A tibble: 3 x 10
  `brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
        <dbl>      <dbl>         <dbl>         <dbl>            <dbl>         <dbl>          <dbl>          <dbl>
1       0.911      0.85          0.842         0.657            0.912         0.538          0.716          0.591
2       0.580      0.713         0.557         0.490            0.489         0.546          0.547          0.524
3       0.603      0.603         0.515         0.409            0.659         0.801          0.740          0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>



First, a helper function that returns the best match for a word, given a character vector to check against. I'm using the purrr package for its mapping functions, as I prefer them to loops.

library(purrr)
library(magrittr)
library(RecordLinkage)
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")

getBestMatch <- function(word, vector){
  # score `word` against every candidate in `vector` (not the global charvec)
  purrr::map_dbl(vector, ~RecordLinkage::jarowinkler(word, .x)) %>%
    magrittr::set_names(vector) %>%
    which.max %>%
    names
}

Running the function produces the following output:

> getBestMatch("brown fox", charvec)
[1] "brown dog"

Now that we have a helper function it's just a matter of calling it over elements of the vector.

> map_chr(a, ~ getBestMatch(.x, charvec))
[1] "brown dog"        "lazy cat"         "white dress"      "I know that"     
[5] "I know that"      "new building"     "excuse me please"
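
The helper above returns only the name of the best match. If the similarity score is wanted as well, a variant along the following lines might work. This is a sketch in base R plus RecordLinkage; best_match_with_score is a hypothetical name that does not appear in the answer above.

```r
library(RecordLinkage)  # provides jarowinkler()

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Hypothetical helper: best candidate for `word` together with its score
best_match_with_score <- function(word, candidates) {
  scores <- jarowinkler(word, candidates)  # one score per candidate
  best <- which.max(scores)                # index of the highest score
  data.frame(word = word,
             match = candidates[best],
             score = scores[best],
             stringsAsFactors = FALSE)
}

# One row per element of `a`
result <- do.call(rbind, lapply(a, best_match_with_score, candidates = charvec))
```

The same call can be repeated for the other vectors in mylist, for example with lapply(mylist, ...).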



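library(stringdist)

# df$text is not defined in the question; judging by the row names in the
# output below, it holds every string in mylist
df <- data.frame(text = unlist(mylist), stringsAsFactors = FALSE)

dist <- stringdistmatrix(df$text, charvec, method = "lcs")
row.names(dist) <- as.character(df$text)
colnames(dist) <- charvec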

I used lcs in this example: the longest common substring distance.

I encourage you to check out the other methods; see ?"stringdist-metrics".

The lower the distance, the better the match:

> dist
#               brown dog lazy cat white dress I know that excuse me please tall person new building good example green with envy zebra crossing
# brown fox             4       15          16          14               23          14           17           15              18             15
# lazy dog              9        6          15          15               20          13           14           18              21             14
# white cat            14        9           8          12               19          16           17           17              16             17
# I don't know         13       16          19          11               24          17           18           20              19             20
# sunset               13       12          13          13               16          13           14           16              17             16
# never mind           13       16          15          17               18          15           12           18              15             14
# excuse me            16       15          14          18                7          16           17           13              16             17
# very late            14        9          14          14               15          16           15           15              16             17
# do not cross         13       16          13          15               22          15           20           18              21             14
# sunrise              14       15          14          16               17          14           15           17              16             17
# long vacation        14       11          22          16               25          16           17           19              20             19
# toy example          16       13          16          16               15          14           19            5              20             21
# green apple          14       15          16          16               15          16           17           11              12             21
# tall building        16       17          18          20               25          12            7           21              22             17
# good rating          14       13          18          14               23          16           15           11              18             15
# accommodating        16       13          22          18               23          18           17           17              24             15
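
The matrix above has one row per individual string, whereas the question asks for one score per vector in mylist. One possible way to collapse it, sketched here under the assumption that the smallest lcs distance within each vector is the value of interest, is to compute the distance matrix per list element and take the column-wise minimum:

```r
library(stringdist)

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a = a, b = b, c = c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# For each vector in mylist: all pairwise distances to charvec, then the
# smallest distance per charvec entry (smaller distance = better match)
best_dist <- sapply(mylist, function(v) {
  d <- stringdistmatrix(v, charvec, method = "lcs")  # length(v) x length(charvec)
  apply(d, 2, min)
})
rownames(best_dist) <- charvec
```

best_dist then has one row per string in charvec and one column per vector in mylist. With a similarity measure instead of a distance, the column-wise maximum would take the place of the minimum.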
