Fuzzy matching using T-SQL

I have a table Persons with personal data and so on. There are lots of columns, but the ones of interest here are addressindex, lastname and firstname, where addressindex is a unique address drilled down to the door of the apartment. So if, as in the example below, two persons at the same address have the same lastname and one of their firstnames is the same, they are most likely duplicates.

I need a way to list these duplicates.

tabledata:

personid     1
firstname    "Carl"
lastname     "Anderson"
addressindex 1

personid     2
firstname    "Carl Peter"
lastname     "Anderson"
addressindex 1

I know how to do this if I were to match exactly on all columns, but I need a fuzzy match to do the trick, with (from the above example) a result like:

Row     personid      addressindex     lastname     firstname
1       2             1                Anderson     Carl Peter
2       1             1                Anderson     Carl
.....

Any hints on how to solve this in a good way?

I've found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I've then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0 the match is, the better. Then, query with a threshold of >=0.8 or so to get the most likely matches. Something like this:

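-- For each row, find the id of the most similar other row.
if object_id('tempdb..#similar') is not null drop table #similar
select a.id, (
    select top 1 x.id
    from MyTable x
    where x.id <> a.id
    order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc
) as MostSimilarId
into #similar
from MyTable a

-- Show each row next to its best match, together with the score.
-- To keep only likely matches, filter on the score (for example >= 0.8).
select *, dbo.GetSimilarityScore(a.MyField, c.MyField) as SimilarityScore
from MyTable a
join #similar b on a.id = b.id
join MyTable c on b.MostSimilarId = c.id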

Just don't do it with really large tables. It's a slow process.
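One way to keep the number of comparisons down on the question's data is to compare only rows that already agree on addressindex and lastname. The sketch below is an assumption on my part, not part of the original answer: the table variable and its sample rows are hypothetical, and a plain word-prefix test on firstname stands in for the similarity function.

```sql
-- Hypothetical sample data mirroring the question's Persons table.
DECLARE @Persons TABLE (
    personid     int PRIMARY KEY,
    firstname    nvarchar(100),
    lastname     nvarchar(100),
    addressindex int
);

INSERT INTO @Persons (personid, firstname, lastname, addressindex)
VALUES (1, N'Carl',       N'Anderson', 1),
       (2, N'Carl Peter', N'Anderson', 1),
       (3, N'Maria',      N'Anderson', 1),
       (4, N'Carl',       N'Anderson', 2);

-- Pair up only people at the same door with the same last name, then keep
-- pairs whose first names are equal or where one first name starts with
-- the other followed by a space.
SELECT a.personid, a.addressindex, a.lastname, a.firstname,
       b.personid  AS matched_personid,
       b.firstname AS matched_firstname
FROM @Persons AS a
JOIN @Persons AS b
  ON  b.addressindex = a.addressindex
  AND b.lastname     = a.lastname
  AND b.personid    <> a.personid
WHERE a.firstname = b.firstname
   OR a.firstname LIKE b.firstname + N' %'
   OR b.firstname LIKE a.firstname + N' %'
ORDER BY a.addressindex, a.lastname, a.personid DESC;
```

With this sample data only persons 2 and 1 are returned, each listed against the other. The prefix test could be swapped for dbo.GetSimilarityScore(a.firstname, b.firstname) with a threshold; the join conditions are what keep the expensive comparison to a handful of rows per address. Note that LIKE treats %, _ and [ in a name as wildcards, so names containing those characters would need escaping.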

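Here are the CLR UDFs (they belong inside a class in a SQL CLR project and need Imports for System.Collections.Generic, System.Data.SqlTypes and System.Text):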

''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="string1">The first of the two strings.</param>
''' <param name="string2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
    If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
    Dim s1 As String = string1.Value
    Dim s2 As String = string2.Value

    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d As Integer(,) = New Integer(n, m) {}

    ' Step 1
    If n = 0 Then Return m
    If m = 0 Then Return n

    ' Step 2
    For i As Integer = 0 To n
        d(i, 0) = i
    Next

    For j As Integer = 0 To m
        d(0, j) = j
    Next

    ' Step 3
    For i As Integer = 1 To n
        'Step 4
        For j As Integer = 1 To m
            ' Step 5
            Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)

            ' Step 6
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    ' Step 7
    Return d(n, m)
End Function

''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match.  1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
    If string1.IsNull OrElse string2.IsNull Then Return SqlDouble.Null

    Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
    Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
    If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too

    Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)

    Dim letterS1 As String = GetLetterSimilarityString(s1)
    Dim letterS2 As String = GetLetterSimilarityString(s2)
    Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)

    'Dim wordS1 As String = GetWordSimilarityString(s1)
    'Dim wordS2 As String = GetWordSimilarityString(s2)
    'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)

    If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F
    If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F

    ' Return weighted result
    Return (flatLevScore * 0.2F) + (letterScore * 0.8F)
End Function

Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
    Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
    Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
    If maxLen = 0 Then Return 1.0F
    Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function

''' <summary>
''' Sorts all the alpha numeric characters in the string in alphabetical order
''' and removes everything else.
''' </summary>
Private Shared Function GetLetterSimilarityString(s1 As String) As String
    Dim allChars = If(s1, "").ToUpper().ToCharArray()
    Array.Sort(allChars)
    Dim result As New StringBuilder()
    For Each ch As Char In allChars
        If Char.IsLetterOrDigit(ch) Then
            result.Append(ch)
        End If
    Next
    Return result.ToString()
End Function

''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetWordSimilarityString(s1 As String) As String
    Dim words As New List(Of String)()
    Dim curWord As StringBuilder = Nothing
    For Each ch As Char In If(s1, "").ToUpper()
        If Char.IsLetterOrDigit(ch) Then
            If curWord Is Nothing Then
                curWord = New StringBuilder()
            End If
            curWord.Append(ch)
        Else
            If curWord IsNot Nothing Then
                words.Add(curWord.ToString())
                curWord = Nothing
            End If
        End If
    Next
    If curWord IsNot Nothing Then
        words.Add(curWord.ToString())
    End If

    words.Sort(StringComparer.OrdinalIgnoreCase)
    Return String.Join(" ", words.ToArray())
End Function
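
As a rough check of how GetSimilarityScore behaves on the question's own pair, the score for 'Carl' versus 'Carl Peter' can be worked through by hand. The two distances used below are my own calculation from the code above, not output from the author's assembly, and the T-SQL only repeats the arithmetic.

```sql
-- Hand-computed inputs (assumptions worked out from the VB code above):
--   'CARL' vs 'CARL PETER' -> Levenshtein distance 6, longer length 10
--   'ACLR' vs 'ACEELPRRT'  -> Levenshtein distance 5, longer length 9
--   (the second pair is the sorted-letter form of each name)
DECLARE @flatLevScore float = 1.0 - 6 / 10.0;   -- 0.4
DECLARE @letterScore  float = 1.0 - 5 / 9.0;    -- about 0.444

-- Same 20/80 weighting as GetSimilarityScore.
SELECT @flatLevScore * 0.2 + @letterScore * 0.8 AS WeightedScore;  -- about 0.436
```

At roughly 0.44 the pair falls well short of a 0.8 threshold, so for names where one value simply carries an extra given name, a lower cutoff or a per-word comparison may be needed.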

In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm which is generally considered to be better than SOUNDEX.

Tim Pfeiffer details an implementation in SQL in his article "Double Metaphone Sounds Great: Convert the C++ Double Metaphone algorithm to T-SQL" (originally in SQL Mag and then in SQL Server Pro).

That will assist in matching names with slight misspellings, e.g., Carl vs. Karl.

Update: The actual downloadable code seems to be gone, but here's an implementation found in a GitHub repo that appears to have cloned the original code.
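
For comparison, the Carl vs. Karl case can be tried against SQL Server's built-in SOUNDEX and DIFFERENCE functions. This is only a small illustration of the built-ins, not the Double Metaphone code from the article; the column aliases are my own.

```sql
-- SOUNDEX keeps the first letter of the input, so the two spellings get
-- different codes even though they sound the same.
SELECT SOUNDEX('Carl') AS soundex_carl,   -- C640
       SOUNDEX('Karl') AS soundex_karl,   -- K640
       -- DIFFERENCE compares the two SOUNDEX codes and returns an
       -- integer from 0 (least similar) to 4 (most similar).
       DIFFERENCE('Carl', 'Karl') AS difference_score;
```

An equality test on the SOUNDEX codes would miss this pair because of the leading letter, which is the kind of case a phonetic algorithm such as Double Metaphone is meant to handle better.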

I would use SQL Server Full Text Indexing, which will allow you to do searches and return things that not only contain the word but also may have a misspelling.

Since the first release of Master Data Services, you've got access to more advanced fuzzy logic algorithms than what SOUNDEX implements. So provided that you've got MDS installed, you'll be able to find a function called Similarity() in the mdq schema (MDS database).

More info on how it works: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

I personally use a CLR implementation of the Jaro-Winkler algorithm, which seems to work pretty well. It struggles a bit with strings longer than about 15 characters and doesn't like matching email addresses, but otherwise it is quite good. A full implementation guide can be found here

If you are unable to use CLR functions for whatever reason, maybe you could try running the data through an SSIS package (using the Fuzzy Lookup transformation) - detailed here

Comments
  • BTW, it is quite likely it is NOT the same person in the case given. Fathers and sons do live together at times, you know.
  • This is always the problem with semi-clever address evaluation algorithms. You can make an assumption, but you can never be sure.
  • Good point, although the decision is another issue, based on the result of the fuzzy match.
  • @justSteve: I added a new answer proposing a more up-to-date possibility.
  • Not having access to MDS and thanking that I'm not working with Big Data - this looks like a great fit. Highly appreciate the details.
  • @Lèsemajesté Link fixed.
  • That link is now dead... again
  • Thanks, I have considered it, but we use Standard edition and full text search isn't an option here.
  • Full Text Search is available in all editions of SQL Server 2005 and 2008
  • SQL Server Fulltext indexing does NOT support fuzzy search. When the searched text contains misspellings, it will fail!
  • Full Text Indexing - via the CONTAINS() function - supports fuzzy in the sense of 'multiple terms in the vicinity of one another'. It does not support fuzzy in the sense of 'slightly different spellings'.