Last week I spent a bit of time working on a of Fuzzy Logic type problem. This was for a VB app, rather than anything to do with Excel, but I wanted to test the different methods quickly and in a flexibly way. Excel was obviously ideal for doing this.
I have complied about 5 methods into a workbook (very, very rough stuff mind), none of the code is my own work, it’s all just gathered from around the net, I thought it might be useful for other folks (and myself again at some point!) to have it all in one work book let me know if you tied it up. The algorithms are:
Levenshtein distance
SoundEx
SoundEx2
SoundEx – Daitch-Mokotoff
Metaphone
DoubleMetaphone
So that’s 6 ways then not 5 as I may have lead you to believe. The fastest by far is standard SoundEx (it should be, it much simpler than the rest!), the others have various strengths and weakness.
If you have any more, let me know!!!!
Fuzzy tests.xls
I’m not in a position to test this out yet, but I look forward to it. I can’t say I’ve much luck with the soundex type alogrithms in the past, though. I’ve spent lots of time trying to match lists where one name is “rosenstein,jonathan” and the other is “rossenstein, jon dr”. I’ve never seen much of a breakthrough in this area. It’s pretty frustrating. You’d think with all the work put into things such as cataloging tools that there’d be something brilliant by now.
By the way, congrats on being blocked by Websense. You’ve made the exclusive club!
I’ve always found SoundEx hard to understand. This page has some interesting links:
http://www.dailydoseofexcel.com/archives/2004/06/16/fuzzy-text-match/
I also have a UDF somewhere (can’t locate it at the moment) that returns a percentage indicating how close two words are to each other.
–JP
OK here it is:
[VBA]
Function Str_Comp(st1 As String, st2 As String) As Double
‘
‘ returns a number showing % comparison between two names
‘
‘ i.e. =Str_Comp(A1, B1)
‘
‘ Format cell as Percentage to make it look pretty!!
‘
‘
Dim MtchTbl(100, 100)
Dim MyMax As Double, ThisMax As Double
Dim i As Integer, j As Integer, ii As Integer, jj As Integer
With WorksheetFunction
st1$ = Trim$(.Proper(st1$))
st2$ = Trim$(.Proper(st2$))
End With
MyMax# = 0
For i% = Len(st1$) To 1 Step -1
For j% = Len(st2$) To 1 Step -1
If Mid$(st1$, i%, 1) = Mid$(st2$, j%, 1) Then
ThisMax# = 0
For ii% = (i% + 1) To Len(st1$)
For jj% = (j% + 1) To Len(st2$)
If MtchTbl(ii%, jj%) > ThisMax# Then
ThisMax# = MtchTbl(ii%, jj%)
End If
Next jj%
Next ii%
MtchTbl(i%, j%) = ThisMax# + 1
If (ThisMax# + 1) > ThisMax# Then
MyMax# = ThisMax# + 1
End If
End If
Next j%
Next i%
Str_Comp = MyMax# / ((Len(st1$) + Len(st2$)) / 2)
End Function
[/VBA]
And here’s another link to fuzzy lookup code:
http://hairyears.livejournal.com/115867.html
Thanks JP,
Can see you’ve done a bit of work in this area too! I’ve found a few more methods flotting around, so i will update those in to that SS some, I think unless you’re an expert with these things then it’s all a bit trial and error!
Just FTR, I didn’t write that UDF. It’s extremely useful, if you happen to need the functions.
Thx,
JP
Oh and here’s another one:
http://chandoo.org/wp/2008/09/25/handling-spelling-mistakes-in-excel-fuzzy-search/
Fuzzy Lookups and Fuzzy Matching in Excel:
I’m late to the game in posting this, but a few more links will help anyone arriving via Google when they’re looking for a Fuzzy-Match algorithm for Excel.
I wrote and implemented a fuzzy-logic VLookup function on a long flight a few years ago. The details are here:
http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html
You are correct in placing the Levenshtein ‘edit distance’ algorithm at the top of the list: it is the gold standard for measuring similarity in passages of text… It is also somewhat ‘gold-plated’ in terms of weight and expense! A ‘sum-of-common-strings’ approach is perfectly adequate and well within the reach of VBA; So, after some work with modified Levenshtein algorithms – which one of your correspondents has linked to in a previous comment – I published a set of Match and Lookup functions using the Sum-of-common-strings approach.
There’s a discussion of the theoretical constraints in the previous post: please feel free to add the lessons you’ve learned the different approaches you came up with. I never got anything useful out of SoundEx for the complex lookups – addresses, especially – that we get in real-life applications.
Often an excellent submit after i see this web site along with other websites you own. Understand your current skills.