What is fuzzy matching? – TechNative

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching, is an artificial intelligence and machine learning technology that identifies similar, but not identical, elements in sets of data tables.

FM uses an algorithm to navigate between absolute rules and find duplicate strings, words, or entries that do not immediately share the same characteristics. Where typical search logic works on a binary pattern (i.e., 0/1, yes/no, true/false), fuzzy string matching instead finds strings, entries, and/or text in datasets that fall between these definitive parameters, and navigates intermediate degrees of truth.

Fuzzy string matching helps find approximate matches even when certain words are misspelled, abbreviated, or omitted, a feature widely used in search engines. Ultimately, approximate string matching provides a match score, and because it is used to identify words, phrases, and strings that do not match perfectly, that score will be below 100%.
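To make the idea of a match score concrete, here is a minimal sketch using Python's standard-library difflib module. The helper name match_score and the sample strings are illustrative assumptions, and difflib's ratio is only one of several possible scoring methods, not the specific score any particular FM product computes.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity score from 0 to 100, ignoring letter case.

    Hypothetical helper for illustration; it wraps difflib's ratio(),
    which is 2 * (matched characters) / (total characters in both strings).
    """
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# An exact match scores 100; a near match scores high but below 100.
print(match_score("passport", "passport"))
print(round(match_score("passport", "pasport"), 1))
```

A score like this only becomes a decision once you pick a cutoff, which is where the trade-off between missed matches and false positives comes in.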

How does fuzzy matching work?

Landing on the right fuzzy matching algorithm is important for determining the similarity between one string and another. In one case, you may have a single transposition separating “trial” and “trail”, or you may search for “passport” when the existing string says “pasport”, a typo. Of course, not every case of fuzzy logic will be a matter of single-character distance. “Martin Luther Junior” is quite similar to “Martin Luther King, Jr.” Distances vary, and there are various fuzzy name matching algorithms to help fill these gaps.

Performing a fuzzy logic search with loosely defined rules for string matching has some drawbacks. Using a weak system increases the risk of false positives. To keep these false positives to a bare minimum, or ideally eliminate them, your fuzzy string matching system needs to be comprehensive. It must account for misspellings, abbreviations, name variations, regional spellings of certain names, shortened nicknames, acronyms, and many other variables.

Fuzzy name matching algorithms

While there are many string matching algorithms to choose from when reconciling datasets, there is no one-size-fits-all solution for all use cases. Here are some of the most reliable and often used string matching techniques in data science for finding approximate matches.

Levenshtein distance

Levenshtein distance (LD) is a fuzzy matching technique that measures the difference between two strings, with the resulting number representing how far the two strings are from an exact match. The higher the Levenshtein edit distance, the further the two terms are from being identical.

For example, if you measure the distance between “Cristian” and “Christian”, you get a distance of 1, since you are one “h” away from an exact match. The term “Levenshtein distance” is often used interchangeably with “edit distance”.

Levenshtein edit distance examples

  1. Power -> Powder (Insert “d”) – Distance: 1
  2. Lovin -> Loving (Insert “g”) – Distance: 1
  3. Purpose -> Porpoise (Substitute “o” for “u”, Insert “i”) – Distance: 2
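The distances above can be checked with a short sketch of the classic dynamic-programming approach. The function name levenshtein is an illustrative choice, and this is a minimal version written for clarity rather than speed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Cristian", "Christian"))  # 1
print(levenshtein("Power", "Powder"))        # 1
print(levenshtein("Lovin", "Loving"))        # 1
```

Keeping only two rows of the table at a time holds memory proportional to the length of the second string, which matters when comparing many long strings.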

Hamming distance

Named after American mathematician Richard Hamming, Hamming distance (HD) is quite similar to Levenshtein distance, except that it is mainly used in signal processing, whereas Levenshtein distance is more often used to calculate the distance between text strings. As described here, the algorithm uses the American Standard Code for Information Interchange (ASCII) table to determine the binary code assigned to each letter in each string, then counts the bit positions that differ to calculate the distance score. The two strings must be of equal length.

Hamming distance examples

Take the text strings “Corn” and “Cork”. If you are trying to find the HD between these two, your answer would be a distance of 2, not 1, as you would get with Levenshtein’s algorithm. To get this score, you have to look at the binary assignment of each letter, one by one. Since the ASCII table assigns the code 01101110 to “n” and 01101011 to “k”, you will notice that the two codes differ in two bit positions, giving an HD of 2.
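The bit-by-bit comparison can be sketched in a few lines of Python. The function names are illustrative assumptions; the second function shows the more common character-level Hamming distance for contrast, since it counts differing positions rather than differing bits.

```python
def bitwise_hamming(a: str, b: str) -> int:
    """Count differing bits between the ASCII encodings of two strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    # encode("ascii") raises UnicodeEncodeError for non-ASCII input.
    bytes_a, bytes_b = a.encode("ascii"), b.encode("ascii")
    # XOR leaves a 1 in every bit position where the two bytes differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(bytes_a, bytes_b))


def char_hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(bitwise_hamming("Corn", "Cork"))  # 2: "n" and "k" differ in two bits
print(char_hamming("Corn", "Cork"))     # 1: only the last character differs
```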

Damerau-Levenshtein

This LD variant also finds the minimum number of operations needed to make two strings a direct match, using single-character distance operations like insertion, deletion, and substitution, but Damerau-Levenshtein goes one step further by incorporating a fourth possible operation – the transposition of two characters to find an approximate match.

Damerau–Levenshtein example

String 1: Micheal

String 2: Michaela

Operation 1: transposition: swap the characters “a” and “e”

Operation 2: insert “a” (end of string 2)

Distance = 2

Each operation has a count of “1”, so each insertion, deletion, transposition, etc. is weighted equally.
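As a rough sketch, the transposition step can be added to the usual edit-distance table. Note the assumption: this is the restricted variant often called optimal string alignment, which allows a pair of adjacent characters to be swapped but not edited again afterwards, so it can differ from full Damerau-Levenshtein on some inputs. The function name osa_distance is illustrative, and the first example uses “Micheal”, a spelling with “e” and “a” swapped, as the starting string.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution, and
    transposition of adjacent characters, each costing 1."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Micheal", "Michaela"))  # 2: one swap plus one insertion
print(osa_distance("trial", "trail"))       # 1: one swap
```

Plain Levenshtein distance would count each swap as two substitutions, so it reports larger distances for the transposition typos that are common in typed text.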

Fuzzy Matching Use Cases

The use cases for FM are vast, with many real-world applications, deduplication being one of the most popular among them. Imagine serving the same digital ad to a user who has already reacted negatively to that ad and favorably to another. How would the user experience be affected if a financial institution flagged as fraudulent a transaction that the user repeats every week? It is the use of fuzzy string matching that has enabled deduplication to streamline records in so many of our modern data systems.
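A minimal deduplication sketch, assuming a small in-memory list of names and a similarity cutoff of 0.85, might look like the following. The function names, the sample records, and the threshold are all illustrative assumptions, and the scoring comes from Python's standard-library difflib rather than from any particular FM product.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True if two strings are at least `threshold` similar, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def deduplicate(records: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first record of each group of near-duplicates."""
    kept: list[str] = []
    for record in records:
        # Compare against every record kept so far: O(n^2) in the worst case.
        if not any(is_similar(record, seen, threshold) for seen in kept):
            kept.append(record)
    return kept


names = ["Christian Smith", "Cristian Smith", "christian smith", "Maria Lopez"]
print(deduplicate(names))  # ['Christian Smith', 'Maria Lopez']
```

Comparing every record against every kept record does not scale to large tables; real systems usually narrow the candidates first, for example by grouping on a shared key, before scoring pairs.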

When we launched RediSearch in 2016, one of its main features was an auto-suggest engine with FM. Anyone who has ever surfed the web has seen auto-suggest in action on a search engine. Speaking of search engines, have you ever misspelled a word when searching Google, but still got the results you were looking for? Google will actually serve what it thinks you meant to type as the main query, while also offering an option right below to search for the word(s) exactly as you typed them. In this way, fuzzy matching has shaped how AI/ML improves our most trusted search engines.
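The “did you mean” behavior can be approximated with the standard-library function difflib.get_close_matches. This is only a sketch of the idea and not how RediSearch or Google implement suggestions; the vocabulary list and the suggest helper are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real engine would draw on an index of terms.
VOCABULARY = ["passport", "password", "transport", "airport", "passenger"]


def suggest(query: str, vocabulary: list[str] = VOCABULARY, limit: int = 3) -> list[str]:
    """Return up to `limit` vocabulary terms closest to the query,
    best match first. Terms scoring below the 0.6 cutoff are dropped."""
    return get_close_matches(query.lower(), vocabulary, n=limit, cutoff=0.6)


print(suggest("pasport"))  # "passport" is ranked first
print(suggest("zzzz"))     # [] because nothing is close enough
```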

Benefits of fuzzy matching

Research has found that human error is the source of a significant amount of the duplication that occurs in record keeping and data management. A study in the online research journal Perspectives in Health Information Management found that duplicate medical records are not only common, but also dangerous and costly. The study, led by Beth Haenke Just, MBA, RHIA, FAHIMA, used a multisite dataset of 398,939 patient records and found that the majority of name field mismatches were the result of misspellings (54.14% in first name fields, 33.62% in last name fields, and 58.3% in middle name fields). Human error is often the biggest obstacle to data management and record linkage. FM has become an indispensable tool for joining imprecise datasets in medicine, financial services, social security fraud detection, and more. Ultimately, FM has saved modern enterprises countless hours of labor on the often costly and painstaking work of manual deduplication.

Other benefits of FM include:

Precision: FM is much more granular than deterministic matching, with the ability to find matches in imprecise data that a strict binary comparison would miss

Flexibility: The different fuzzy logic algorithms available make it possible to solve the most complex problems

Easy to build: Implementing fuzzy logic in your system is a simple process

Configurable: It is easy to modify the logic according to your specific needs

Implementing fuzzy matching in different programming languages

Fuzzy matching algorithms can be implemented in various programming languages, such as:

Python – Many choose to use the Fuzzywuzzy Python library when trying to do fuzzy string matching. This library uses the LD algorithm by default

R – Mainly used for statistical calculation and graphics

Java – A little more complicated to implement FM in Java, but not impossible! This GitHub repository houses a Java implementation of this same Fuzzywuzzy library

Excel – Via add-ons such as Fuzzy Lookup, Exis Echo, and even using the VLOOKUP function

Implementations are similar, with all languages comparing sets, matching patterns, and determining statistical distance from a perfect match.

How to Minimize Errors in Fuzzy Search

With FM, reliability is never guaranteed. Sometimes false positives appear that require manual error checking. It’s important to ask: will a few false positives outweigh the benefit of correctly matching exponentially more data? If their impact is negligible, manually checking for errors may not be time well spent. Pairing the right algorithm and programming language with the right use case is the best way to avoid errors when applying fuzzy logic to data matching.
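One common way to limit manual checking is to use two cutoffs instead of one: accept high scores automatically, reject low scores, and send only the middle band to a reviewer. The sketch below assumes difflib's ratio as the score and uses illustrative cutoffs of 0.90 and 0.60; both the thresholds and the classify helper are assumptions that would need tuning against your own data.

```python
from difflib import SequenceMatcher


def classify(a: str, b: str, accept: float = 0.90, review: float = 0.60) -> tuple[str, float]:
    """Label a pair as 'match', 'review', or 'no match' from its score."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= accept:
        return "match", score     # confident enough to link automatically
    if score >= review:
        return "review", score    # ambiguous: route to manual checking
    return "no match", score      # clearly different


for pair in [
    ("passport", "pasport"),
    ("Martin Luther Junior", "Martin Luther King, Jr."),
    ("passport", "umbrella"),
]:
    label, score = classify(*pair)
    print(pair, label, round(score, 2))
```

Raising the accept cutoff trades fewer false positives for a larger review queue, which is the same cost-benefit question posed above.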


About the Author

Eric Silva is Head of SaaS Product Marketing at Redis. Redis accelerates applications by creating a database for a real-time world. It’s the driving force behind open-source Redis, the world’s most popular in-memory database, and the commercial provider of Redis Enterprise, a real-time data platform. Redis Enterprise provides real-time services to over 8,000 organizations worldwide. It’s built on the unparalleled simplicity and speed of open-source Redis, plus an enterprise-grade data platform that delivers the robustness of modern data models, manageability, automation, performance, and resiliency to deploy and run modern applications at any scale from anywhere on the planet.

Featured Image: Adobe Stock

