
Fuzzy String Matching in Python

DARSHAK SHAH MAY 13, 2019




How amazing is it to type in an address and instantly get a list of best-matched suggestions, or to have misspelled words flagged automatically! As a professor, have you ever wished you could examine a research paper and get a similarity percentage showing how much a student has copied from the internet? The logic and mechanism behind fuzzy address matching, spell checkers, and plagiarism detection are as interesting as the applications themselves.


The technique behind computing a similarity percentage between sentences, paragraphs, or words isn't rocket science! Understanding how a few lines of code can find the similarity percentage between given strings will help you tackle the problems mentioned above. The core of the string matching process is the calculation of edit distance (Levenshtein distance). Let's enhance our knowledge about Levenshtein distance!


LEVENSHTEIN DISTANCE


Also known as the edit distance, the Levenshtein Distance (LD) is a measure used in string matching: it counts the minimum number of single-character operations/edits required to change one string into another. The operations involved are:


  1. Addition

  2. Substitution

  3. Deletion


Every operation needed to turn one string into the other adds 1 to the Levenshtein distance. For instance, adding a character counts +1, and so does substituting or deleting one.
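To make the counting concrete, here is a minimal sketch of the classic dynamic-programming way of computing the Levenshtein distance. It is only for illustration; FuzzyWuzzy delegates this calculation to the python-Levenshtein package (or to difflib) rather than using code like this.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (ca != cb)  # free when the characters match
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3: two substitutions and one insertion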


Now let's have a look at the most widely used library for string matching: the FuzzyWuzzy package.


FUZZYWUZZY LIBRARY OVERVIEW


FuzzyWuzzy is a Python library originally developed by SeatGeek. The core method it relies on is the calculation of the Levenshtein distance between two strings.


The library, along with the optional python-Levenshtein package that speeds it up, can be installed using pip:

pip install fuzzywuzzy
pip install python-Levenshtein
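Once installed, the two objects used in every example below can be imported directly from the package:

from fuzzywuzzy import fuzz      # fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, ...
from fuzzywuzzy import process   # process.extract, process.extractOne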

To gain a deeper understanding of how the similarity percentages are calculated, let's look at the different types of fuzz ratios that drive the whole process.


TYPES OF FUZZ RATIOS AND THEIR MECHANISMS:


Ratio (Simple Ratio)

When you have a very simple pair of strings whose words look almost identical, you can use the simple ratio from the FuzzyWuzzy package.

Here is the code to understand the match using the simple ratio:

String1 = "Humpty Dumpty sat on a wall"
String2 = "Humpty Dumpty Sat on a Wall!"
fuzz.ratio(String1, String2)
>>> 91

As seen in the code above, the first string matches the second one at 91%. The differences lie in the capitalization of "Sat" and "Wall" and the extra exclamation mark '!'.


This ratio uses a simple technique: it calculates the edit distance (Levenshtein distance) between the two full strings. As a result, 'fuzz.ratio' registers even minimal differences between the strings, such as missing punctuation, differences in letter case, and misspelled words.

Partial Ratio


Much of the time the simple ratio won't work because it is very rigid in detecting matches. For example, when you don't want to take all the small details into account, such as stop words, punctuation, and capital letters, it's better to use the Partial Ratio.

String1 = "Humpty Dumpty sat on a wall"
String2 = "Humpty"
fuzz.partial_ratio(String1, String2)
>>> 100

As you can see, the only word the two strings have in common is "Humpty", yet the partial ratio reports a 100% match.


We use the Partial Ratio function when dealing with substrings, i.e., when one short string is contained in another, longer string. The mechanism behind this ratio is an 'optimal partial' logic: if the shorter string has length 'm' and the longer string has length 'n', the partial ratio finds the best-matching substring of length 'm' in the longer string and scores the shorter string against it.
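As a rough illustration of that idea (the library's real implementation is cleverer about choosing which windows to try, but the intuition is the same), you can picture sliding the shorter string across the longer one and keeping the best simple-ratio score:

def sliding_partial_ratio(s1, s2):
    """Sketch of the partial-ratio idea: score the shorter string against every
    window of the same length in the longer string and keep the best score."""
    shorter, longer = sorted((s1, s2), key=len)
    m = len(shorter)
    best = 0
    for start in range(len(longer) - m + 1):
        window = longer[start:start + m]
        best = max(best, fuzz.ratio(shorter, window))
    return best

print(sliding_partial_ratio("Humpty", "Humpty Dumpty sat on a wall"))  # 100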


Token Set Ratio


When you don't care how many times a word is repeated within a string, it is better to use the Token Set Ratio from the package.


String1 = "Humpty Dumpty sat on a wall"
String2 = "Humpty Humpty Dumpty sat on a wall"
fuzz.token_set_ratio(String1, String2)
>>> 100

As seen in the example above, the strings differ only in the number of times the word "Humpty" appears. If we don't want repetition to count, the Token Set Ratio gives a 100% match.

The similarity between the given strings is an integer score ranging from 0 to 100. To compute it, the strings are first split into tokens (words), and the tokens are then sorted.


The tokens are split into two kinds of sets: an intersection set (tokens common to both strings) and remainder sets (tokens unique to each string). The tokenizer also removes all punctuation by dropping non-alphanumeric characters, and everything is converted to lowercase. The final score comes from the similarities between these transformed token strings.
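The preprocessing described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the package's own code; the real token_set_ratio then scores combinations of these pieces with the simple ratio and keeps the highest result.

import re

def token_sets(s1, s2):
    """Lowercase, drop non-alphanumeric characters, split into tokens,
    then build the intersection set and the two remainder sets."""
    def tokens(s):
        return set(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())
    t1, t2 = tokens(s1), tokens(s2)
    return sorted(t1 & t2), sorted(t1 - t2), sorted(t2 - t1)

print(token_sets("Humpty Dumpty sat on a wall",
                 "Humpty Humpty Dumpty sat on a wall"))
# (['a', 'dumpty', 'humpty', 'on', 'sat', 'wall'], [], [])
# Both remainder sets are empty, which is why repeating "Humpty" makes no difference.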


Token Sort Ratio


If the order in which the words appear in a sentence doesn't matter, the best way to match two strings is with the Token Sort Ratio from the package.


String1 = "Humpty Dumpty sat on a wall"
String2 = "Dumpty Humpty wall on sat a"
fuzz.token_sort_ratio(String1, String2)
>>> 100

As noticed in the example above, the strings differ only in the arrangement of their words. Hence, the Token Sort Ratio gives a 100% match.



Instead of comparing the strings directly, the Token Sort Ratio also splits them into tokens. The tokens (words) are sorted and then compared using the simple ratio mechanism. An interesting feature of this ratio is that it ignores the order in which the tokens appear in the strings. However, the Token Sort Ratio isn't as flexible as the Token Set Ratio when it comes to repeated words.
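A compact sketch of that preprocessing looks like this (again an illustration of the idea rather than the library's exact code, which also strips punctuation); it sets up the "Fat Cat" example below:

def token_sort_sketch(s1, s2):
    """Lowercase, split into tokens, sort them, rejoin, then apply the simple ratio."""
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return fuzz.ratio(normalise(s1), normalise(s2))

print(token_sort_sketch("Fat Cat", "Cat Fat"))  # 100, both normalise to "cat fat"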


As the sketch shows, the strings "Fat Cat" and "Cat Fat" give a 100% match with the Token Sort Ratio.


Now, what happens when you aren't comparing just two strings but a whole list of them? To make things clearer: when a professor wants to run a plagiarism check, he won't be looking at a single string/sentence. He has a whole set of strings (paragraphs) from a research paper. What happens then?


Now, let's understand the process.extract function, which is used to match a string against a dictionary (a list of strings).


FIND BEST MATCHES IN A LIST OR DICTIONARY OF CHOICES


No extra installation is needed for this: the process module ships with the same FuzzyWuzzy package and just has to be imported, as shown in the library overview above.


Suppose you want to match the string "Mango" against a list of values (or dictionary) you already have. Look at the code below to understand it better:


query = "Mango"
choices = ['mango', 'go', 'an', 'Mango!', 'man', 'mang']
process.extract(query, choices)
>>> [('mango', 100), ('Mango!', 100), ('go', 90), ('an', 90), ('man', 90)]

Looking at the output, you can see that the input string "Mango" gives a 100% match with both 'mango' and 'Mango!'.


But if you are just looking for the single best match from the list, use the following lines of code:

query = "Mango"
choices = ['mango', 'go', 'an', 'Mango!', 'man', 'mang']
process.extractOne(query, choices)
>>> ('mango', 100)

Next, what if you want to match a string against a list using a specific type of ratio? There is a solution for that too. The lines of code below let you choose the scorer used for the similarity percentage and also set a limit on the number of matches returned.

query = "Mango"
choices = ['Pogo', 'orange', 'apple', 'Mango!', 'fruits', 'Tango']
process.extract(query, choices, scorer=fuzz.partial_ratio, limit=2)
>>> [('Mango!', 100), ('Tango', 80)]

CONCLUSION — WHEN TO USE WHAT RATIO?


Although the four fuzz ratios above make our lives easier by fetching similar-looking strings, deciding which ratio to use for a given set of strings can get complicated.


Which type of fuzz ratio to use depends on several factors, such as:


  1. Set of Values given for matching a string

  2. Background of the Data set provided

  3. Usage of the output results

  4. Type of strings being matched



  • From the table above, rows 2 to 9 illustrate the mechanism of the simple ratio well: even slight differences in punctuation, letter case (upper vs. lower), or word order affect it greatly. The Token Set and Token Sort Ratios, by contrast, show a 100% match because they don't compare the strings directly; they compare only the tokens/words. Another noticeable feature is that the Token Set and Token Sort Ratios convert everything to a single (lower) case, so they are not case-sensitive. Rows 6 to 9 show that these ratios also ignore the order of words in a string.

  • Looking at rows 10 to 12, only the Token Set Ratio gives a 100% match. Hence, the Token Set Ratio is the only one that ignores how many times a word is repeated within a string.

  • Rows 13 to 18 are either missing words relative to the input string or differ in the case of their letters. Rows 17 and 18 share only one word with the input string, yet their Partial Ratio match is 100. From this we can infer that the Partial Ratio focuses only on the best-matching substring.

  • Another interesting feature of the data set is that the only rows in which the Token Set Ratio is less than 100 are 17, 19, 20 and 21. This is because the Token Set Ratio is only brought down by misspelled tokens (or, as in row 20, an empty string): as long as the words are spelled correctly, it gives a 100% match.

  • Row 20 shows that whenever the string being compared against is empty, the similarity percentage is zero.
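The comparison table itself is not reproduced here, but you can generate the same kind of side-by-side comparison for any pairs you like with a few lines of code (using the fuzz module imported earlier). A quick sketch:

# Compare all four ratios side by side for a few of the string pairs used above.
pairs = [
    ("Humpty Dumpty sat on a wall", "Humpty Dumpty Sat on a Wall!"),
    ("Humpty Dumpty sat on a wall", "Humpty Humpty Dumpty sat on a wall"),
    ("Humpty Dumpty sat on a wall", "Humpty"),
    ("Humpty Dumpty sat on a wall", ""),
]

scorers = [("ratio", fuzz.ratio),
           ("partial_ratio", fuzz.partial_ratio),
           ("token_sort_ratio", fuzz.token_sort_ratio),
           ("token_set_ratio", fuzz.token_set_ratio)]

for s1, s2 in pairs:
    scores = ", ".join(f"{name}={fn(s1, s2)}" for name, fn in scorers)
    print(f"{s2!r:38} {scores}")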


Hope this article enriched your knowledge about one of the most used Python libraries: FuzzyWuzzy!


Author: Devyani Koradiya