
Using FuzzyWuzzy to Match Similar Strings and Making Tweaks to Improve It

I have been scraping betting odds from a few websites in my spare time to cut down the time I spend manually checking the odds on the different sites.

I’ve been using FuzzyWuzzy to match the event names I scrape against the events I already have stored.

Example

I scrape an event name from a specific site:

name = 'Feyenoord Rotterdam - Wolfsberger AC'

Now I want to match it with an existing event in the database.

I fetch the potential events based on the sport and type (here, Europa League football), which gives me a list of options:

event_names = ['Ac Milan - Sparta Praha', 'Aek Athen - Leicester', 'Crvena Zvezd - Slovan Libe',
'Cska Moscow - Din Zagreb', 'Feyen Rotte - Wolfsberger', 'FK Qarabag - CF Villarreal',
'Gent - 1899 Hoffenh', 'Karabakh A - Villarreal', 'Lask Linz - Ludogo Raz', 'LASK Linz - PFC Ludogorets',
'Lille - Celtic', 'R Antwerp - Tottenham', 'Red Star Belgrade - FC Slovan Liberec',
'Sivasspor - M Tel Aviv', 'Zorya Lugan - Braga']

Then I call extractOne:

from fuzzywuzzy import fuzz, process

match, level = process.extractOne(name, event_names)

The problem is it picks the incorrect option:

('Ac Milan - Sparta Praha', 86)

where it should choose:

Feyen Rotte - Wolfsberger

Fuzz Ratio vs Partial Ratio

fuzz.ratio() is supposed to work well with short and long strings but not with labels made up of 3 or 4 words, which is exactly the type of matching we need.

Here are a few tests:
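For intuition about how the two scorers differ, here is a rough sketch built on the standard library's difflib rather than on FuzzyWuzzy itself. The function names simple_ratio and simple_partial_ratio are made up for this illustration, and the numbers they produce will not necessarily match FuzzyWuzzy's. The idea is that a plain ratio compares the two whole strings, while a partial ratio slides the shorter string along the longer one and keeps the best window.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two whole strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a, b):
    """Best similarity between the shorter string and any
    equally long window of the longer string (naive version)."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    best = 0
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, simple_ratio(shorter, window))
    return best


if __name__ == "__main__":
    print(simple_ratio("Wolfsberger", "Feyen Rotte - Wolfsberger"))
    print(simple_partial_ratio("Wolfsberger", "Feyen Rotte - Wolfsberger"))
```

With this sketch, a team name that appears in full inside a longer label gets a perfect partial score but a much lower plain ratio, which is why partial matching can be too generous for event names.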

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
24

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
82

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
26

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
80

Looks like ratio works better than partial in this case, so why was extractOne giving bad results?

When ordering is an issue the token_sort_ratio method is used. Ordering is not really an issue here, as the home team is usually stated first when it comes to sport.

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
36

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
81
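As a rough illustration of the token-sort idea, the sketch below lowercases both strings, sorts their whitespace-separated tokens and then compares the results. It uses difflib from the standard library as a stand-in, and sort_tokens and simple_token_sort_ratio are hypothetical names for this example, so the scores are not guaranteed to match FuzzyWuzzy's.

```python
from difflib import SequenceMatcher


def sort_tokens(text):
    """Lowercase the text and put its tokens in alphabetical order."""
    return " ".join(sorted(text.lower().split()))


def simple_token_sort_ratio(a, b):
    """Compare two strings after sorting their tokens, scaled to 0-100."""
    matcher = SequenceMatcher(None, sort_tokens(a), sort_tokens(b))
    return round(100 * matcher.ratio())


if __name__ == "__main__":
    # Same teams, home and away swapped: the order no longer matters.
    print(simple_token_sort_ratio("Lille - Celtic", "Celtic - Lille"))
```

Swapping home and away teams gives a perfect score here, which is useful when sites disagree on ordering but adds little when they all list the home team first.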

Specifying the Scorer

It turns out that process.extract() and process.extractOne() let you specify a scorer.
The default scorer was evidently not the one I needed:

# Default Extract
process.extract(name, event_names)
[('Ac Milan - Sparta Praha', 86),
 ('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('R Antwerp - Tottenham', 45),
 ('Zorya Lugan - Braga', 40)]

# Fuzz Ratio scorer
process.extract(name, event_names, scorer=fuzz.ratio)
[('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Zorya Lugan - Braga', 40),
 ('Aek Athen - Leicester', 39),
 ('Crvena Zvezd - Slovan Libe', 39)]

# Partial Ratio scorer
process.extract(name, event_names, scorer=fuzz.partial_ratio)
[('Feyen Rotte - Wolfsberger', 80),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Crvena Zvezd - Slovan Libe', 46),
 ('Aek Athen - Leicester', 43),
 ('R Antwerp - Tottenham', 43)]

# Token Sort Ratio scorer
process.extract(name, event_names, scorer=fuzz.token_sort_ratio)
[('Feyen Rotte - Wolfsberger', 81),
 ('R Antwerp - Tottenham', 42),
 ('Aek Athen - Leicester', 38),
 ('Red Star Belgrade - FC Slovan Liberec', 38),
 ('Ac Milan - Sparta Praha', 36)]

So it is clear I should set the scorer to fuzz.ratio:

process.extractOne(name, event_names, scorer=fuzz.ratio)
('Feyen Rotte - Wolfsberger', 82)

It is also worth noting that you probably want to set a threshold and only accept matches that score above 80.
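FuzzyWuzzy's extractOne accepts a score_cutoff argument and returns None when no choice clears it, so the calling code has to handle the "no match" case. The sketch below shows that pattern with a difflib-based stand-in so it runs without the library; best_match is a hypothetical helper and the cutoff of 80 is only the rule of thumb from above, not a tuned value.

```python
from difflib import SequenceMatcher


def best_match(query, choices, cutoff=80):
    """Return (choice, score) for the best-scoring choice,
    or None if nothing reaches the cutoff."""
    best = None
    for choice in choices:
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        score = round(100 * matcher.ratio())
        if best is None or score > best[1]:
            best = (choice, score)
    if best is None or best[1] < cutoff:
        return None
    return best


if __name__ == "__main__":
    options = ["Lille - Celtic", "Gent - 1899 Hoffenh"]
    result = best_match("Lille - Celtic", options)
    if result is None:
        print("No confident match, create a new event instead")
    else:
        print(result)
```

Treating a missing match as "create a new event" rather than forcing the closest candidate avoids linking odds to the wrong fixture.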

What is the default scorer?

If we look in the library's process.py:

default_scorer = fuzz.WRatio

The W stands for weighted and this is the description of the function:

# w is for weighted
def WRatio(s1, s2, force_ascii=True, full_process=True):
    """
    Return a measure of the sequences' similarity between 0 and 100, using different algorithms.

    **Steps in the order they occur**

    #. Run full_process from utils on both strings
    #. Short circuit if this makes either string empty
    #. Take the ratio of the two processed strings (fuzz.ratio)
    #. Run checks to compare the length of the strings
        * If one of the strings is more than 1.5 times as long as the other
          use partial_ratio comparisons - scale partial results by 0.9
          (this makes sure only full results can return 100)
        * If one of the strings is over 8 times as long as the other
          instead scale by 0.6

    #. Run the other ratio functions
        * if using partial ratio functions call partial_ratio,
          partial_token_sort_ratio and partial_token_set_ratio
          scale all of these by the ratio based on length
        * otherwise call token_sort_ratio and token_set_ratio
        * all token based comparisons are scaled by 0.95
          (on top of any partial scalars)

    #. Take the highest value from these results
       round it and return it as an integer.

    :param s1:
    :param s2:
    :param force_ascii: Allow only ascii characters
    :type force_ascii: bool
    :full_process: Process inputs, used here to avoid double processing in extract functions (Default: True)
    :return:
    """

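Reading those steps against the example suggests why 'Ac Milan - Sparta Praha' scored 86. After processing, the scraped name is more than 1.5 times as long as that candidate, so the partial comparisons kick in, and the two strings share the whole token "ac". As far as I can tell, the partial token-set comparison scores 100 as soon as a whole token is shared, and 100 scaled by 0.9 and 0.95 is 85.5, which rounds to 86. The sketch below only checks that arithmetic: preprocess is a rough stand-in for full_process, and wratio_inputs is a hypothetical helper, not part of FuzzyWuzzy.

```python
import re


def preprocess(text):
    """Rough stand-in for full_process: non-word characters become
    spaces, then the text is lowercased and stripped."""
    return re.sub(r"\W", " ", text).lower().strip()


def wratio_inputs(query, candidate):
    """Report the quantities the WRatio docstring branches on.
    Assumes both strings are non-empty after preprocessing."""
    p1, p2 = preprocess(query), preprocess(candidate)
    len_ratio = max(len(p1), len(p2)) / min(len(p1), len(p2))
    return {
        "len_ratio": round(len_ratio, 2),
        "uses_partial": len_ratio > 1.5,  # threshold taken from the docstring
        "shared_tokens": set(p1.split()) & set(p2.split()),
    }


# Best possible partial token score: 100 scaled by 0.95 (token based)
# and by 0.9 (partial), as described in the docstring.
scaled_partial_token_score = 100 * 0.95 * 0.9

if __name__ == "__main__":
    name = "Feyenoord Rotterdam - Wolfsberger AC"
    print(wratio_inputs(name, "Ac Milan - Sparta Praha"))
    print(wratio_inputs(name, "Feyen Rotte - Wolfsberger"))
    print(scaled_partial_token_score)
```

The correct candidate stays under the 1.5 length ratio, so it is judged on the full-string comparisons and ends up with 82, just below the inflated 86.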
Hope this post helps you…

Oh, and another library I found that may make your life easier, so you don’t even have to use FuzzyWuzzy, is recordlinker. It takes two separate data sources and links them together. I still need to check that out.

Sources