Category: python

Using FuzzyWuzzy to Match Similar Strings and Making Tweaks to Improve It

I have been scraping betting odds from a few websites in my spare time, to cut down the time spent manually checking the odds on the different sites.

I've been using FuzzyWuzzy to do the matching.

Example

I scrape an event name from a specific site:

name = 'Feyenoord Rotterdam - Wolfsberger AC'

Now I want to match it with an existing event in the database.

I get a list of potential events based on the sport and type - Europa League football.

I get a list of options:

event_names = ['Ac Milan - Sparta Praha', 'Aek Athen - Leicester', 'Crvena Zvezd - Slovan Libe',
'Cska Moscow - Din Zagreb', 'Feyen Rotte - Wolfsberger', 'FK Qarabag - CF Villarreal',
'Gent - 1899 Hoffenh', 'Karabakh A - Villarreal', 'Lask Linz - Ludogo Raz', 'LASK Linz - PFC Ludogorets',
'Lille - Celtic', 'R Antwerp - Tottenham', 'Red Star Belgrade - FC Slovan Liberec',
'Sivasspor - M Tel Aviv', 'Zorya Lugan - Braga']

Then I use extractOne:

from fuzzywuzzy import fuzz, process

match, level = process.extractOne(name, event_names)

The problem is it picks the incorrect option:

('Ac Milan - Sparta Praha', 86)

where it should choose:

Feyen Rotte - Wolfsberger

Fuzz Ratio vs Partial Ratio

fuzz.ratio() works well with short and with long strings, but not with labels made up of 3 or 4 words - which is exactly the type of matching we need.

Here are a few tests I ran:

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
24

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
82

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
26

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
80

It looks like ratio works better than partial_ratio in this case - so why was extractOne giving bad results?

When ordering is an issue, the token_sort_ratio method is used. That is not really an issue here, as the home team is usually stated first when it comes to sport.

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')
36

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
81

Specifying the Scorer

It turns out the process.extract() and process.extractOne() methods let you specify a scorer.
The default scorer must not be the one I needed:

# Default Extract
process.extract(name, event_names)
[('Ac Milan - Sparta Praha', 86),
 ('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('R Antwerp - Tottenham', 45),
 ('Zorya Lugan - Braga', 40)]

# Fuzz Ratio scorer
process.extract(name, event_names, scorer=fuzz.ratio)
[('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Zorya Lugan - Braga', 40),
 ('Aek Athen - Leicester', 39),
 ('Crvena Zvezd - Slovan Libe', 39)]

# Partial Ratio scorer
process.extract(name, event_names, scorer=fuzz.partial_ratio)
[('Feyen Rotte - Wolfsberger', 80),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Crvena Zvezd - Slovan Libe', 46),
 ('Aek Athen - Leicester', 43),
 ('R Antwerp - Tottenham', 43)]

process.extract(name, event_names, scorer=fuzz.token_sort_ratio)
[('Feyen Rotte - Wolfsberger', 81),
 ('R Antwerp - Tottenham', 42),
 ('Aek Athen - Leicester', 38),
 ('Red Star Belgrade - FC Slovan Liberec', 38),
 ('Ac Milan - Sparta Praha', 36)]

So it is clear I should set the scorer to fuzz.ratio:

process.extractOne(name, event_names, scorer=fuzz.ratio)
('Feyen Rotte - Wolfsberger', 82)

It is also worth noting that you probably want the match threshold set above 80.
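
fuzzywuzzy's extract functions also take a score_cutoff parameter, so a rough way to enforce that threshold looks like this (the cutoff of 80 is my own choice based on the scores above):

# extractOne returns None when nothing scores at or above score_cutoff
result = process.extractOne(name, event_names, scorer=fuzz.ratio, score_cutoff=80)
if result is not None:
    match, level = result
else:
    match = None  # no confident match - treat the scraped event as new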

What is the default scorer?

If we look in the library's process.py:

default_scorer = fuzz.WRatio

The W stands for weighted, and this is the description of the function:

# w is for weighted
def WRatio(s1, s2, force_ascii=True, full_process=True):
    """
    Return a measure of the sequences' similarity between 0 and 100, using different algorithms.

    **Steps in the order they occur**

    #. Run full_process from utils on both strings
    #. Short circuit if this makes either string empty
    #. Take the ratio of the two processed strings (fuzz.ratio)
    #. Run checks to compare the length of the strings
        * If one of the strings is more than 1.5 times as long as the other
          use partial_ratio comparisons - scale partial results by 0.9
          (this makes sure only full results can return 100)
        * If one of the strings is over 8 times as long as the other
          instead scale by 0.6

    #. Run the other ratio functions
        * if using partial ratio functions call partial_ratio,
          partial_token_sort_ratio and partial_token_set_ratio
          scale all of these by the ratio based on length
        * otherwise call token_sort_ratio and token_set_ratio
        * all token based comparisons are scaled by 0.95
          (on top of any partial scalars)

    #. Take the highest value from these results
       round it and return it as an integer.

    :param s1:
    :param s2:
    :param force_ascii: Allow only ascii characters
    :type force_ascii: bool
    :full_process: Process inputs, used here to avoid double processing in extract functions (Default: True)
    :return:
    """

Hope this post helps you...

Oh, and another library I found that may make your life easier, so you don't even have to use fuzzywuzzy, is recordlinker. It takes two separate data sources and links them together... I still need to check that out.


Using django-oauth-toolkit for the Client Credentials OAuth Flow

I've been wanting to secure my API - so that unidentified and unauthorized parties cannot view, update, create or delete data.
This API is internal to the company and will only be used by other services - in other words, no end users.
Hence the delegation of authorization need not happen, and the services will be authenticating directly with the API.

That is why the OAuth client credentials flow is used - it is for server-to-server communication (as far as I know).

There is a lot of conflicting information on OAuth, but RFC 6749 on OAuth 2 mentions client credentials:

1.3.4.  Client Credentials

   The client credentials (or other forms of client authentication) can
   be used as an authorization grant when the authorization scope is
   limited to the protected resources under the control of the client,
   or to protected resources previously arranged with the authorization
   server.  Client credentials are used as an authorization grant
   typically when the client is acting on its own behalf (the client is
   also the resource owner) or is requesting access to protected
   resources based on an authorization previously arranged with the
   authorization server.

Nordic APIs' book Securing the API Stronghold mentions:

Oauth: It’s for delegation, and delegation only

I agree, except when the client is the resource owner, as in the client credentials case.
In that case surely there is no delegation?

Should we use it?

What is the advantage over a basic auth or token authentication method?

It seems to just be an added step for the client, but the key is that the token expires. So if a bad actor gets hold of our token, it will not last long before it is of no use.
The client id and secret are what is used to generate tokens for future calls to the API.

Difference between the Resource Owner Password Flow and the Client Credentials Flow

django-oauth-toolkit provides both, and their example uses the resource owner password flow.
In both cases the resource owner is the client - so there is no delegation.

So what is the difference?

I checked on Stack Overflow, and it turns out I was wrong.

In the resource owner password flow, the resource owner (the end user) trusts the client application enough to give it their username and password.
We don't really want this.

Implementing Client Credentials flow

Since users are not going to use the API and only services/clients will, I want to disable the other authorization flows and disable registering of clients.

I will manage the clients and they will be the resource owners.

So following the information in the django-oauth-toolkit documentation on setting it up for client credentials should help.
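
Once a client application with the client credentials grant type exists, getting and using a token looks roughly like this - a minimal sketch where the URLs, client id and secret are placeholders, not values from the toolkit:

import requests

TOKEN_URL = 'https://api.example.com/o/token/'  # assumed token endpoint

# Exchange the client id/secret for a short-lived access token
resp = requests.post(
    TOKEN_URL,
    data={'grant_type': 'client_credentials'},
    auth=('my-client-id', 'my-client-secret'),  # hypothetical credentials
)
resp.raise_for_status()
token = resp.json()['access_token']

# Use the token on subsequent calls until it expires
requests.get(
    'https://api.example.com/events/',  # hypothetical endpoint
    headers={'Authorization': f'Bearer {token}'},
)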

Permissions are significantly different from Django Permissions

What I found out during testing is that the OAuth toolkit implements its own separate permissions. So if you were wanting to use Django model permissions (add, change, view and delete), you won't be able to.

Wait... I spoke too soon.

You can allow this with:

permission_classes = [IsAuthenticatedOrTokenHasScope, DjangoModelPermissions]

However, that means you actually have to test with scopes if you expect a client to use it with OAuth rather than Django auth.
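
Putting that together, a minimal sketch of a DRF view (the Event model, serializer and the 'events' scope are my own example names, not from the toolkit):

from oauth2_provider.contrib.rest_framework import IsAuthenticatedOrTokenHasScope
from rest_framework import viewsets
from rest_framework.permissions import DjangoModelPermissions

from .models import Event
from .serializers import EventSerializer


class EventViewSet(viewsets.ModelViewSet):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
    # OAuth clients need a token carrying the 'events' scope; session users
    # fall back to the normal Django model permissions instead.
    permission_classes = [IsAuthenticatedOrTokenHasScope, DjangoModelPermissions]
    required_scopes = ['events']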

For plain (non-DRF) Django views, ClientProtectedResourceView is the view to use.
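
A rough sketch of how that might look (EventListView and the response are placeholders; the view class comes from django-oauth-toolkit's generic views):

from django.http import JsonResponse
from oauth2_provider.views.generic import ClientProtectedResourceView


class EventListView(ClientProtectedResourceView):
    # Only reachable by clients that authenticate with their client credentials
    def get(self, request, *args, **kwargs):
        return JsonResponse({'events': []})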

Prerequisite Packages and Compiling Python 3 on CentOS

What are the prerequisite packages for a complete Python 3 compile and install?

Without them you always get issues like pip not being able to access PyPI because the ssl module was not built, and other parts need the gcc compiler and such.

Recently I got this warning:

Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.

It is very annoying.

Install Prerequisites

yum groupinstall development
yum install zlib-devel gcc openssl-devel bzip2-devel libffi-devel xz-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel expat-devel

Compile python

cd /opt
curl -O https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz
tar xzf Python-3.8.3.tgz
cd Python-3.8.3
./configure
make
sudo make install

More Issues

There may be a warning after running make:

The necessary bits to build these optional modules were not found:
_curses               _curses_panel         _dbm               
_gdbm                 _sqlite3              _tkinter           
_uuid                 readline     

The following modules found by detect_modules() in setup.py, have been
built by the Makefile instead, as configured by the Setup files:
_abc                  atexit                pwd                
time                                                           

This post mentions that some more are required (which I retrospectively added above):

    sudo yum install yum-utils
    sudo yum groupinstall development
    # Libraries needed during compilation to enable all features of Python:
    sudo yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel expat-devel

Even after that I still get:

    The following modules found by detect_modules() in setup.py, have been
    built by the Makefile instead, as configured by the Setup files:
    _abc                  atexit                pwd                
    time
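
That last part is informational rather than an error as far as I can tell - those modules are meant to be built into the interpreter. To check which of the optional modules actually ended up available, here is a quick sanity check to run with the freshly built interpreter (this snippet is my own, not part of the build output):

import importlib

# Optional stdlib modules that depend on the -devel packages installed above
for mod in ('ssl', 'lzma', 'bz2', 'sqlite3', 'readline', 'curses', 'ctypes', 'tkinter'):
    try:
        importlib.import_module(mod)
        print(f'{mod}: OK')
    except ImportError as exc:
        print(f'{mod}: MISSING ({exc})')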