Category: python

Initial Setup of an Ubuntu 20.04 VM with Compiled Python 3.10

This assumes you have logged in as root with a password.

Creating a non-root User

adduser ubuntu
usermod -aG sudo ubuntu

Add your SSH key to the new User

ssh-copy-id ubuntu@
ssh ubuntu@

Enabling Firewall

sudo ufw app list
sudo ufw allow OpenSSH
sudo ufw enable
sudo ufw status

Disable root login and password-based login

sudo vim /etc/ssh/sshd_config

Uncomment and set:

PasswordAuthentication no
PermitRootLogin no

Restart ssh:

sudo systemctl restart ssh.service
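
To double-check the edit, here is a minimal Python sketch that reads sshd_config-style text and reports the value that would take effect for each keyword. The helper name effective_sshd_settings and the sample text are my own, and it deliberately ignores Include and Match blocks, so treat it as a rough check rather than a replacement for sshd's own parsing. It relies on sshd using the first value it obtains for a keyword.

```python
def effective_sshd_settings(config_text):
    """Return {lowercase keyword: value}, keeping the first value seen per keyword."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and commented-out defaults.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts
        # Keywords are case-insensitive and the first occurrence wins.
        settings.setdefault(keyword.lower(), value.strip())
    return settings


# Hypothetical sample; on the server you would read /etc/ssh/sshd_config instead.
sample = """\
#PermitRootLogin prohibit-password
PermitRootLogin no
PasswordAuthentication no
PasswordAuthentication yes
"""
settings = effective_sshd_settings(sample)
print(settings["permitrootlogin"], settings["passwordauthentication"])
```

If a later duplicate line disagrees with the first one, as in the sample, it is the first one that counts, which is an easy thing to miss when uncommenting lines by hand.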

Set Timezone

sudo timedatectl list-timezones
sudo timedatectl set-timezone Africa/Johannesburg
sudo apt install ntp

Install nginx

sudo apt install -y nginx

Install required OS packages for Python 3.10

sudo apt install -y build-essential checkinstall
sudo apt install -y libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libffi-dev liblzma-dev

Get Python

cd /opt
sudo wget
sudo tar xzf Python-3.10.2.tgz
cd Python-3.10.2


sudo ./configure --enable-optimizations

Install alongside default system python3.8:

sudo make altinstall
python3.10 --version
Python 3.10.2
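
A version number alone does not tell you whether the optional modules were built, since a missing -dev package only shows up as a module that fails to import. Here is a small sketch to run with the new interpreter. The module list is my own guess at what the packages above are for, and missing_modules is a hypothetical helper name.

```python
import importlib

# Modules that (I assume) depend on the -dev packages installed earlier.
OPTIONAL_MODULES = ["ssl", "sqlite3", "bz2", "lzma", "readline", "ctypes", "tkinter", "dbm.gnu"]


def missing_modules(names):
    """Return the names that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            # Actually import the module so a missing C extension is detected.
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("missing:", missing_modules(OPTIONAL_MODULES))
```

Save it under any name you like and run it with python3.10. An empty list suggests the build picked up all the libraries; anything listed usually means a -dev package was missing when ./configure ran.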

Install MySQL

sudo apt install -y mysql-server
sudo mysql_secure_installation

Python Mysqlclient

The mysqlclient Python package requires the MySQL client development headers:

sudo apt install libmysqlclient-dev


Installing Python 3.9 on Ubuntu 20.04 from Source

I've found that installing Python from source on Ubuntu just makes your life easier. Python depends on a few system binaries and linked libraries, so you need to make sure they are present first.

sudo apt install software-properties-common build-essential \
libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev \
tk-dev libgdbm-dev libc6-dev libbz2-dev libncurses-dev \
libpcap-dev libexpat1-dev libffi-dev liblzma-dev libgdbm-compat-dev

Get the link to the latest tarball from the Linux/Unix downloads page:

cd /opt
sudo wget
sudo tar xzf Python-3.9.1.tgz
cd Python-3.9.1
# read the README
cat README.rst

It will tell you what to do (sudo is used throughout because the source was extracted as root):

sudo ./configure
sudo make
sudo make test
sudo make install

Python 3.8 is installed by default, so to create a virtual environment with the new interpreter use:

python3.9 -m venv env
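
After activating the environment, a quick way to confirm you are really inside it (and on the interpreter you expect) is to compare sys.prefix with sys.base_prefix. This is just a sketch, and in_virtualenv is a name I made up for it.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print(sys.version_info[:2], "venv active:", in_virtualenv())
```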

Using FuzzyWuzzy to Match Similar Strings and Making Tweaks to Improve It

I have been scraping betting odds from a few websites in my spare time to cut down the time spent manually checking the odds on the different sites.

I've been using FuzzyWuzzy to match the event names between sites.


I scrape an event name from a specific site:

name = 'Feyenoord Rotterdam - Wolfsberger AC'

Now I want to match it with an existing event in the database.

I get a list of potential events based on the sport and type, in this case Europa League football.

That gives me a list of options:

event_names = ['Ac Milan - Sparta Praha', 'Aek Athen - Leicester', 'Crvena Zvezd - Slovan Libe',
'Cska Moscow - Din Zagreb', 'Feyen Rotte - Wolfsberger', 'FK Qarabag - CF Villarreal',
'Gent - 1899 Hoffenh', 'Karabakh A - Villarreal', 'Lask Linz - Ludogo Raz', 'LASK Linz - PFC Ludogorets',
'Lille - Celtic', 'R Antwerp - Tottenham', 'Red Star Belgrade - FC Slovan Liberec',
'Sivasspor - M Tel Aviv', 'Zorya Lugan - Braga']

Then I call extractOne:

from fuzzywuzzy import fuzz, process
match, level = process.extractOne(name, event_names)

The problem is it picks the incorrect option:

('Ac Milan - Sparta Praha', 86)

where it should choose:

Feyen Rotte - Wolfsberger

Fuzz Ratio vs Partial Ratio

fuzz.ratio() works well with very short and very long strings, but not as well with labels of 3 or 4 words, which is exactly the type of matching we need.

Here are a few tests done:

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')

fuzz.ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')

fuzz.partial_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')

Looks like ratio works better than partial in this case, so why was extractOne giving bad results?

When word order is an issue, the token_sort_ratio method is used. That is not really a problem here, as the home team is usually stated first when it comes to sport.

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Ac Milan - Sparta Praha')

fuzz.token_sort_ratio('Feyenoord Rotterdam - Wolfsberger AC', 'Feyen Rotte - Wolfsberger')
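
To get a feel for what these two scorers do, here is a rough stand-in written with the standard library's difflib. It is not fuzzywuzzy's code: simple_ratio and token_sort are my own names, the token handling is simplified (fuzzywuzzy also strips punctuation first), and the scores may differ slightly from the real library, especially if it is using python-Levenshtein.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity of two strings, in the spirit of fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort(a, b):
    """Sort the lowercase whitespace tokens of each string, then compare them."""
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(normalise(a), normalise(b))


name = "Feyenoord Rotterdam - Wolfsberger AC"
for candidate in ("Ac Milan - Sparta Praha", "Feyen Rotte - Wolfsberger"):
    print(candidate, simple_ratio(name, candidate), token_sort(name, candidate))
```

The point of sorting the tokens is that 'Wolfsberger AC - Feyenoord' and 'Feyenoord - Wolfsberger AC' would compare as equal, which only matters if your sources disagree on the order of the teams.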

Specifying the Scorer

It turns out the process.extract() and process.extractOne() methods let you specify a scorer.
The default scorer was evidently not the one I needed:

# Default Extract
process.extract(name, event_names)
[('Ac Milan - Sparta Praha', 86),
 ('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('R Antwerp - Tottenham', 45),
 ('Zorya Lugan - Braga', 40)]

# Fuzz Ratio scorer
process.extract(name, event_names, scorer=fuzz.ratio)
[('Feyen Rotte - Wolfsberger', 82),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Zorya Lugan - Braga', 40),
 ('Aek Athen - Leicester', 39),
 ('Crvena Zvezd - Slovan Libe', 39)]

# Partial Ratio scorer
process.extract(name, event_names, scorer=fuzz.partial_ratio)
[('Feyen Rotte - Wolfsberger', 80),
 ('Red Star Belgrade - FC Slovan Liberec', 47),
 ('Crvena Zvezd - Slovan Libe', 46),
 ('Aek Athen - Leicester', 43),
 ('R Antwerp - Tottenham', 43)]

# Token Sort Ratio scorer
process.extract(name, event_names, scorer=fuzz.token_sort_ratio)
[('Feyen Rotte - Wolfsberger', 81),
 ('R Antwerp - Tottenham', 42),
 ('Aek Athen - Leicester', 38),
 ('Red Star Belgrade - FC Slovan Liberec', 38),
 ('Ac Milan - Sparta Praha', 36)]

So it is clear I should set the scorer to fuzz.ratio:

process.extractOne(name, event_names, scorer=fuzz.ratio)
('Feyen Rotte - Wolfsberger', 82)

It is also worth noting that you probably want a match threshold above 80, since with fuzz.ratio the wrong candidates here all score below 50.
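
fuzzywuzzy's extractOne accepts a score_cutoff argument for this and returns None when nothing reaches it. The sketch below shows the same idea using only difflib, so best_match and its scoring are a stand-in of my own and not the library's implementation.

```python
from difflib import SequenceMatcher


def best_match(query, choices, score_cutoff=80):
    """Return (choice, score) for the best choice, or None if it scores below score_cutoff."""
    def score(choice):
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        return round(100 * matcher.ratio())

    if not choices:
        return None
    best = max(choices, key=score)
    best_score = score(best)
    return (best, best_score) if best_score >= score_cutoff else None


candidates = ["Ac Milan - Sparta Praha", "Feyen Rotte - Wolfsberger", "Lille - Celtic"]
print(best_match("Feyenoord Rotterdam - Wolfsberger AC", candidates))
print(best_match("Feyenoord Rotterdam - Wolfsberger AC", candidates, score_cutoff=95))
```

Returning None instead of a weak match means an event that is not in the database yet gets flagged for a manual look rather than silently attached to the wrong fixture.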

What is the default scorer?

If we look in the library source:

default_scorer = fuzz.WRatio

The W stands for weighted, and this is the description of the function:

# w is for weighted
def WRatio(s1, s2, force_ascii=True, full_process=True):
    """
    Return a measure of the sequences' similarity between 0 and 100, using different algorithms.

    **Steps in the order they occur**

    #. Run full_process from utils on both strings
    #. Short circuit if this makes either string empty
    #. Take the ratio of the two processed strings (fuzz.ratio)
    #. Run checks to compare the length of the strings
        * If one of the strings is more than 1.5 times as long as the other
          use partial_ratio comparisons - scale partial results by 0.9
          (this makes sure only full results can return 100)
        * If one of the strings is over 8 times as long as the other
          instead scale by 0.6

    #. Run the other ratio functions
        * if using partial ratio functions call partial_ratio,
          partial_token_sort_ratio and partial_token_set_ratio
          scale all of these by the ratio based on length
        * otherwise call token_sort_ratio and token_set_ratio
        * all token based comparisons are scaled by 0.95
          (on top of any partial scalars)

    #. Take the highest value from these results
       round it and return it as an integer.

    :param s1:
    :param s2:
    :param force_ascii: Allow only ascii characters
    :type force_ascii: bool
    :full_process: Process inputs, used here to avoid double processing in extract functions (Default: True)
    """
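
Reading those steps, my guess at what went wrong is this: 'Ac Milan - Sparta Praha' is short enough compared to the scraped name to trip the 1.5 times rule, so the partial scorers kick in, and both names contain the token 'ac'. A partial token set comparison can score a shared token very highly, and 100 x 0.9 x 0.95 = 85.5 is suspiciously close to the 86 that came back. The sketch below only checks those two conditions. It is my own approximation (processed is a rough stand-in for full_process and wratio_branch is a made-up name), not the library's code.

```python
import re


def processed(s):
    """Rough stand-in for full_process: non-alphanumerics to spaces, lowercase, strip."""
    return re.sub(r"[^a-z0-9]", " ", s.lower()).strip()


def wratio_branch(s1, s2):
    """Return (length ratio, whether the partial branch applies, shared tokens)."""
    p1, p2 = processed(s1), processed(s2)
    if not p1 or not p2:
        # WRatio short-circuits when either processed string is empty.
        return 0.0, False, set()
    len_ratio = max(len(p1), len(p2)) / min(len(p1), len(p2))
    shared = set(p1.split()) & set(p2.split())
    return len_ratio, len_ratio > 1.5, shared


name = "Feyenoord Rotterdam - Wolfsberger AC"
for candidate in ("Ac Milan - Sparta Praha", "Feyen Rotte - Wolfsberger"):
    len_ratio, uses_partial, shared = wratio_branch(name, candidate)
    print(f"{candidate}: length ratio {len_ratio:.2f}, "
          f"partial branch {uses_partial}, shared tokens {shared}")
```

If that reading is right, the correct candidate stays on the non-partial branch and tops out at its plain ratio of 82, which is why it lost to a name that only shares 'AC'.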

Hope this post helps you...

Oh, and another library I found that may make your life easier, so you may not even have to use fuzzywuzzy, is recordlinker. It takes two separate data sources and links them together. I still need to check that out.