Category: python

Prerequisite Packages and Compiling Python 3 on CentOS

What are the prerequisite packages for a complete Python 3 compile and install?

Without them you always get issues like pip not being able to access PyPI because the openssl module was not built, and other parts of the build need the gcc compiler and similar tooling.

Recently I got this warning:

Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.

It is very annoying.

Install Prerequisites

yum groupinstall development
yum install zlib-devel gcc openssl-devel bzip2-devel libffi-devel xz-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel expat-devel

Compile python

cd /opt
curl -O https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz
tar xzf Python-3.8.3.tgz
cd Python-3.8.3
./configure
make
sudo make install
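
A quick sanity check after installing (not part of the original steps, just a sketch) is to try importing the optional modules that most often go missing:

# Sanity check that the optional modules were compiled in.
# Each of these typically fails when the matching -devel package was missing.
import importlib

modules = ["ssl", "lzma", "bz2", "zlib", "sqlite3", "readline", "curses", "ctypes"]

for name in modules:
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: MISSING ({exc})")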

More Issues

There may be a warning after running make:

The necessary bits to build these optional modules were not found:
_curses               _curses_panel         _dbm               
_gdbm                 _sqlite3              _tkinter           
_uuid                 readline     

The following modules found by detect_modules() in setup.py, have been
built by the Makefile instead, as configured by the Setup files:
_abc                  atexit                pwd                
time                                                           

This post mentions that some more packages are required (which I have retrospectively added to the list above):

    sudo yum install yum-utils
    sudo yum groupinstall development
    # Libraries needed during compilation to enable all features of Python:
    sudo yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel expat-devel

Even after that I still get:

    The following modules found by detect_modules() in setup.py, have been
    built by the Makefile instead, as configured by the Setup files:
    _abc                  atexit                pwd                
    time   

HTTPX: An open stream object is being garbage collected; call “stream.close()” explicitly.

When using HTTPX I get the following error printed sometimes:

An open stream object is being garbage collected; call "stream.close()" explicitly.

According to two GitHub issues, this is a Python 3.8.0 issue.

I upgraded to Python 3.8.2 and it fixed the issue!
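
As a side note (not from the issue threads, just a general pattern), closing the client explicitly instead of relying on garbage collection also avoids leaving streams open - for example by using the client as a context manager:

import httpx

# The context manager closes the connection pool deterministically,
# rather than leaving it to the garbage collector.
with httpx.Client() as client:
    response = client.get("https://httpbin.org/get")
    print(response.status_code)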

How to speed up HTTP calls in Python? With Examples

Blocking HTTP requests

The most popular and easiest-to-use blocking HTTP client is requests. The requests docs are simple and straightforward...for humans.

The biggest performance gain you can get (provided you are making requests to a single host) is using an HTTP session. This creates a persistent connection, meaning that additional requests reuse the existing connection. More info on that in this blog post on Python and fast HTTP clients.

Example Blocking HTTP persistent connection vs new connections

Here is some example code for getting quotes from quotes.rest

import requests
import time

def get_sites(sites):
    data = []

    session = requests.Session()

    for site in sites:
        response = session.get(site)
        data.append(response.json())

    return data

if __name__ == '__main__':
    categories = ["inspire", "management", "sports", "life", "funny", "love", "art", "students"]

    sites = [
        f'https://quotes.rest/qod?category={category}' for category in categories
    ]

    start_time = time.time()
    data = get_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} sites in {duration} seconds")

Don't overuse this API as they have rate limits and will eventually respond with a 429 HTTP status code - Too Many Requests.

So when I run this code:

Downloaded 8 sites in 3.7488651275634766 seconds

That is pretty fast, but what happens if I use requests.get() instead of the session?
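
For reference, the non-persistent variant just swaps the session call for a module-level requests.get() - the rest of the script stays the same:

def get_sites(sites):
    data = []

    # No Session: every request opens a new TCP (and TLS) connection
    for site in sites:
        response = requests.get(site)
        data.append(response.json())

    return data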

In that case the result was:

Downloaded 8 sites in 10.602024793624878 seconds

So in the first example, reusing the existing HTTP connection was 2.8 times faster.

Threaded HTTP Requests

There is a library built on top of requests called requests_futures that uses threads - pre-emptive multithreading.

Example Threaded Request Futures

from concurrent.futures import as_completed
from requests_futures import sessions
import time

def get_sites(sites):
    data = []

    with sessions.FuturesSession() as session:
        futures = [session.get(site) for site in sites]
        for future in as_completed(futures):
            resp = future.result()
            data.append(resp.json())

    return data

if __name__ == '__main__':
    categories = ["inspire", "management", "sports", "life", "funny", "love", "art", "students"]

    sites = [
        f'https://quotes.rest/qod?category={category}' for category in categories
    ]

    start_time = time.time()
    data = get_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} sites in {duration} seconds")

When running this code it was faster:

Downloaded 8 sites in 1.4970569610595703 seconds

Interestingly, if I set the max workers to 8 with sessions.FuturesSession(max_workers=8), it slows down dramatically:

Downloaded 8 sites in 5.838595867156982 seconds

Anyway, the threaded requests are 7 times faster than non-persistent blocking HTTP and 2.5 times faster than persistent blocking HTTP.

Asynchronous HTTP Requests

The next thing to look at is co-operative multitasking, which still uses a single thread (and a single process) but gives control of execution back to the event loop whenever it is waiting on I/O - it doesn't block.

Python has a few async HTTP libraries: aiohttp and httpx.

Example Async Aiohttp

from aiohttp import ClientSession
import asyncio
import time

async def get_sites(sites):
    tasks = [asyncio.create_task(fetch_site(s)) for s in sites] 
    return await asyncio.gather(*tasks)  

async def fetch_site(url):
    async with ClientSession() as session:
        async with session.get(url) as resp:  
            data = await resp.json()
    return data

if __name__ == '__main__':
    categories = ["inspire", "management", "sports", "life", "funny", "love", "art", "students"]

    sites = [
        f'https://quotes.rest/qod?category={category}' for category in categories
    ]

    start_time = time.time()
    data = asyncio.run(get_sites(sites))
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} sites in {duration} seconds")

The result of this code was:

Downloaded 8 sites in 1.271439790725708 seconds

That is the fastest response we have had yet: more than 8 times faster than the non-persistent blocking HTTP connection, almost 3 times faster than the persistent blocking HTTP connection, and about 17% faster than the threaded blocking HTTP requests.
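
Pulling the timings from the runs above together:

Method                            Time (s)   Speedup
requests.get() (new connections)  10.60      1.0x
requests.Session() (persistent)   3.75       2.8x
requests_futures (threads)        1.50       7.1x
aiohttp (async)                   1.27       8.3x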

Thoughts on Aiohttp

The problem with aiohttp, as rogouelynn mentions in a blog post, is that everything needs to be async.

In real-life scenarios you often need to do some synchronous stuff first, like authenticating and receiving a token.

You can't just do:

>>> session = ClientSession()
>>> response = session.get('https://iol.co.za')
>>> response
<aiohttp.client._RequestContextManager at 0x102985c80>

All you get back is a context manager, not a response.

Potentially an easier library to use is httpx, because synchronous requests are as native and easy to do as asynchronous requests.

import httpx

r = httpx.get('https://httpbin.org/get')
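
An async version with httpx looks almost the same; here is a minimal sketch (assuming the same list of quotes.rest URLs as above):

import asyncio
import httpx

async def get_sites(sites):
    # One AsyncClient is shared, so connections are pooled and reused
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*(client.get(site) for site in sites))
    return [response.json() for response in responses]

# data = asyncio.run(get_sites(sites))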

Putting it all together

How to speed up http calls in python...well go through the steps until you get the speed you need.

  1. Use simple blocking persistent HTTP connections with requests.Session()
  2. Use an asynchronous HTTP client like aiohttp or httpx.

In the steps above I skip over the threading part, as you will find that when you scale up, threading can become unreliable, and async usually matches or beats threading performance.

(Image: python-http-client-speed-comparison)