Categories
sqlachemy

SQLAlchemy: Filtering a many-to-many on a field but keeping all results of that field

Going Back to Basics

In models.py:

"""
Basic Product Category models using reference from:
https://docs.sqlalchemy.org/en/14/orm/relationship_api.html#sqlalchemy.orm.relationship
"""

from sqlalchemy import Column, ForeignKey, Table, Integer, Boolean, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

category_products = Table(
    "category_products",
    Base.metadata,
    Column("product_id", ForeignKey("products.id"), primary_key=True),
    Column("category_id", ForeignKey("category.id"), primary_key=True),
)

class Products(Base):
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)

    label = Column(String, nullable=False)
    is_active = Column(Boolean, default=True)

    # secondary specifies the intermediary table for many-to-many
    categories = relationship(
        "Category", secondary=category_products, back_populates="products"
    )

    primary_subcategory_id = Column("primary_subcategory", ForeignKey("category.id"))
    primary_subcategory = relationship("Category", back_populates="primary_products")

class Category(Base):
    __tablename__ = "category"
    id = Column(Integer, primary_key=True)

    is_active = Column(Boolean, default=True)
    label = Column(String(255), nullable=False)

    products = relationship(
        "Products", secondary=category_products, back_populates="categories"
    )
    primary_products = relationship("Products", back_populates="primary_subcategory")

In initialise_db.py:

import random

from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from sqlalchemy.sql.expression import func

from models import Base, Category, Products

def get_engine():
    return create_engine(
        "postgresql://postgres:pass@127.0.0.1:5432/products?application_name=fixes",
        echo=True,
        future=True,
    )

if __name__ == "__main__":

    engine = get_engine()

    # Emit Create Table DDL
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

    # Create categories
    with Session(engine) as session:
        for number in range(10):
            category_record = Category(
                is_active=True,
                label=number
            )
            session.add(category_record)
        session.commit()

    # Create products
    with Session(engine) as session:
        for number in range(10):
            product_record = Products(
                is_active=True,
                label=number
            )
            session.add(category_record)

            number_of_linked_items = random.randint(1, 3)

            # Link Products with categories
            for _ in range(number_of_linked_items):
                quantity = random.randint(2, 4)
                category_record = session.query(Category).order_by(func.random()).first()

                product_record.categories.append(category_record)

        session.commit()

To initialise this example:

  1. Have access to a postgres cluster
  2. Create a database
  3. Update the create_engine part of intialise_db.py
  4. Install requirements – I think that is pip install sqlalchemy
  5. Initialise the db: python initialise_db.py

Querying

What is required is to acquire all the products in a given category – along with their categories without additional N +1 queries.

To get started let the products along with their categories be selected.

In run_query.py:

from sqlalchemy import select
from sqlalchemy.orm import Session

from initialise_db import get_engine
from models import Products

if __name__ == "__main__":

    with Session(get_engine()) as session:
        query = select(Products).join(
            Products.categories
        )

        products = session.execute(query).scalars().all()

        for product in products:
            print('Product:', product.label)
            for category in product.categories:
                print('Category:', category.label)

The results:

$ python run_queries.py
Product: 0
Category: 3
Category: 7
Product: 0
Category: 3
Category: 7
Product: 1
Category: 8
Product: 2
Category: 1
Category: 4
Product: 2
Category: 1
Category: 4
Product: 3
Category: 0
Product: 4
Category: 4
Category: 5
Category: 7
Product: 4
Category: 4
Category: 5
Category: 7
Product: 4
Category: 4
Category: 5
Category: 7
Product: 5
Category: 8
Category: 9
Product: 5
Category: 8
Category: 9
Product: 6
Category: 5
Product: 7
Category: 0
Category: 9
Product: 7
Category: 0
Category: 9
Product: 8
Category: 7
Product: 9
Category: 8

You will notice that some products have been duplicated – the ones that have 2 or more categories linked. This is due to the query joining and returning 16 rows – for all related categories.

The N + 1 Solution

The meat of it is:

select(Products).join(
    Products.categories
)

This results in a query:

SELECT * 
FROM products
JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id

Then for each result of product, another query is made to get the categories (where 1 is the product id):

SELECT * FROM category, category_products 
WHERE 1 = category_products.product_id AND category.id = category_products.category_id

In total this runs 11 queries.

This is not ideal as when the product result set grows – the number of additional queries to run grows.

It is also not ideal in that it is seen that thre are 16 products when in our results only the unique results are wanted (the 10 products)

Fixing Uniqueness

To ensure that only unique products are returned in the loop sqlalchemy provides the unique() function that one can chain to a session.execute()

A note from the docs about unique():

When including joinedload() in reference to a one-to-many or many-to-many collection, the Result.unique() method must be applied to the returned result, which will uniquify the incoming rows by primary key that otherwise are multiplied out by the join. The ORM will raise an error if this is not present.

This is not automatic in modern SQLAlchemy, as it changes the behavior of the result set to return fewer ORM objects than the statement would normally return in terms of number of rows. Therefore SQLAlchemy keeps the use of Result.unique() explicit, so there’s no ambiguity that the returned objects are being uniqified on primary key.

products = session.execute(query).unique().scalars().all()

This fixes the results and makes them unique:

Product: 0
Category: 3
Category: 7
Product: 1
Category: 8
Product: 2
Category: 1
Category: 4
Product: 3
Category: 0
Product: 4
Category: 4
Category: 5
Category: 7
Product: 5
Category: 8
Category: 9
Product: 6
Category: 5
Product: 7
Category: 0
Category: 9
Product: 8
Category: 7
Product: 9
Category: 8

However the N + 1 problem is still there. Even though an explicit join is made with Product.categories they are not set when loading from the object: product.categories – a separate (lazy) query is made.

Fixing N + 1

SQLAlchemy has a special option called Joined Eager Loading that instead of lazily running a query when the data is required, it is known that the data is needed in all cases and one needs to tell sqlalchemy to load it up front.

from sqlalchemy.orm import joinedload

query = select(Products).options(
    joinedload(Products.categories)
)

products = session.execute(query).unique().scalars().all()

This results in SQL like:

SELECT *
FROM products
LEFT OUTER JOIN (
    category_products AS category_products_1 JOIN category AS category_1 ON category_1.id = category_products_1.category_id
) ON products.id = category_products_1.product_id

The LEFT OUTER JOIN is to allow for a lead object that does not refer to a related row. It can be changed to be an INNER JOIN.

This has solved the problem of efficiently getting the results of the many-to-many.
However how would one get the same result set with a filter or where condition on the category field?
Say we wanted all the products who was linked to category number 7 (id=8) – along with their categories?

Filtered Unique and No N+1

The products that are part of category 7 are:

Product: 0
Category: 3
Category: 7
Product: 4
Category: 4
Category: 5
Category: 7
Product: 8
Category: 7

So that should be what is seen once the query is formulated.

Let us step back though and do a filter on the original result set:

What does SQLalchemy say?

SQLALchemy in what kind of loading to do? says for one-to-many or many-to-many then selectinload() is generally the best loading method.

For many-to-one, joinedload() is the best.

select(Products).join(
    Products.categories
).where(
    Category.id == 8
)

Runs the query:

SELECT products.primary_subcategory, products.id, products.label, products.is_active 
    FROM products JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id JOIN category ON category.id = category_products_1.category_id 
    WHERE category.id = 8

This query returns 3 records – the 3 correct products: 0, 4 and 8. However it does not include all the linked categories for the items – it does extra queries for those returned (lazily).

Instead a joinedload() will be used:

select(Products).options(
    joinedload(Products.categories)
).where(
    Category.id == 8
)

returns SQL:

SELECT *
FROM category, products
LEFT OUTER JOIN (
    category_products AS category_products_1
    JOIN category AS category_1 ON category_1.id = category_products_1.category_id
)
ON products.id = category_products_1.product_id 
WHERE category.id = 8

This returns 16 records. The results displayed show all the records. So it appears the filtering is not taking place.

If the normal join and the joinedload are combined:

select(Products).join(
    Products.categories  
).options(
    joinedload(Products.categories)
).where(
    Category.id == 8
)

this results in:

SELECT *
FROM products
JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id
LEFT OUTER JOIN (
    category_products AS category_products_2
    JOIN category AS category_1 ON category_1.id = category_products_2.category_id
) ON products.id = category_products_2.product_id 
WHERE category.id = 8;

Which returns 6 rows. Along with the correct results:

Product: 0
Category: 3
Category: 7
Product: 4
Category: 4
Category: 5
Category: 7
Product: 8
Category: 7

which matches what we were expecting.

Job done.

SelectInLoad

As mentioned above the recommended technique is selectinload over joinedload.
Select In Load will first query the one table with the conditions and then use an IN in another query to select related fields.
2 or more queries are made.

Tried it with:

select(Products).join(
    Products.categories  
).options(
    selectinload(Products.categories)
).where(
    Category.id == 8
)

that emitted SQL:

-- query 1
SELECT * 
FROM products JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id 
WHERE category.id = 8

-- query 2
SELECT * 
FROM products AS products_1
JOIN category_products AS category_products_1 ON products_1.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id 
WHERE products_1.id IN (1,5,9);

Using EXPLAIN (ANALYZE, BUFFERS) <the query> the output was compared:

joinedload:

  • "Total Cost": 96.82
  • "Planning Time": 0.541
  • "Execution Time": 0.250

selectinload:

query 1

  • "Total Cost": 69.10
  • "Planning Time": 0.360
  • "Execution Time": 0.212

query 2

  • "Total Cost": 51.39,
  • "Planning Time": 0.388
  • "Execution Time": 0.140

Hard to say what is more efficient- with a differing data set size. In this case however, joinedload wins.

Adding More Conditions

You might have seen earlier that the Products and Category models have a field called is_active.
Up to now this field has not been queries, however now we want to ensure that only active products and active categories are returned.

We will be setting Product with label 8 (id:9) as inactive and Category with label 5 (id=6) as inactive.
So the results should look like this:

Product: 0
Category: 3
Category: 7
Product: 4
Category: 4
Category: 7

How can this be achieved?

First for the products being active a condition was added:

query = select(Products).join(
    Products.categories  
).options(
    selectinload(Products.categories)
).where(
    Category.id == 8,
    Products.is_active == True
)

returns SQL:

-- query 1
SELECT *
FROM products
JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id 
WHERE category.id = 8 AND products.is_active = true;

-- query 2
SELECT *
FROM products AS products_1
JOIN category_products AS category_products_1 ON products_1.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id 
WHERE products_1.id IN (1,5)

and results:

Product: 0
Category: 7
Category: 3
Product: 4
Category: 4
Category: 7
Category: 5

Looking good, now to exclude the categories…

Reading through some info – it is not possible to add a filter/where condition to a selectinload or joinedload as the alias is not available to the caller. The solution appears to be making use of the contains_eager and aliasing.

contains_eager is difficult to understand. Was looking around and saw this issue on extra criteria and saw one of the examples use .and_(<some condition>))

Tried that:

select(Products).join(
    Products.categories  
).options(
    joinedload(Products.categories.and_(Category.is_active == True))
).where(
    Category.id == 8,
    Products.is_active == True
)

which gives exactly the query we are after (adding the condition to the join):

SELECT *
FROM products
JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id
LEFT OUTER JOIN (category_products AS category_products_2
JOIN category AS category_1 ON category_1.id = category_products_2.category_id AND category_1.is_active = true) ON products.id = category_products_2.product_id 
WHERE category.id = 8 AND products.is_active = true;

of with select in load:

select(Products).join(
    Products.categories  
).options(
    selectinload(Products.categories.and_(Category.is_active == True))
).where(
    Category.id == 8,
    Products.is_active == True
)

Results in these 2 queries:

SELECT *
FROM products
JOIN category_products AS category_products_1 ON products.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id 
WHERE category.id = %(id_1)s AND products.is_active = true

and

SELECT *
FROM products AS products_1
JOIN category_products AS category_products_1 ON products_1.id = category_products_1.product_id
JOIN category ON category.id = category_products_1.category_id AND category.is_active = true
WHERE products_1.id IN (1, 5)

And the correct results:

Product: 0
Category: 7
Category: 3
Product: 4
Category: 4
Category: 7

a bit of work, reading, testing and understanding needed when working with sqlalchemy…

From tests the joinedload performs better

Caveat

Found a caveat where if the product is not linked to a category at all…it will not show up.

In this case an outer join is wanted:

select(Products).join(
    Products.categories, isouter=True
).options(
    joinedload(Products.categories.and_(Category.is_active == True))
).where(
    Category.id == 8,
    Products.is_active == True
)

Sources