Enhance Your Routine DB Activities With Django's ORM

July 21, 2020 by Susmitha Gudapati

Database interactions have never been trivial in the routine of building efficient web applications. They play a major role - often dictating performance almost single-handedly in data-intensive applications. There are, however, some nearly hidden approaches that aren't part of most developers' daily routines. These can enhance our interactions with the database and let us unleash the full power of Django's ORM. I had a chance to play around with these simple yet interesting methods recently and made some interesting discoveries. Let's dive in!

This is the sample model schema I will be referring to in this blog -

from django.db import models

class User(models.Model):
    name = models.CharField(max_length=64, null=True)

class Category(models.Model):
    name = models.CharField(max_length=128)
    is_active = models.BooleanField(default=False)
    created_by = models.ForeignKey(
        User, on_delete=models.PROTECT, related_name='category_created_by', null=True)

class ProductAttribute(models.Model):
    name = models.CharField(max_length=64)

class Product(models.Model):
    name = models.CharField(max_length=128, db_index=True)
    upc = models.CharField(max_length=32, unique=True)
    stock = models.PositiveIntegerField(default=0)

    # fields referenced by the examples later in this post
    is_active = models.BooleanField(default=True)
    price = models.DecimalField(max_digits=8, decimal_places=2, default=0)
    view_count = models.PositiveIntegerField(default=0)

    category = models.ForeignKey(
        Category, on_delete=models.PROTECT, related_name='categories', null=True)

    attributes = models.ManyToManyField(ProductAttribute)

    created_by = models.ForeignKey(
        User, on_delete=models.PROTECT, related_name='product_created_by', null=True)

    def __str__(self):
        return "%s (%s)" % (self.name, ", ".join(
            attribute.name for attribute in self.attributes.all()))

Profiling Query Operations First

How do you profile querysets in your application? There are a couple of tools that provide insight into how long each query takes to execute and whether it is efficient. django-debug-toolbar is one widely used tool for this purpose; django-silk and django-live-profiling are two other great profiling options.
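If you want to try django-debug-toolbar, here's a minimal setup sketch, assuming a standard project layout - refer to the package's documentation for the full, current instructions:

# settings.py - development only (the toolbar requires DEBUG = True)
INSTALLED_APPS = [
    # ... your existing apps ...
    "debug_toolbar",
]

MIDDLEWARE = [
    # ... your existing middleware ...
    "debug_toolbar.middleware.DebugToolbarMiddleware",
]

# the toolbar is rendered only for requests from these client IPs
INTERNAL_IPS = ["127.0.0.1"]

# urls.py
from django.urls import include, path

urlpatterns = [
    # ... your existing routes ...
    path("__debug__/", include("debug_toolbar.urls")),
]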

The other way to check on querysets and their execution time is through the connection module.

>>> from django.db import connection

>>> Product.objects.all()
<ProductQuerySet [<Product: The shellcoder's handbook>, <Product: Hacking Exposed Wireless>, <Product: Coders at Work>,
  <Product: Visual Guide to Lock Picking>, <Product: Studyguide for Counter Hack Reloaded>, <Product: Gray Hat Hacking>,
  <Product: The Girl Who Played with Non-Fire>, <Product: Reversing>, <Product: Cyberpunk>, <Product: Applied cryptography>,
  <Product: The Girl Who Kicked the Hornet's Nest>, <Product: Hacker's Delight>, <Product: Google Hacking>,
  <Product: Silence On The Wire>, <Product: Hacking Work>, <Product: The Girl with the Dragon Tattoo>, <Product: Metasploit>,
  <Product: We Are Anonymous>, <Product: Social Engineering>, <Product: The Cathedral & the Bazaar>, '...(remaining elements truncated)...']>

>>> Product.objects.count()
200

>>> connection.queries
[{'sql': 'SELECT "catalogue_product"."id", "catalogue_product"."upc", "catalogue_product"."name", "catalogue_product"."created_by", ... FROM "catalogue_product"', 'time': '0.014'},
 {'sql': 'SELECT COUNT(*) AS "__count" FROM "catalogue_product"', 'time': '0.000'}]

connection.queries returns a list of dictionaries containing the raw SQL queries Django has run, along with their execution times. This helps in understanding precisely what the queries are doing and what they cost. Note that this log is only populated when DEBUG is True.
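When profiling a specific block of code, it helps to empty the accumulated log first with reset_queries(), which also lives in django.db. A quick sketch:

from django.db import connection, reset_queries

reset_queries()  # clear the query log collected so far

list(Product.objects.filter(category__name="shoes"))  # force evaluation

# only the queries triggered by the block above are in the log now
for query in connection.queries:
    print(query["time"], query["sql"])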

There is also a method called QuerySet.explain() (available since Django 2.1) which can be applied to an individual queryset to understand its metrics. It returns the database's execution plan for the query, which gives you a rough idea of how efficient the queryset is. Using explain() to decide whether a queryset should be optimized, or whether indexes should be added, is often a good technique to improve performance.

# Please note that results may vary depending on the database you're using.

Product.objects.all().explain()
'Seq Scan on catalogue_product  (cost=0.00..06.09 rows=200 width=10)'

Product.objects.all().order_by("name").explain()
'Sort  (cost=10.14..10.67 rows=200 width=10)\n  Sort Key: name\n  ->  Seq Scan on catalogue_product  (cost=0.00..06.09 rows=200 width=10)'

Understanding QuerySets

An interesting trait of querysets in Django is that operations like creating, filtering, and slicing don't require any database activity. Moreover, no matter how many operations are chained onto a queryset, the query isn't executed until the queryset is evaluated. Because of this, querysets in Django are described as 'lazy'.

# instantiates a queryset and adds a filter to it,
# but this doesn't require any db activity.
products = Product.objects.filter(name__startswith="sports")

# chaining more filters onto it - still no interaction with the db.
products = products.exclude(attributes__name__in=["neon", "metallic"])
products = products.filter(category__name="shoes")

# connects with db to evaluate the query and prints results.
print(products)

In Django, the results of a queryset are only fetched when they're asked for. For example, the code above chains three filtering operations. Though it may look like this block hits the database three times, it executes the query and connects with the database only once - at the print(products) statement.
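We can verify this with the connection.queries log from the profiling section (again, this requires DEBUG=True):

from django.db import connection, reset_queries

reset_queries()

products = Product.objects.filter(name__startswith="sports")
products = products.exclude(attributes__name__in=["neon", "metallic"])
products = products.filter(category__name="shoes")

print(len(connection.queries))  # 0 - no database activity yet

print(products)  # the queryset is evaluated here

print(len(connection.queries))  # 1 - a single query for the whole chain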

Caching in QuerySets

Once a queryset is executed, Django caches the whole result set along with the results of non-callable attributes. When a queryset is instantiated, its cache is empty. The first time the queryset is evaluated, the results are fetched from the database and saved in the queryset's cache; any further evaluations of that queryset are served from the cache.

So, every queryset carries a cache that minimises the number of database connections. Being aware of this caching behaviour while building logic - and being careful about which attributes are triggered from templates - improves performance by avoiding repeated database hits.

# fetches the Product object; its related-object cache is empty.
product = Product.objects.get(id=5)

# the Category object is retrieved from the database at this point
# and cached on the instance.
product.category

# served from the cached result.
product.category


# callable attributes query the database every single time.
product.attributes.all()  # query performed against the db
product.attributes.all()  # query performed again

Although this seems pretty straightforward, querysets can be misleading if they aren't handled correctly. Understanding when a queryset populates its cache is really important for performance. Results are only cached when the entire queryset is loaded and evaluated - evaluating only a part of it, as with indexing or slicing, doesn't populate the cache. Here are a few examples to help you better understand caching in querysets -

# creates and evaluates a queryset, but the cache is discarded with it.
print([product.name for product in Product.objects.all()])

# hits the db again - this creates and evaluates a brand new queryset.
print([product.upc for product in Product.objects.all()])


# store the queryset in a variable so it can be reused.
products = Product.objects.all()

# evaluates the queryset and caches the results.
print([product.name for product in products])

# fetches from the cache.
print([product.upc for product in products])

# --------------------------------- #

# creates a queryset
products = Product.objects.all()

# queries the database
print(products[5])

# queries the database again because the queryset was never fully evaluated
print(products[5])

# --------------------------------- #

# creates a queryset
products = Product.objects.all()

# queries the database and caches the results, since the entire queryset is evaluated
[product for product in products]

# retrieves from the cache
print(products[5])

# retrieves from the cache
print(products[5])

Indexing

This is a pretty standard technique to improve the speed of retrieval operations. It is achieved either through Meta.indexes or Field.db_index on a model. Adding indexes to fields that are routinely used in operations like filter(), exclude(), and order_by() improves the application's performance, since indexes speed up lookups.

Indexes should only be added to frequently queried fields, because the overhead of maintaining an index may outweigh any gains in query speed.

# create an index using the Field.db_index option
class Product(models.Model):
    name = models.CharField(max_length=128, db_index=True)
    upc = models.CharField(max_length=32, unique=True)
    category = models.ForeignKey(
        Category, on_delete=models.PROTECT, related_name='categories', null=True)
    created_by = models.ForeignKey(
        User, on_delete=models.PROTECT, related_name='product_created_by', null=True)
    
    
# create an index using the Meta.indexes option
class Product(models.Model):
    name = models.CharField(max_length=128)
    upc = models.CharField(max_length=32, unique=True)
    category = models.ForeignKey(
        Category, on_delete=models.PROTECT, related_name='categories', null=True)
    created_by = models.ForeignKey(
        User, on_delete=models.PROTECT, related_name='product_created_by', null=True)
    
    class Meta:
        indexes = [
            models.Index(fields=['name'], name='product_name_idx'),
        ]    
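Meta.indexes can also declare multi-column indexes, which help when the same combination of fields is routinely filtered on together. A minimal sketch - the index name below is just illustrative:

# a composite index spanning two fields
class Product(models.Model):
    name = models.CharField(max_length=128)
    upc = models.CharField(max_length=32, unique=True)
    stock = models.PositiveIntegerField(default=0)

    class Meta:
        indexes = [
            # speeds up queries like .filter(name=..., stock__gt=...)
            models.Index(fields=['name', 'stock'], name='product_name_stock_idx'),
        ]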

select_related

select_related enhances performance when accessing related-object data - foreign-key or one-to-one fields - on the instances a queryset returns. Adding select_related makes the query more complex to execute, because Django performs a SQL join and fetches all the fields of the related object alongside. In return, you can then access those foreign-key relationships without hitting the database again.

# Hits the database.
product = Product.objects.get(id=10)

# Hits the database again to get the related Category object.
category = product.category


# using select_related
# Hits the database
product = Product.objects.select_related('category').get(id=10)

# Doesn't hit the database because Category obj is already fetched in the above query.
category = product.category

This might look like it saves just a single database hit - but imagine what happens when we access related-object data in a loop!

>>> Product.objects.filter(is_active=True).count()
200

category_names = set()

# Hits the database 201 times. 1 for fetching active products 
# and the rest 200 queries are for retrieving category for each product.
products = Product.objects.filter(is_active=True)
for product in products:
  category_names.add(product.category.name)
  

# using select_related

category_names = set()

# This is a single query and hits the database only once!!
products = Product.objects.filter(is_active=True).select_related('category')
for product in products:
  category_names.add(product.category.name)

To improve performance further, select_related can also follow a chain of foreign keys, using the same double-underscore syntax as relationship-spanning lookups. Here's an example -

product = Product.objects.get(id=10) # Hits the database.

category = product.category # Hits the database again.
user = category.created_by # Hits the database again!


# using select_related
product = Product.objects.select_related('category__created_by').get(id=10) # Hits the database

category = product.category # Doesn't hit the database
user = category.created_by # Doesn't hit the database

Lastly, here are a few things to note while using select_related -

  • The order of select_related and the filter doesn't matter. The two queries below are equivalent.
Product.objects.filter(is_active=True).select_related('category')
Product.objects.select_related('category').filter(is_active=True)
  • It can be applied to the reverse side of a one-to-one relationship - pass the related_name as the argument instead of the field name (see the sketch after this list).
  • To clear related field values stored by earlier select_related calls on a queryset, pass None as the argument.
new_queryset = queryset.select_related(None)
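To illustrate the reverse one-to-one case, here's a sketch with a hypothetical UserProfile model - it is not part of the sample schema above:

class UserProfile(models.Model):
    # hypothetical model, for illustration only
    user = models.OneToOneField(
        User, on_delete=models.CASCADE, related_name='profile')
    bio = models.TextField(blank=True)

# forward direction - pass the field name
profiles = UserProfile.objects.select_related('user')

# reverse direction - pass the related_name
users = User.objects.select_related('profile')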

select_related is magic, right? But then why is it restricted to single-valued relationships - foreign-key and one-to-one?

That's because of its execution model. select_related creates a SQL join and includes the fields of the related object in the SELECT statement, so it can only fetch related objects in the same database query. To avoid the much larger result set that would come from joining across a many-to-many or many-to-one relationship, select_related is limited to single-valued relationships.
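You can see that join by printing the SQL Django generates for the queryset (the exact SQL varies by database backend):

queryset = Product.objects.select_related('category')

# str(queryset.query) shows the SQL that will be sent - note the join.
print(queryset.query)
# SELECT "catalogue_product"."id", ..., "catalogue_category"."id", ...
# FROM "catalogue_product"
# LEFT OUTER JOIN "catalogue_category"
#     ON ("catalogue_product"."category_id" = "catalogue_category"."id")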

prefetch_related

This is very similar to select_related in that it cuts down the flood of database queries triggered by accessing related objects, but it differs in how it executes.

prefetch_related performs a separate lookup for each relationship and does the 'joining' in Python rather than in SQL. This is why it can prefetch many-to-many and many-to-one objects in addition to the relationships supported by select_related. Below is an example of querying with prefetch_related -

>>> Product.objects.count()
220

# 221 queries - 1 for the products, plus 1 per product,
# because __str__ fetches each product's attributes separately.
Product.objects.all()
[ "Jeans (cotton, skinny)", "Neon Shirt (attrx, attry)", ...]


# 2 queries - 1 for the products and 1 for all their attributes.
# attributes are prefetched when the products query is executed.
Product.objects.all().prefetch_related('attributes')
[ "Jeans (cotton, skinny)", "Neon Shirt (attrx, attry)", ...]

In the sample above, when __str__ is called for each product, the attributes are loaded from the queryset's cache instead of querying the database, because they were prefetched when the products query ran. The attributes of all the products are fetched with one single query!

Things to note while using prefetch_related -

  • To prefetch a related attribute onto instances you already hold in an iterable, use the prefetch_related_objects() function (see the sketch after this list).
  • If you evaluate the queryset with iterator(), prefetch_related() calls are ignored, since these two optimizations don't make sense together.
  • Chaining further query operations after prefetch_related() implies a different database query, so previously prefetched results are ignored and the data is fetched with a new query.
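For the first point, here's a minimal sketch of prefetch_related_objects(), along with the related Prefetch object that lets you customise the queryset used for prefetching:

from django.db.models import Prefetch, prefetch_related_objects

# prefetch onto a list of instances we already hold in memory
products = list(Product.objects.filter(stock__gt=0))
prefetch_related_objects(products, 'attributes')

# Prefetch customises the prefetching queryset -
# here each product's attributes arrive pre-sorted by name.
Product.objects.prefetch_related(
    Prefetch('attributes', queryset=ProductAttribute.objects.order_by('name')))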

SQL is faster than Python

F() expressions:

An F() object represents the value of a model field or annotated column. It lets you perform database operations on model field values without having to pull them out of the database into Python memory. Cutting that round trip out of the evaluation process can improve performance drastically.

# Python's implementation
product = Product.objects.get(upc='PXO01')

# loads stock value into Python memory and uses Python's operators to increment
product.stock += 5
product.save()


# SQL incrementing using F().
from django.db.models import F

product = Product.objects.get(upc='PXO01')

# increments the stock field by 5 units via an SQL expression.
product.stock = F('stock') + 5
product.save()

When Django encounters F(), it overrides the standard Python operators and builds an SQL expression that describes the required operation at the database level. In the code above, Django generates an SQL query that increments the stock by 5 units - the value never has to travel into Python.
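One caveat worth knowing: after save(), the in-memory attribute still holds the F() expression rather than the resulting number. Calling refresh_from_db() reloads the actual value - a quick sketch:

from django.db.models import F

product = Product.objects.get(upc='PXO01')
product.stock = F('stock') + 5
product.save()

# at this point product.stock is a combined F() expression, not an integer
product.refresh_from_db()
print(product.stock)  # the updated value, reloaded from the database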

F() can also be used on querysets to perform updates. Combined with update(), it collapses the get() and save() of the example above into a single query. Check out this one -

# Python's implementation
products = Product.objects.filter(category__name='shoes')
for product in products:
  product.stock += 5
  product.save()
  
# SQL incrementing using F() method.
Product.objects.filter(category__name='shoes').update(stock=F('stock') + 5)

Therefore, F() can boost performance by getting the database, rather than Python, to perform operations. It can also reduce the number of queries in some cases.

Using annotations to perform aggregations in the database

But what are annotation and aggregation methods?

Annotation is used to create dynamic fields on models by performing various query expressions. Each argument to this method is an annotation that will be added to each object in the queryset that is returned.

# usage of annotation
from django.db.models import Count

# annotates the number of attributes onto each product.
products = Product.objects.annotate(number_of_attributes=Count('attributes'))
products[0].number_of_attributes
4

# the other way to define
products = Product.objects.annotate(Count('attributes'))
products[0].attributes__count
4

Aggregation is used to retrieve values that are derived by summarizing or aggregating a collection of objects.

# basic usage of aggregation

# Max price of all products.
from django.db.models import Max
Product.objects.all().aggregate(Max('price'))
{'price__max': 70.15}

# Average view count across all products.
from django.db.models import Avg
Product.objects.all().aggregate(Avg('view_count'))
{'view_count__avg': 10}

Aggregation presents a super-efficient way to work with arithmetic operations on the model fields. Since it's an SQL method, everything is performed in the database itself. While aggregation works wonders alone, it can be more efficient when combined with the annotation method.

from django.db.models import Avg, Count

# first annotates each product with its number of attributes,
# then aggregates the average of that annotated value.

Product.objects.annotate(num_attributes=Count('attributes')).aggregate(Avg('num_attributes'))
{'num_attributes__avg': 3.46}

Other small but significant optimizations

Retrieving a single object using unique/indexed fields

When an individual object is being retrieved with get(), querying on unique or indexed columns enhances performance because those columns are indexed in the database. A unique column also guarantees that multiple objects can't match the query, which speeds up query execution.

# these 2 queries are faster because upc and id are unique and indexed.
product = Product.objects.get(upc='PXO01')
product = Product.objects.get(id=33)

# slower when the queried column isn't indexed.
product = Product.objects.get(name='Neon Shoes')

# this lookup can match multiple rows - get() raises MultipleObjectsReturned
# if it does - and a pattern match is slower than an exact lookup.
product = Product.objects.get(name__startswith='Neon')

Limiting fetch to exactly what is required

values() & values_list(): Queryset methods like get(), filter(), and exclude() return entire model objects in their results. When the whole object isn't required, applying values() or values_list() returns dictionaries or lists of just the required values instead of loading complete model objects. These can also replace model objects in template code, as long as they satisfy the use case. The block below shows how values_list() lifts performance.

# slower because it loads whole model objects into Python memory
%timeit upcs = [product.upc for product in Product.objects.all()]
3.91 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# much quicker here - but note that querysets are lazy, so this only builds
# the queryset. Wrap it in list() to time the actual fetch, which is still
# faster than above because only the upc column is selected.
%timeit Product.objects.values_list('upc', flat=True)
32.3 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
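values() works the same way but returns dictionaries, one per row - the output below is illustrative:

>>> Product.objects.values('id', 'name')[:2]
<QuerySet [{'id': 1, 'name': "The shellcoder's handbook"}, {'id': 2, 'name': 'Hacking Exposed Wireless'}]>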

defer() & only(): defer() tells Django not to load the named columns. This is most effective for columns with heavy text data, or fields that take a lot of processing to convert to Python. Note that accessing a deferred field later triggers a separate database query for every single row in the result queryset, so profiling columns before deferring them is super important.

# building a queryset for all fields (not evaluated - querysets are lazy)
%timeit Product.objects.all()
6.08 µs ± 97.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# building a deferred queryset adds a little construction overhead;
# the payoff comes at evaluation time, when deferred columns are skipped
%timeit Product.objects.defer('name', 'category')
14.4 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# more about using 'defer()'

# defers both the name and upc columns
Product.objects.defer("name").filter(stock__gt=5).defer("upc")

# defer also works on related-object fields loaded via select_related
Product.objects.select_related("category").defer("category__created_by")

only() is quite the opposite of defer(): it replaces the default set of fields to load immediately. It works best when almost all the fields would otherwise need to be deferred; specifying the complementary set of fields can result in much simpler code.

# building a queryset for all fields (not evaluated - querysets are lazy)
%timeit Product.objects.all()
6.08 µs ± 97.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# building a queryset that will load only the upc column
%timeit Product.objects.only('upc')
15.4 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# more about 'only()'
# only the final only() counts - this loads just upc immediately
Product.objects.only("name", "stock").only("upc")

# only() and defer() combined - this loads only the upc column immediately
Product.objects.only("name", "upc").defer("name")

# loads upc and name immediately as only() replaces any existing set of fields
Product.objects.defer("name").only("upc", "name")

Size and Existence

To get the size of a queryset, count() is much faster than len(). len() is a Python function: it loads the entire queryset into memory to compute the length, whereas count() runs a single SELECT COUNT(*) in the database.

# slower because it loads all products into Python memory to execute 'len()'
%timeit len(Product.objects.all())
3.47 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# faster because an SQL expression executes this
%timeit Product.objects.count()                                                                                                                                                                                                                      
267 µs ± 7.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

To check whether a queryset contains any results, the exists() method is faster than truth-testing the queryset in an if condition.

# evaluates the entire queryset in Python just to test truthiness
%timeit if Product.objects.all(): print("valid")                                                                                                                                                                                                                    
4.19 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# runs a lightweight EXISTS query in the database
%timeit Product.objects.exists()                                                                                                                                                                                                                     
226 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, SQL expressions are faster than Python operators or methods.

To sum it up -

Django has an excellent ORM layer, and incorporating these methods into your queries can give your application's performance a real boost. They may not seem very effective in smaller applications, but as your application grows and more and more queries hit the database, you'll start to see a huge difference.
