Multiple databases and sharding with Django

Disclaimer: I'm still learning Django so there may be smarter ways of doing the things I describe here. I also haven't tested this much though the principle seems to work. The example code has been edited and may not be fully correct.

Background

By default, Django 1.0 supports only a single, hardcoded database connection, defined by the settings.DATABASE_XXX variables. All models use the same connection.

A scalable website needs several different database connections when MySQL is used. Models might be stored in separate databases, and sometimes a read-only replica of a database might be used for querying. For sharding purposes, several different databases might be used for the same model.

Using a custom Manager to choose the database

The "objects" attribute of Django models can be replaced with a custom Manager class. This allows you to override the get_query_set() method, which chooses the database connection. For example:

from django import db
import new
class CustomSettings:
DATABASE_HOST = '…'
CUSTOM_SETTINGS = CustomSettings()
class CustomManager(db.models.Manager):
def custom_conn(self):
conn = db.backend.DatabaseWrapper()
cursor = lambda mgr: mgr._cursor(CUSTOM_SETTINGS)
conn.cursor = new.instancemethod(cursor, conn,
db.backend.DatabaseWrapper)
def get_query_set(self):
    query = db.models.sql.Query(self.custom_conn())
    return CustomQuerySet(self.model, query)

class MyModel(db.models.Model):
objects = CustomManager()

Managing transaction commits

Unfortunately, Django still uses a hard-coded connection object to commit transactions after saving models. This must be overridden or nothing will happen. First the INSERT:

class CustomManager(...):
    ...
def _insert(self, values, **kwargs):
    return self.insert_query(self.model, values, **kwargs)

def insert_query(self, model, values,
        return_id=False, raw_values=False):
    conn = self.custom_conn()
    query = db.sql.InsertQuery(model, conn)
    query.insert_values(values, raw_values)
    rv = query.execute_sql(return_id)
    # Need to commit here
    conn._commit();
    return rv</code></pre>

The UPDATE can't be committed in CustomManager, because it's called directly in the QuerySet object. This means you have to use an custom QuerySet object like this:
class CustomQuerySet(db.models.query.QuerySet):
    def _update(self, values, **kw):
        rv = super(CustomQuerySet, self)._update(values, **kw)
        # Need to commit here
        self.query.connection._commit()
        return rv
Choosing the shard in queries
When sharding is used, the database connection depends not only on the model, but also some parameter that needs to be passed with the query. Choosing a readonly connection is very similar. This can be done by adding a method to the custom Manager class:
class CustomManager(...):
    ...
    def get_query_set(self, shard_id=None, readonly=False):
        # Use the parameters to choose the connection
        ....
def shard(self, shard_id):
    return self.get_query_set(shard_id)

def readonly(self):
    return self.get_query_set(None, True)


  To query the model from a specific shard
  
    
    Link to heading
  

MyModel.objects.shard(42).get(id=76)

  To use a read-only connection
  
    
    Link to heading
  

MyModel.objects.readonly().get(id=76)

Conclusion
Multiple database and sharding are possible, but they require overriding some Django internals. This might break in future versions of Django.
Apparently work is already in progress for Django support multiple databases by default. Hopefully all the cases described in this article will be supported (multiple databases, read-only databases and sharded databases).
Update:
The DELETE operation was actually quite difficult to implement, because Django uses a global module-level delete_objects() function instead of going through the Manager and/or QuerySet. My current solution is to override the Model's delete() and manually delete by the primary key from the table.
ForeignKey references in models also cause problems when the field is not really a foreign key but resides in another database. I had to change them into IntegerFields and handle them manually in application code.
It would be really nice if Django had some generic database query context, which could be set up when querying or inserting/updating/deleting models. This context could be used to carry information about what kind of database connection is needed. The global module-level connection object could then be replaced with a call to some overrideable function that would see the context and could decide which connection to return.

Background

Using a custom Manager to choose the database

Managing transaction commits

Choosing the shard in queries

To query the model from a specific shard Link to heading

To use a read-only connection Link to heading

Conclusion

Update: