BackgroundBy default, Django 1.0 supports only a single, hardcoded database connection, defined by the settings.DATABASE_XXX variables. All models use the same connection. A scalable website needs several different database connections when MySQL is used. Models might be stored in separate databases, and sometimes a read-only replica of a database might be used for querying. For sharding purposes, several different databases might be used for the same model.
Using a custom Manager to choose the database
The "objects" attribute of Django models can be replaced with a custom Manager class. This allows you to override the get_query_set() method, which chooses the database connection. For example:
from django import db
class CustomSettings: DATABASE_HOST = '...'
CUSTOM_SETTINGS = CustomSettings()
class CustomManager(db.models.Manager): def custom_conn(self): conn = db.backend.DatabaseWrapper() cursor = lambda mgr: mgr._cursor(CUSTOM_SETTINGS) conn.cursor = new.instancemethod(cursor, conn, db.backend.DatabaseWrapper)
def get_query_set(self): query = db.models.sql.Query(self.custom_conn()) return CustomQuerySet(self.model, query)
class MyModel(db.models.Model): objects = CustomManager()
Managing transaction commits
Unfortunately, Django still uses a hard-coded connection object to commit transactions after saving models. This must be overridden or nothing will happen. First the INSERT:
class CustomManager(...): ...
def _insert(self, values, **kwargs): return self.insert_query(self.model, values, **kwargs) def insert_query(self, model, values, return_id=False, raw_values=False): conn = self.custom_conn() query = db.sql.InsertQuery(model, conn) query.insert_values(values, raw_values) rv = query.execute_sql(return_id) # Need to commit here conn._commit(); return rv
The UPDATE can't be committed in CustomManager, because it's called directly in the QuerySet object. This means you have to use an custom QuerySet object like this:
def _update(self, values, kw):
rv = super(CustomQuerySet, self)._update(values, kw)
# Need to commit here self.query.connection._commit() return rv
Choosing the shard in queries
When sharding is used, the database connection depends not only on the model, but also some parameter that needs to be passed with the query. Choosing a readonly connection is very similar. This can be done by adding a method to the custom Manager class:
def get_query_set(self, shard_id=None, readonly=False):
# Use the parameters to choose the connection .... def shard(self, shard_id): return self.get_query_set(shard_id) def readonly(self): return self.get_query_set(None, True)
To query the model from a specific shard
To use a read-only connection
ConclusionMultiple database and sharding are possible, but they require overriding some Django internals. This might break in future versions of Django.
Apparently work is already in progress for Django support multiple databases by default. Hopefully all the cases described in this article will be supported (multiple databases, read-only databases and sharded databases).
Update:The DELETE operation was actually quite difficult to implement, because Django uses a global module-level delete_objects() function instead of going through the Manager and/or QuerySet. My current solution is to override the Model's delete() and manually delete by the primary key from the table.
ForeignKey references in models also cause problems when the field is not really a foreign key but resides in another database. I had to change them into IntegerFields and handle them manually in application code.
It would be really nice if Django had some generic database query context, which could be set up when querying or inserting/updating/deleting models. This context could be used to carry information about what kind of database connection is needed. The global module-level connection object could then be replaced with a call to some overrideable function that would see the context and could decide which connection to return.