TG: Programming: Multi-Tenancy

Home - Programming - Multi-Tenancy

A common topic that comes up when building products that are cloud-based or otherwise centralized is that of storing multiple clients' data using a single database instance. These are my thoughts on that topic.

The problem set can typically be summarized as:

Minimizing database licensing fees
Ability to silo the data by tenant
Ability to (simply) query data across tenants
Ability to apply schema updates across tenant in one step
Ability to create a new tenant, with initialized tenant data
Ability to remove a tenant
Ability to move a tenant between database instances

There are certainly more that could be suggested based on use case, but I'm going to focus on the combined and conflicting requirements of limiting access by tenant while still being able to access all of the data as a single dataset, and I'm going to describe three approaches: using a tenant ID, using database schemas, and using separate database instances.

A tenant ID is typically a separate integer or string field added to every table in the system that contains a value identifying each row as being owned by a single tenant. This is multi-tenancy in its conceptually simplest form, and can be applied to any kind of persistence method. but that conceptual simplicity comes at a high cost; it can only be realized as a lowest common denominator solution and puts 100% of the onus on the application developer and system administrator to secure the data and limit visibility between tenants. All SQL operations must be performed using the correct tenant ID, unless cross-tenant actions are desired. It forces all tenants sharing a single database instance to share the same hardware regardless of actual usage: the drives, drive controllers, bus, RAM, CPU, etc. They also share the same database configurations and database feature licensing constraints. Worse, the software that uses that data is constrained since the database structure is the same for all tenants; you can't have different table structures for different clients without creating separate tables and controlling which is referenced with a configuration flag, or splitting that tenant off into a separate database instance.

Database schemas require a bit more knowledge of how databases work, but these effectively eliminate all the negatives of using a tenant ID. Schemas can not only have differing table structures, most SQL databases allow specifying tablespace settings that include the logical drive on which to house the data and indices, and the database performance and optimization settings to use. Access to schemas is granted via the connection credentials and, if used, the database role assigned to the user account used to connect, so while a typical tenant user may only have read-write-delete access to their schema, a high level user might be granted read-only access to multiple or even all schemas for analysis purposes.

And then there's the third option, separate database instances. This effectively takes the schemas concept to the next level, by having a database instance per client. If you're implementing a cloud solution in which your application and services are database agnostic, you may want to offer to your clients the ability to specify their preferred database, for which they will handle the licensing and hosting costs. All else being equal, this approach has the same benefits as using schemas. One of the negatives, though, comes in doing cross-tenant access. While querying across schemas is much slower than having all the data in a single table, querying across instances is even worse, and combining data between disparate SQL engines is a performance gamble. On the other hand, it could be argued that, if you're using separate databases for each tenant, you probably aren't going to do any cross-tenant reporting, and if you need analytical or monitoring information, you can surface a common API across all databases to allow you to aggregate that data.

My personal preference is to use database schemas. I've tried the tenant ID thing, and at the end I was completely unsatisfied with how it turned out. It was a pain to implement and test, and I couldn't just give a tenant unfettered access to their data for reporting and customization purposes without exposing every other tenant's data. Full disclosure, at the time I was designing the system, the idea of using schemas didn't occur to me or other two architects I was working with -- basically a collective failure of imagination. Given all the other bad-ass stuff that went into that project, I'm surprised we dropped the ball on that one. That design decision was made 10 years ago and, to the best of my knowledge, it's still in play. Had I not left that company 5 years ago, replacing the tenant ID with schemas would be high on my to-do list.