Determining Application Type
Running efficient queries on a Canopy cluster requires that data be properly distributed across machines. This varies by the type of application and its query patterns.
There are broadly two kinds of applications that work very well on Canopy. The first step in data modeling is to identify which of them more closely resembles your application.
At a Glance
Multi-Tenant Applications |
Real-Time Applications |
---|---|
Sometimes dozens or hundreds of tables in schema |
Small number of tables |
Queries relating to one tenant (company/store) at a time |
Relatively simple analytics queries with aggregations |
OLTP workloads for serving web clients |
High ingest volume of mostly immutable data |
OLAP workloads that serve per-tenant analytical queries |
Often centering around a big table of events |
Examples and Characteristics
Multi-Tenant Application
These are typically SaaS applications that serve other companies, accounts, or organizations. Most SaaS applications are inherently relational. They have a natural dimension on which to distribute data across nodes: just shard by tenant_id.
Canopy enables you to scale out your database to millions of tenants without having to re-architect your application. You can keep the relational semantics you need, like joins, foreign key constraints, transactions, ACID, and consistency.
Examples: Websites which host store-fronts for other businesses, such as a digital marketing solution, or a sales automation tool.
Characteristics: Queries relating to a single tenant rather than joining information across tenants. This includes OLTP workloads for serving web clients, and OLAP workloads that serve per-tenant analytical queries. Having dozens or hundreds of tables in your database schema is also an indicator for the multi-tenant data model.
Scaling a multi-tenant app with Canopy also requires minimal changes to application code. We have support for popular frameworks like Ruby on Rails and Django.
Real-Time Analytics
Applications needing massive parallelism, coordinating hundreds of cores for fast results to numerical, statistical, or counting queries. By sharding and parallelizing SQL queries across multiple nodes, Canopy makes it possible to perform real-time queries across billions of records in under a second.
Examples: Customer-facing analytics dashboards requiring sub-second response times.
Characteristics: Few tables, often centering around a big table of device-, site- or user-events and requiring high ingest volume of mostly immutable data. Relatively simple (but computationally intensive) analytics queries involving several aggregations and GROUP BYs.
If your situation resembles either case above then the next step is to decide how to shard your data in the Canopy cluster. As explained in the Concepts section, Canopy assigns table rows to shards according to the hashed value of the table’s distribution column. The database administrator’s choice of distribution columns needs to match the access patterns of typical queries to ensure performance.