What is Canopy?
Canopy is an extension to LightDB that distributes data and queries across multiple nodes in a cluster. Because Canopy is an extension to LightDB (not a fork of it), when you use Canopy, you are also using LightDB. You can leverage the latest LightDB features, tooling, and ecosystem.
Canopy transforms LightDB into a distributed database with features like sharding, a distributed SQL engine, reference tables, and distributed tables. Canopy's combination of parallelism, keeping more data in memory, and higher I/O bandwidth can lead to significant performance improvements for multi-tenant SaaS applications, customer-facing real-time analytics dashboards, and time series workloads.
How Far Can Canopy Scale?
Canopy scales horizontally by adding worker nodes, and vertically by upgrading the hardware of the workers or the coordinator.
When to Use Canopy
Multi-Tenant Database
Most B2B applications already have the notion of a tenant, customer, or account built into their data model. In this model, the database serves many tenants, each of whose data is separate from other tenants.
Canopy provides full SQL coverage for this workload, and enables scaling out your relational database to 100K+ tenants. Canopy also adds new features for multi-tenancy. For example, Canopy supports tenant isolation to provide performance guarantees for large tenants, and has the concept of reference tables to reduce data duplication across tenants.
These capabilities allow you to scale out your tenants’ data across many machines, and easily add more CPU, memory, and disk resources. Further, sharing the same database schema across multiple tenants makes efficient use of hardware resources and simplifies database management.
Some advantages of Canopy for multi-tenant applications:
Fast queries for all tenants
Sharding logic in the database, not the application
Hold more data than possible in single-node LightDB
Scale out without giving up SQL
Maintain performance under high concurrency
Fast metrics analysis across customer base
Easily scale to handle new customer signups
Isolate resource usage of large and small customers
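To make "sharding logic in the database, not the application" more concrete, here is a minimal conceptual sketch of hash-based placement by tenant. It is not Canopy's implementation: the names `SHARD_COUNT`, `shard_for`, and `place_rows` are hypothetical, and the choice of SHA-256 as the hash is an assumption made only for illustration. The point is that every row carrying the same tenant id lands in the same shard, so a single-tenant query can be served from one place.

```python
import hashlib
from collections import defaultdict

# Hypothetical shard count for illustration; not a Canopy default.
SHARD_COUNT = 8


def shard_for(tenant_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Map a tenant id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap into range.
    return int.from_bytes(digest[:8], "big") % shard_count


def place_rows(rows, shard_count: int = SHARD_COUNT):
    """Group rows by the shard that their tenant_id hashes to."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row["tenant_id"], shard_count)].append(row)
    return shards


rows = [
    {"tenant_id": 1, "order_id": 100},
    {"tenant_id": 2, "order_id": 101},
    {"tenant_id": 1, "order_id": 102},
]
shards = place_rows(rows)
```

In a distributed database this routing happens inside the database layer, so the application keeps issuing ordinary SQL filtered by tenant and never computes a shard index itself.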
Real-Time Analytics
Canopy supports real-time queries over large datasets. Commonly these queries occur in rapidly growing event systems or systems with time series data. Example use cases include:
Analytic dashboards with sub-second response times
Exploratory queries on unfolding events
Large dataset archival and reporting
Analyzing sessions with funnel, segmentation, and cohort queries
Canopy's benefits here are its ability to parallelize query execution and to scale linearly with the number of worker databases in a cluster. Some advantages of Canopy for real-time applications:
Maintain sub-second responses as the dataset grows
Analyze new events and new data as it happens, in real-time
Parallelize SQL queries
Scale out without giving up SQL
Maintain performance under high concurrency
Fast responses to dashboard queries
Use one database, not a patchwork
Rich LightDB data types and extensions
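The idea of parallelizing a query across workers can be illustrated with a small scatter/gather sketch. This is a conceptual model under stated assumptions, not Canopy's executor: `partial_aggregate` and `distributed_average` are hypothetical names, the `latency_ms` column is invented for the example, and threads stand in for worker nodes. Each worker computes a partial result over its own shard, and the coordinator merges the partials.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(shard_rows):
    """Worker-side step: return (sum, count) for one shard."""
    values = [row["latency_ms"] for row in shard_rows]
    return sum(values), len(values)


def distributed_average(shards):
    """Coordinator-side step: fan out to shards, then merge partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # Merge sums and counts; averaging per-shard averages would be wrong
    # whenever shards hold different numbers of rows.
    return total / count if count else None


# Three hypothetical shards, one of them empty.
shards = [
    [{"latency_ms": 10}, {"latency_ms": 20}],
    [{"latency_ms": 30}],
    [],
]
average = distributed_average(shards)
```

Because each worker returns only a small `(sum, count)` pair, little data moves between nodes, which is why summary-style queries tend to suit this model better than queries that return large result sets.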
Considerations for Use
Canopy extends LightDB with distributed functionality, but it is not a drop-in replacement that scales out all workloads. A performant Canopy cluster involves thinking about the data model, tooling, and choice of SQL features used.
A good way to think about tools and SQL features is the following: if your workload aligns with the use cases described here and you run into an unsupported tool or query, there is usually a good workaround.
When Canopy is Inappropriate
Some workloads don’t need a powerful distributed database, while others require a large flow of information between worker nodes. In the first case Canopy is unnecessary, and in the second it generally does not perform well. Here are some examples:
When single-node LightDB can support your application and you do not expect to grow
Offline analytics, without the need for real-time ingest or real-time queries
Analytics apps that do not need to support a large number of concurrent users
Queries that return data-heavy ETL results rather than summaries