SQL Support and Workarounds

As Canopy provides distributed functionality by extending LightDB, it is compatible with LightDB constructs. This means that users can use the tools and features that come with the rich and extensible LightDB ecosystem for distributed tables created with Canopy.

Canopy has 100% SQL coverage for any queries it is able to execute on a single worker node. These kind of queries are common in Multi-tenant Applications when accessing information about a single tenant.

Even cross-node queries (used for parallel computations) support most SQL features. However, some SQL features are not supported for queries which combine information from multiple nodes.

Limitations for Cross-Node SQL Queries:

  • SELECT … FOR UPDATE work in single-shard queries only

  • TABLESAMPLE work in single-shard queries only

  • Correlated subqueries are supported only when the correlation is on the Distribution Column.

  • Outer joins between distributed tables are only supported on the Distribution Column

  • Outer joins between distributed tables and reference tables or local tables are only supported if the distributed table is on the outer side

  • Recursive CTEs work in single-shard queries only

  • Grouping sets work in single-shard queries only

To learn more about LightDB and its features, you can visit the LightDB documentation. For a detailed reference of the LightDB SQL command dialect (which can be used as is by Canopy users), you can see the SQL Command Reference.

Workarounds

Before attempting workarounds consider whether Canopy is appropriate for your situation. Canopy’ current version works well for real-time analytics and multi-tenant use cases.

Canopy supports all SQL statements in the multi-tenant use-case. Even in the real-time analytics use-cases, with queries that span across nodes, Canopy supports the majority of statements. The few types of unsupported queries are listed in Are there any LightDB features not supported by Canopy? Many of the unsupported features have workarounds; below are a number of the most useful.

Work around limitations using CTEs

When a SQL query is unsupported, one way to work around it is using CTEs, which use what we call pull-push execution.

SELECT * FROM ref LEFT JOIN dist USING (id) WHERE dist.value > 10;
/*
ERROR:  cannot pushdown the subquery
DETAIL:  There exist a reference table in the outer part of the outer join
*/

To work around this limitation, you can turn the query into a router query by wrapping the distributed part in a CTE

WITH x AS (SELECT * FROM dist WHERE dist.value > 10)
SELECT * FROM ref LEFT JOIN x USING (id);

Remember that the coordinator will send the results of the CTE to all workers which require it for processing. Thus it’s best to either add the most specific filters and limits to the inner query as possible, or else aggregate the table. That reduces the network overhead which such a query can cause. More about this in Subquery/CTE Network Overhead.

Temp Tables: the Workaround of Last Resort

There are still a few queries that are unsupported even with the use of push-pull execution via subqueries. One of them is using grouping sets on a distributed table.

In our real-time analytics tutorial we created a table called github_events, distributed by the column user_id. Let’s query it and find the earliest events for a preselected set of repos, grouped by combinations of event type and event publicity. A convenient way to do this is with grouping sets. However, as mentioned, this feature is not yet supported in distributed queries:

-- this won't work

  SELECT repo_id, event_type, event_public,
         grouping(event_type, event_public),
         min(created_at)
    FROM github_events
   WHERE repo_id IN (8514, 15435, 19438, 21692)
GROUP BY repo_id, ROLLUP(event_type, event_public);
ERROR:  could not run distributed query with GROUPING
HINT:  Consider using an equality filter on the distributed table's partition column.

There is a trick, though. We can pull the relevant information to the coordinator as a temporary table:

-- grab the data, minus the aggregate, into a local table

CREATE TEMP TABLE results AS (
  SELECT repo_id, event_type, event_public, created_at
    FROM github_events
       WHERE repo_id IN (8514, 15435, 19438, 21692)
    );

-- now run the aggregate locally

  SELECT repo_id, event_type, event_public,
         grouping(event_type, event_public),
         min(created_at)
    FROM results
GROUP BY repo_id, ROLLUP(event_type, event_public);
.
 repo_id |    event_type     | event_public | grouping |         min
---------+-------------------+--------------+----------+---------------------
    8514 | PullRequestEvent  | t            |        0 | 2016-12-01 05:32:54
    8514 | IssueCommentEvent | t            |        0 | 2016-12-01 05:32:57
   19438 | IssueCommentEvent | t            |        0 | 2016-12-01 05:48:56
   21692 | WatchEvent        | t            |        0 | 2016-12-01 06:01:23
   15435 | WatchEvent        | t            |        0 | 2016-12-01 05:40:24
   21692 | WatchEvent        |              |        1 | 2016-12-01 06:01:23
   15435 | WatchEvent        |              |        1 | 2016-12-01 05:40:24
    8514 | PullRequestEvent  |              |        1 | 2016-12-01 05:32:54
    8514 | IssueCommentEvent |              |        1 | 2016-12-01 05:32:57
   19438 | IssueCommentEvent |              |        1 | 2016-12-01 05:48:56
   15435 |                   |              |        3 | 2016-12-01 05:40:24
   21692 |                   |              |        3 | 2016-12-01 06:01:23
   19438 |                   |              |        3 | 2016-12-01 05:48:56
    8514 |                   |              |        3 | 2016-12-01 05:32:54

Creating a temporary table on the coordinator is a last resort. It is limited by the disk size and CPU of the node.