Data Federation With CockroachDB and Presto

Motivation

A customer inquired whether data federation is possible natively in CockroachDB. Unfortunately, CockroachDB does not support features like foreign data wrappers and such. A quick search returned a slew of possibilities and Presto being a prominent choice, sparked my interest.

High-Level Steps

  • Install Presto
  • Configure Postgresql catalog
  • Configure TPCH catalog
  • Verify
  • Wrap up

Step by Step Instructions

Install Presto

I’m using a Mac and luckily there’s a homebrew package available.

Add connectors to /usr/local/opt/prestodb/libexec/etc/catalog/. See:
https://prestodb.io/docs/current/connector.html

To restart prestodb after an upgrade:
  brew services restart prestodb
Or, if you don't want/need a background service you can just run:
  /usr/local/opt/prestodb/bin/presto-server run
==> Summary
  /usr/local/Cellar/prestodb/0.268: 1,967 files, 1.3GB

Remember the location of the connector catalog, we will use it shortly.

Before we can launch Presto, we need to configure JAVA_HOME. I’m on OSX and I used the following directions to set it up.

Configure Postgresql Connector

I am using CockroachDB Serverless for my data source. You can spin a free forever instance using this link

Being Postgresql wire compatible database, we can leverage the Postgresql connector to connect to CockroachDB with Presto, fill in your details and place the postgresql.properties file into /usr/local/opt/prestodb/libexec/etc/catalog/ directory.

connector.name=postgresql
connection-url=jdbc:postgresql://free-tier.gcp-us-central1.cockroachlabs.cloud:26257/yourclustername-5521.tpch?sslmode=verify-full&sslrootcert=/Users/username/.postgresql/root.crt
connection-user=yourusername
connection-password=yourpassword

A few things here, first, variable expansion does not work, you must specify the full path to the root certificate. In CockroachDB Serverless, you also need to prepend the database name tpch with the cluster name. So in other words, you need clustername.tpch and in my case artem-serverless-5521.tpch. Otherwise, you get the following error if you attempt to access the database connector.

Query 20220119_202320_00000_es5a4 failed: FATAL: codeParamsRoutingFailed: missing cluster name in connection string

Configure the Built-In TPCH Connector

Presto comes with a few built-in datasets, TPCH being one of them. I figured for time’s sake to leverage what’s out there.

Create tpch.properties file and place it in the same location as the postgresql.properties file.

At this point, we can start the Presto server:

presto-server run  --server localhost:8088

The port selected 8088 is in case you’re using an instance of CockroachDB with DB Console using the default port 8080. We’re trying to avoid port-used errors here.

Once the server is up, we can connect to the Presto CLI and pass the catalogs we’re going to use during our session:

presto --server localhost:8088 --catalog postgresql,tpch

Let’s browse the tables from within Presto:

Schema       

--------------------

 information_schema 

 sf1                

 sf100              

 sf1000             

 sf10000            

 sf100000           

 sf300              

 sf3000             

 sf30000            

 tiny

For some reason, I was unable to show tables in the TPCH workload from within Presto but knowing the workload, we know the table names are below:

  Table 
---------- 
 customer 
 lineitem 
 nation 
 orders
 part 
 partsupp 
 region 
 supplier

Using the fully qualified table names, we can fetch the table information:

describe tpch.sf100.part;

   Column    |    Type     | Extra | Comment 
-------------+-------------+-------+---------
 partkey     | bigint      |       |         
 name        | varchar(55) |       |         
 mfgr        | varchar(25) |       |         
 brand       | varchar(10) |       |         
 type        | varchar(25) |       |         
 size        | integer     |       |         
 container   | varchar(10) |       |         
 retailprice | double      |       |         
 comment     | varchar(23) |       |         

We do not package TPCH workload with our binary and have to find it elsewhere. I found the dataset at the following site. Exporting the part table and importing into CockroachDB is out of the scope of this article. I may follow up with a tutorial on that at a later time.

Once I have TPCH data loaded into my CockroachDB cluster, we can view it, since we’re in Presto CLI, let’s use that to inspect the data.

Verify

The hierarchy in Presto is based on the first catalog.

SHOW SCHEMAS FROM postgresql;

Schema       

--------------------

 crdb_internal      

 information_schema 

 pg_catalog         

 pg_extension       

 public

We know our data is in the default public schema. We have to reference both hierarchies to view the tables.

USE postgresql.public;
SHOW TABLES;

Table 
-------
 part  
(1 row)

Query 20220119_220205_00013_z5ufp, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:03 [1 rows, 20B] [0 rows/s, 6B/s] 

We can now query CockroachDB and Presto TPCH datasets simultaneously:

SELECT partkey, name, mfgr, brand, type, size, container, retailprice, comment FROM tpch.sf100.part A INNER JOIN postgresql.public.part B ON A.partkey = B.p_partkey WHERE A.partkey > 10000 AND A.partkey <= 10010;

While the query runs, let’s take a look at the Presto UI, recall we passed --server localhost:8088 the argument. The server exposes a UI we can access at the following address.

presto:public> SELECT partkey, name, mfgr, brand, type, size, container, retailprice, comment FROM tpch.sf100.part A INNER JOIN postgresql.public.part B ON A.partkey = B.p_partkey WHERE A.partkey > 10000 AND A.partkey <= 10010;
 partkey |                  name                  |      mfgr      |  brand   |          type           | size | contai
---------+----------------------------------------+----------------+----------+-------------------------+------+-------
   10001 | saddle honeydew red plum indian        | Manufacturer#4 | Brand#42 | SMALL ANODIZED STEEL    |   34 | SM BAG
   10002 | cyan grey saddle olive pale            | Manufacturer#5 | Brand#51 | STANDARD BRUSHED BRASS  |    3 | SM PAC
   10003 | spring sky moccasin orchid forest      | Manufacturer#2 | Brand#23 | STANDARD PLATED COPPER  |   30 | WRAP C
   10004 | burlywood dim moccasin forest yellow   | Manufacturer#1 | Brand#11 | ECONOMY ANODIZED COPPER |    4 | MED JA
   10005 | sky mint magenta black misty           | Manufacturer#3 | Brand#35 | MEDIUM POLISHED COPPER  |    4 | WRAP P
   10006 | azure metallic violet puff saddle      | Manufacturer#2 | Brand#25 | LARGE ANODIZED TIN      |   28 | JUMBO 
   10007 | burlywood ivory pink yellow forest     | Manufacturer#2 | Brand#25 | PROMO POLISHED BRASS    |    2 | SM CAN
   10008 | light lavender lime spring burnished   | Manufacturer#2 | Brand#24 | ECONOMY PLATED TIN      |   16 | MED CA
   10009 | indian smoke lime magenta tan          | Manufacturer#4 | Brand#44 | SMALL ANODIZED BRASS    |   47 | MED CA
   10010 | sienna saddle chartreuse rose honeydew | Manufacturer#3 | Brand#32 | MEDIUM ANODIZED NICKEL  |   10 | MED DR
(10 rows)

Query 20220119_220404_00014_z5ufp, FINISHED, 1 node
Splits: 81 total, 81 done (100.00%)
14:10 [20M rows, 0B] [23.5K rows/s, 0B/s]

We are accessing tpch.sf100.part dataset A that comes standard with Presto on the left and CockroachDB table on the right postgresql.public.part B simultaneously. This is a very simplistic scenario but at least it provides ideas on how to leverage Presto in federating data out of CockroachDB.

There’s more to learn about Presto and I’m only scratching the surface with this article. The query takes a bit of time given my Presto environment is not optimized. I defer to the Presto documentation for further tuning.

Update

My colleague suggested leveraging the follower reads pattern for this use case. I think it’s a brilliant idea and it makes follower reads a very suitable choice considering you’d use it in a live system. Alas, the CockroachDB specific AS OF SYSTEM TIME syntax is not compatible with Presto:

In cases where AS OF SYSTEM TIME syntax is not supported by third-party tooling, our engineering team introduced a session variable default_transaction_use_follower_reads=on we can leverage and force all read queries through follower reads. All we have to do is stop Presto, configure the postgresql.properties file with the session variable appended to the connection string and restart.

connection-url=jdbc:postgresql://free-tier.gcp-us-central1.cockroachlabs.cloud:26257/yourclustername-5521.tpch?sslmode=verify-full&sslrootcert=/Users/username/.postgresql/root.crt&default_transaction_use_follower_reads=on

Let’s run our query again but without AS OF SYSTEM TIME syntax

presto> SELECT p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment FROM postgresql.public.part limit 5;
 p_partkey |               p_name               |          p_mfgr           |  p_brand   |          p_type          | p_size | 
-----------+------------------------------------+---------------------------+------------+--------------------------+--------+-
         1 | burlywood plum powder puff mint    | Manufacturer#4            | Brand#43   | LARGE PLATED TIN         |     31 | 
         2 | hot spring dodger dim light        | Manufacturer#5            | Brand#55   | LARGE POLISHED STEEL     |      4 | 
         3 | dark slate grey steel misty        | Manufacturer#5            | Brand#53   | STANDARD PLATED COPPER   |     30 | 
         4 | cream turquoise dark thistle light | Manufacturer#1            | Brand#13   | STANDARD BURNISHED BRASS |      3 | 
         5 | drab papaya lemon orange yellow    | Manufacturer#1            | Brand#14   | ECONOMY BRUSHED BRASS    |      7 | 
(5 rows)

At this point let’s navigate to the Statements page and observe whether queries are indeed read via follower reads.

Here are all of the queries on this cluster:

Cluster Queries

Let’s filter by the JDBC Driver:

PostgreSQL JDBC Driver

The following list is all of the statements executed via our connection:

List of Statements

The two statements we need to focus on are our query and SET SESSION statement. Let’s look at the query first:

Query Statement

Let’s look at the SET SESSION

SET SESSION Statement

Conclusion

This tutorial only scratches the surface of federating data using an external project like Presto. The world of possibilities is ahead and I hope I was able to rouse some excitement about this new application.

.

Leave a Comment