STORAGE.CLUSTERING_SCHEME | Agile Data Engine

Target database: Databricks SQL

STORAGE.CLUSTERING_SCHEME option enables defining clustering for an entity. Requires also SORT_KEY to be defined.

Target DMBS	Entity types	Feature
Databricks SQL	EXTERNAL_TABLE	Clusters entity with SQL snippet defined in clustering scheme by replacing the defined <entity_clustering_key> with the SORT_KEY column(s)

See also:

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using

Usage

For Databricks SQL define the SQL snippet defining the clustering rule and refer to the sort key with syntax <entity_clustering_key>.

One can refer to individual SORT_KEY attributes with their numeric position in case the clustering pattern has need for separating the sort key columns in multiple sections of the cluster by clause. Syntax for this is <entity_clustering_key_N> (where N is 1-N based on attribute position in SORT_KEY).

Notes

Check Export SQL/Entity SQL after defining a SORT_KEY to see generated DDL

Databricks SQL

Can be used with external tables, define the clustering column(s) with SORT_KEY.

Examples

Databricks SQL clustered external table

External table cluster by clause contains three sections:

clustering columns
sorting order (optional)
number of buckets to cluster the content to

Therefore STORAGE.CLUSTERING_SCHEME needs to be used together with SORT_KEY to define the clustering

Key definition:

KEY NAME: F_TRIP
KEY TYPE: SORT_KEY
ATTRIBUTE NAME 1: category
ATTRIBUTE NAME 2: event_time

Physical option definition:

Physical option name: STORAGE.CLUSTERING_SCHEME
Physical option value: (<entity_clustering_key_1>) SORTED BY <entity_clustering_key_2> DESC INTO 5 BUCKETS

DDL generated by Agile Data Engine:

SQL

CREATE TABLE pub.F_TRIP
(
  ...
)
LOCATION 's3::/data/location'  
PARTITION BY (category) SORTED BY event_time DESC INTO 5 BUCKETS;