Integrating dbt and ClickHouse
The dbt-clickhouse Adapter
dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of Extract Load and Transform (ELT). Users can create a model defined by a SELECT statement.
Within dbt, these models can be cross-referenced and layered to allow the construction of higher-level concepts. The boilerplate SQL required to connect models is automatically generated. Furthermore, dbt identifies dependencies between models and ensures they are created in the appropriate order using a directed acyclic graph (DAG).
dbt is compatible with ClickHouse through a ClickHouse-supported adapter.
Supported features
List of supported features:
- Table materialization
- View materialization
- Incremental materialization
- Microbatch incremental materialization
- Materialized View materializations (uses the `TO` form of MATERIALIZED VIEW, experimental)
- Seeds
- Sources
- Docs generate
- Tests
- Snapshots
- Most dbt-utils macros (now included in dbt-core)
- Ephemeral materialization
- Distributed table materialization (experimental)
- Distributed incremental materialization (experimental)
- Contracts
- ClickHouse-specific column configurations (Codec, TTL...)
- ClickHouse-specific table settings (indexes, projections...)
All features up to dbt-core 1.9 are supported. We will soon add the features added in dbt-core 1.10.
This adapter is not yet available inside dbt Cloud, but we expect to make it available soon. Please reach out to support for more information.
dbt concepts and supported materializations
dbt introduces the concept of a model. This is defined as a SQL statement, potentially joining many tables. A model can be "materialized" in a number of ways. A materialization represents a build strategy for the model's select query. The code behind a materialization is boilerplate SQL that wraps your SELECT query in a statement in order to create a new or update an existing relation.
dbt provides 5 types of materialization. All of them are supported by `dbt-clickhouse`:
- view (default): The model is built as a view in the database. In ClickHouse this is built as a view.
- table: The model is built as a table in the database. In ClickHouse this is built as a table.
- ephemeral: The model is not directly built in the database but is instead pulled into dependent models as CTEs (Common Table Expressions).
- incremental: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.
- materialized view: The model is built as a materialized view in the database. In ClickHouse this is built as a materialized view.
Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model's query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
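As a sketch, a model selects its materialization in its config block; the model, source, and column names below are hypothetical:

```sql
-- models/daily_orders.sql (hypothetical model)
{{ config(materialized='incremental', order_by='order_date') }}

select
    order_date,
    count() as orders
from {{ source('shop', 'orders') }}
{% if is_incremental() %}
-- On incremental runs, only process rows newer than what is already loaded
where order_date > (select max(order_date) from {{ this }})
{% endif %}
group by order_date
```

Changing `materialized` to `view` or `table` swaps the build strategy without touching the query itself.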
The current ClickHouse adapter also supports dictionary, distributed table, and distributed incremental materializations, as well as dbt snapshots and seeds.
The following are experimental features in `dbt-clickhouse`:
| Type | Supported? | Details |
|---|---|---|
| Materialized View materialization | YES, Experimental | Creates a materialized view. |
| Distributed table materialization | YES, Experimental | Creates a distributed table. |
| Distributed incremental materialization | YES, Experimental | Incremental model based on the same idea as the distributed table. Note that not all strategies are supported. |
| Dictionary materialization | YES, Experimental | Creates a dictionary. |
Setup of dbt and the ClickHouse adapter
Install dbt-core and dbt-clickhouse
dbt provides several options for installing the command-line interface (CLI), which are detailed here. We recommend using `pip` to install both dbt and dbt-clickhouse.
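For example, using the package names as published on PyPI:

```bash
pip install dbt-core dbt-clickhouse
```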
Provide dbt with the connection details for your ClickHouse instance: configure the `clickhouse-service` profile in the `~/.dbt/profiles.yml` file and provide the schema, host, port, user, and password properties. The full list of connection configuration options is available in the Features and configurations page.
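A minimal sketch of such a profile, with placeholder values (adjust the host, port, and credentials for your own service):

```yaml
# ~/.dbt/profiles.yml
clickhouse-service:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: analytics        # the ClickHouse database dbt builds into
      host: localhost
      port: 8123               # HTTP interface; use 8443 with secure: True
      user: default
      password: ''
      secure: False
```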
Create a dbt project
You can now use this profile in one of your existing projects, or create a new one using `dbt init`. Inside the `project_name` directory, update your `dbt_project.yml` file to specify a profile name to connect to the ClickHouse server.
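For example, after creating a project (e.g. with `dbt init project_name`), the profile is referenced from `dbt_project.yml`; the project name here is illustrative:

```yaml
# project_name/dbt_project.yml (excerpt)
name: 'project_name'
profile: 'clickhouse-service'
```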
Test connection
Execute `dbt debug` with the CLI tool to confirm whether dbt is able to connect to ClickHouse. Confirm the response includes `Connection test: [OK connection ok]`, indicating a successful connection.
Go to the guides page to learn more about how to use dbt with ClickHouse.
Testing and Deploying your models (CI/CD)
There are many ways to test and deploy your dbt project. dbt has some suggestions for best practice workflows and CI jobs. We are going to discuss several strategies, but keep in mind that these strategies may need significant adjustment to fit your specific use case.
CI/CD with simple data tests and unit tests
One simple way to kick-start your CI pipeline is to run a ClickHouse cluster inside your job and then run your models against it. You can insert demo data into this cluster before running your models. You can just use a seed to populate the staging environment with a subset of your production data.
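As an illustration, a seed is just a CSV file in the project's `seeds/` directory that `dbt seed` loads into a table; the file name and columns below are made up:

```csv
order_id,order_date,amount
1,2024-01-02,20.50
2,2024-01-03,14.00
```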
Once the data is inserted, you can then run your data tests and your unit tests.
Your CD step can be as simple as running `dbt build` against your production ClickHouse cluster.
More complete CI/CD stage: Use recent data, only test affected models
One common strategy is to use Slim CI jobs, where only the modified models (and their up- and downstream dependencies) are re-deployed. This approach uses artifacts from your production runs (i.e., the dbt manifest) to reduce the run time of your project and ensure there is no schema drift across environments.
To keep your development environments in sync and avoid running your models against stale deployments, you can use clone or even defer.
We recommend using a dedicated ClickHouse cluster or service for the testing environment (i.e., a staging environment) to avoid impacting the operation of your production environment. To ensure the testing environment is representative, it's important that you use a subset of your production data, as well as run dbt in a way that prevents schema drift between environments.
- If you don't need fresh data to test against, you can restore a backup of your production data into the staging environment.
- If you need fresh data to test against, you can use a combination of the `remoteSecure()` table function and refreshable materialized views to insert at the desired frequency. Another option is to use object storage as an intermediate and periodically write data from your production service, then import it into the staging environment using the object storage table functions or ClickPipes (for continuous ingestion).
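A sketch of the refreshable-materialized-view approach, run on the staging service; the host, credentials, table names, and refresh interval are all assumptions:

```sql
-- Pull the last 7 days of production data into staging once per hour.
CREATE MATERIALIZED VIEW staging.orders_refresh
REFRESH EVERY 1 HOUR
TO staging.orders
AS
SELECT *
FROM remoteSecure('prod-host.example.com:9440', 'default', 'orders', 'reader_user', '<password>')
WHERE order_date >= today() - 7;
```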
Using a dedicated environment for CI testing also allows you to perform manual testing without impacting your production environment. For example, you may want to point a BI tool to this environment for testing.
For deployment (i.e., the CD step), we recommend using the artifacts from your production deployments to only update the models that have changed. This requires setting up object storage (e.g., S3) as intermediate storage for your dbt artifacts. Once that is set up, you can run a command like `dbt build --select state:modified+ --state path/to/last/deploy/state.json` to selectively rebuild the minimum number of models needed based on what changed since the last run in production.
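As one possible sketch, assuming GitHub Actions and a hypothetical S3 bucket for storing the production manifest:

```yaml
# .github/workflows/deploy.yml (illustrative sketch, names are assumptions)
steps:
  - run: aws s3 cp s3://my-dbt-artifacts/manifest.json ./prod-state/manifest.json
  - run: dbt build --select state:modified+ --state ./prod-state
  - run: aws s3 cp ./target/manifest.json s3://my-dbt-artifacts/manifest.json
```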
Troubleshooting common issues
Connections
If you encounter issues connecting to ClickHouse from dbt, make sure the following criteria are met:
- The engine must be one of the supported engines.
- You must have adequate permissions to access the database.
- If you're not using the default table engine for the database, you must specify a table engine in your model configuration.
Understanding long-running operations
Some operations may take longer than expected due to specific ClickHouse queries. To gain more insight into which queries are taking longer, increase the log level to `debug`; this will print the time used by each query. For example, this can be achieved by appending `--log-level debug` to dbt commands.
Limitations
The current ClickHouse adapter for dbt has several limitations users should be aware of:
- The plugin uses syntax that requires ClickHouse version 25.3 or newer. We do not test older versions of ClickHouse. We also do not currently test Replicated tables.
- Different runs of the `dbt-adapter` may collide if they are run at the same time, as internally they can use the same table names for the same operations. For more information, check issue #420.
- The adapter currently materializes models as tables using an `INSERT INTO SELECT`. This effectively means data duplication if the run is executed again. Very large datasets (PB) can result in extremely long run times, making some models unviable. To improve performance, use ClickHouse materialized views by setting `materialized: materialized_view`. Additionally, aim to minimize the number of rows returned by any query by using `GROUP BY` where possible. Prefer models that summarize data over those that simply transform while maintaining the row counts of the source.
- To use Distributed tables to represent a model, users must create the underlying replicated tables on each node manually. The Distributed table can, in turn, be created on top of these. The adapter does not manage cluster creation.
- When dbt creates a relation (table/view) in a database, it usually creates it as `{{ database }}.{{ schema }}.{{ table/view id }}`. ClickHouse has no notion of schemas, so the adapter uses `{{ schema }}.{{ table/view id }}`, where `schema` is the ClickHouse database.
- Ephemeral models/CTEs don't work if placed before the `INSERT INTO` in a ClickHouse insert statement; see https://github.com/ClickHouse/ClickHouse/issues/30323. This should not affect most models, but care should be taken where an ephemeral model is placed in model definitions and other SQL statements.
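The Distributed-table limitation above can be sketched as follows; the cluster, database, and table names are assumptions:

```sql
-- Create the replicated local table on every node (here via ON CLUSTER),
-- then a Distributed table on top of it. The adapter does not do this for you.
CREATE TABLE db.events_local ON CLUSTER my_cluster
(
    event_date Date,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (event_date, user_id);

CREATE TABLE db.events ON CLUSTER my_cluster AS db.events_local
ENGINE = Distributed(my_cluster, 'db', 'events_local', rand());
```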
Fivetran
The `dbt-clickhouse` connector is also available for use in Fivetran transformations, allowing seamless integration and transformation capabilities directly within the Fivetran platform using dbt.