Intermediate models, a flexible tool for modeling DRY dbt data pipelines

Leverage intermediate models to improve your dbt project

Jeremie Pineau

--

Most dbt practitioners agree on a common definition for staging, fact and dimension models. Intermediate models, however, are always a bit more ambiguous. And for good reason!

You can view intermediate models as the glue or the bridge that holds your pipeline together. Intermediate models work behind the scenes, away from your end user’s eyes to ensure your pipeline works efficiently and delivers the value you expect from your analytics project.

We shouldn’t perceive intermediate models as simply another layer within our project, but rather as a tool that can be molded into whatever the context demands. They don’t belong in an isolated layer like our staging models, which serve the purpose of cleaning our source data for our analytics needs. They belong everywhere.

That sounds great and all, but what about some examples of valuable use cases for our intermediate models? Below, I’ve compiled a few ways you can leverage intermediate models to improve a dbt project.

Before digging into more concrete examples, I want to preface by saying that not every example I elaborate on below follows Ralph Kimball’s dimensional modeling approach. That being said, Kimball’s technique is generally regarded as the go-to approach for modeling data pipelines today, and it’s one we at Montreal Analytics also stand behind. Knowing when to follow best practices, but especially knowing when to bend the rules in favour of a more contextually adapted approach, is what truly makes a great analytics engineer.

#1 — Breaking down a complex model

Intermediate models can be pretty useful for breaking down complex scripts. If you have a huge script with a few hundred lines and multiple big CTEs that execute complex bits of logic with a high likelihood of errors, the content of those CTEs is a prime candidate for being moved into separate intermediate models.

Separating this complex bit of logic into its own model lets you test the output of that particular operation, while also splitting likely points of failure across multiple scripts, which makes debugging much easier. Furthermore, a model broken down into multiple parts is easier to learn, since each piece of logic is separated and can be documented explicitly.

In the example below, we are creating a fact fulfillment model in which we have a complex CTE that enriches tracking data with customer support manual fixes. Since we believe there is a high likelihood of failure in joining manually generated customer support data, we can create an intermediate model to enrich our tracking data and test the output of this model, isolating a potential point of failure for our fact fulfillment model.
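As a rough sketch of what that intermediate model could look like (the model and column names here, such as stg_fulfillment_tracking and stg_support_fixes, are hypothetical), the risky join is isolated so it can be tested on its own:

```sql
-- models/intermediate/int_fulfillments_tracking_enriched.sql
-- Hypothetical sketch: enrich carrier tracking data with manual fixes
-- entered by customer support, isolating the risky join in its own model.

with tracking as (

    select * from {{ ref('stg_fulfillment_tracking') }}

),

support_fixes as (

    select * from {{ ref('stg_support_fixes') }}

)

select
    tracking.fulfillment_id,
    tracking.order_id,
    -- prefer the manual correction when customer support has overridden the carrier data
    coalesce(support_fixes.corrected_status, tracking.tracking_status) as tracking_status,
    coalesce(support_fixes.corrected_delivered_at, tracking.delivered_at) as delivered_at
from tracking
left join support_fixes
    on tracking.fulfillment_id = support_fixes.fulfillment_id
```

The fact fulfillment model then simply selects from this intermediate model, and the output of the join can be covered with unique and not_null tests on fulfillment_id in its own schema.yml.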

#2 — Making your code DRY-er by centralizing pieces of logic needed in multiple downstream models

Optimizing your data pipeline’s performance starts with making sure no resources are wasted repeating the same process. You may have heard of the concept of D.R.Y., or Don’t Repeat Yourself, from software development. The same thing applies to developing data pipelines: if two models need to perform the same calculation, you should find a way to centralize that piece of logic. The good news is, intermediate models are a great way to do this!

Instead of having the same aggregation of another model done in two different CTEs within two different models, you should create an intermediate model to perform this calculation and refer to the output of said intermediate model in both of your downstream models.
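For instance, a shared aggregation could live in a single intermediate model along these lines (a hedged sketch with made-up model names such as stg_order_items):

```sql
-- models/intermediate/int_order_totals.sql
-- Hypothetical sketch: compute order-level totals once, so that several
-- downstream models can reference the result instead of repeating the group by.

with order_items as (

    select * from {{ ref('stg_order_items') }}

)

select
    order_id,
    sum(amount) as order_total,
    count(*)    as item_count
from order_items
group by 1
```

Both downstream models can then reference int_order_totals rather than each repeating the same aggregation CTE, so the calculation runs only once per pipeline run.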

This point is where I may stray from dbt Labs’ recommended best practices. They typically recommend that each intermediate model be referenced by a single downstream model. However, in my experience at Montreal Analytics, intermediate models can also be of great use when we need to centralize bits of logic to improve performance.

#3 — Improve performance by leveraging incremental materialization

Mart-level models will often require you to perform aggregations on large quantities of data. Those aggregation queries often scale out of proportion as your project grows, since more data needs to be loaded into memory to perform them each time new data is added to a source.

Bringing this aggregation into a separate intermediate model is a good solution, since you will be able to process the latest data incrementally, something you wouldn’t have been able to do as part of a CTE. Separating this aggregation also allows you to reuse it in other mart models that need the same calculations, which would otherwise incur a lot of extra processing time.

In the example below, our stg_impressions model contains a very high volume of data that needs to be aggregated to create the fct_impressions model. By creating an incremental intermediate model to take care of the aggregation, we can speed up our pipeline dramatically without necessarily having to make fct_impressions itself incremental.
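A minimal sketch of what that incremental intermediate model might look like, assuming a hypothetical impression_at timestamp column and a daily aggregation grain:

```sql
-- models/intermediate/int_impressions_daily.sql
-- Hypothetical sketch: aggregate the high-volume stg_impressions model
-- incrementally, so only recent data is processed on each run.

{{
    config(
        materialized='incremental',
        unique_key='impression_date'
    )
}}

with impressions as (

    select * from {{ ref('stg_impressions') }}

    {% if is_incremental() %}
    -- only scan rows at or after the latest day already present in this model;
    -- the unique_key lets dbt replace that partially loaded day
    where cast(impression_at as date) >= (select max(impression_date) from {{ this }})
    {% endif %}

)

select
    cast(impression_at as date) as impression_date,
    count(*)                    as impression_count
from impressions
group by 1
```

fct_impressions can then select from this much smaller aggregated table, and only the intermediate model needs to carry the incremental logic.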

#4 — Centralize dimensional logic before processing dimensionalized facts

Dimension models typically contain descriptive information about a given object that holds value for your business, such as a customer, a product or a vendor. We often use those dimension models within our Business Intelligence tools to slice and dice our facts by the relevant segments.

One use case that we end up seeing often is the need to create segments out of aggregated facts. As an example, let’s say a company wants to analyze its customers based on their lifetime value. This lifetime value would be a sum of all revenue generated by each customer, and we can source this information from our fact transaction model. This is what we call a dimensionalized fact, or an aggregated fact as a dimension attribute.

Looking at the flowchart below, we can see what our dbt pipeline might look like when creating this intermediate model to enrich our customer data with demographic data. However, since we are centralizing the dimensional data about our customers in an intermediate model first, we can also use this information within our fact transaction model without creating a circular reference through the dimensionalized facts in our dimension customers model (see dotted line).
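A hedged sketch of that intermediate model, assuming a hypothetical stg_transactions staging model and made-up lifetime value thresholds:

```sql
-- models/intermediate/int_customer_revenue.sql
-- Hypothetical sketch: aggregated facts as dimension attributes, centralized
-- so both dim_customers and fct_transactions can reference the same logic
-- without creating a circular dependency.

with transactions as (

    select * from {{ ref('stg_transactions') }}

)

select
    customer_id,
    sum(revenue) as lifetime_value,
    case
        when sum(revenue) >= 10000 then 'high value'
        when sum(revenue) >= 1000  then 'mid value'
        else 'low value'
    end as lifetime_value_segment
from transactions
group by 1
```

dim_customers would join this onto the customer staging model alongside the demographic attributes, while fct_transactions could pull the same lifetime_value_segment directly from the intermediate model instead of referencing dim_customers, which is what avoids the circular dependency.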

These are but a few examples of what you can use intermediate models for, but I hope they clarify my approach of using intermediate models as a flexible tool to help shape and optimize your data pipelines. You can read further on recommended best practices, naming conventions and structure in this article by dbt Labs.

Feel free to share with us some of your own use cases for intermediate models in the comments below.

If you have any questions or feedback, you can reach out to me on LinkedIn or on Twitter!

About Montreal Analytics

Proudly bearing the name of the city where it was born, Montreal Analytics is a full-stack data consultancy that leverages the power of various cloud technologies to further the democratization of data within companies. We help companies build up their Analytics organization through strategy, technology, as well as technical and business training.

We are trusted consulting partners of dbt Labs, Snowflake, Looker and Fivetran.

Let’s chat 👉🏼 https://montrealanalytics.com

We’re hiring! 👉🏼 Job Board
