This article was written by Silja Märdla, Senior Analytics Engineer at Bolt.
TL;DR
This article describes how we’ve replaced manual LookML writing with an automatic sync process to ensure all the core models, metrics and marts defined in dbt are automatically available for Looker users.
Background
At Bolt, we operate in a near-perfect data mesh architecture. The Data Platform team offers the tooling, and the data analytics community creates the content according to their needs. In short, here’s what the toolbox looks like: S3 to store data, in-house tooling and Fivetran for data ingestion, dbt for data transformation, Databricks for on-demand compute, Presto for always-on compute, and Looker for reporting.
For a long time, there was very little guidance on what the content could or should look like. This setup has enabled our data users to move fast and create loads of impact. At one point, though, we started noticing some downsides of this setup. The number of data models in dbt was growing, and the number of assets in Looker was increasing, yet the data users were growing less satisfied with what was available to them.
We hired analytics engineers to improve the situation. Investigations in the dbt repository revealed many end-to-end models hosting similar business logic. Yet, quite often, a new task merited a new model because the models weren’t designed with a generic purpose. So we changed that and established some reusable core datasets: dimension and fact tables for the main business entities and pre-aggregated data marts to expose the main business metrics on frequently used granularities (per city, per user, etc). We defined those business metrics in dbt code, using semantic models and saved queries to generate the data marts.
I promise to write more about these, but I want to concentrate on how dbt and Looker play together.
dbt and Looker overlap
Reporting is not the only user of core datasets: experimentation, data science, product analytics, ad hoc analytics, and more use them. Therefore, we didn’t want to lock our business logic into the reporting tool. We have dbt in our toolbox, the purpose of which is to define data transformations. It was only logical that our core datasets would be represented in dbt.
When analysing this decision further, I see a rather considerable overlap between dbt and Looker offerings, at least in the data modelling domain. Both can host business logic to combine new tables from existing ones: models in dbt and derived tables in Looker. Both can define KPIs: metrics in dbt, measures in Looker. Therefore, both tools can be reasonable candidates for hosting the magical “single source of truth” of the business logic.
Looker is, in fact, doing a good job of covering all the use cases. You can do a lot inside Looker to define your tables and measures, and the APIs let you access that business logic from outside of Looker. For many companies, using Looker for their data model is a great idea. For a big company like Bolt, though, it can become complex and costly.
At Bolt, we have clear use cases for defining and pre-calculating data marts in dbt. To do so, we need our business logic to be available in dbt. That’s why we appreciate the ability to define metrics and saved queries that then define many of our data marts. Using metrics has enabled us to keep things consistent and separate technical ownership from business ownership. Check out my Metrics in Code talk from Bolt’s Analytics Meetup for more information.
We don’t want to duplicate the logic. We don’t want to define revenue calculation rules in dbt for the data marts and again in Looker to have them available for all ad hoc questions. Fortunately, dbt offers semantic layer interfaces for all kinds of tools to connect to; unfortunately, there’s no native support for Looker.
Looker views
What does it really mean to expose business logic defined in dbt to Looker? Let’s walk through the principles step by step.
Dimension tables
Let’s look at a dimension table. On the dbt side, it’s a table with columns.
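As a minimal sketch, assuming a simple dim_date model with a handful of illustrative columns, the dbt side could look like this:

```yaml
models:
  - name: dim_date
    description: "One row per calendar date."
    columns:
      - name: date_key
        description: "Unique key of the date."
      - name: calendar_date
        description: "The calendar date."
      - name: calendar_year
        description: "Calendar year the date belongs to."
```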
On the Looker side, a dimension table becomes a view with dimensions and dimension groups.
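Continuing the sketch, the corresponding LookML view could look roughly like this (the core schema name is illustrative):

```lookml
view: dim_date {
  sql_table_name: core.dim_date ;;

  dimension: date_key {
    primary_key: yes
    type: string
    description: "Unique key of the date."
    sql: ${TABLE}.date_key ;;
  }

  # Date columns become dimension groups with several timeframes
  dimension_group: calendar {
    type: time
    timeframes: [date, week, month, quarter, year]
    datatype: date
    description: "The calendar date."
    sql: ${TABLE}.calendar_date ;;
  }

  dimension: calendar_year {
    type: number
    description: "Calendar year the date belongs to."
    sql: ${TABLE}.calendar_year ;;
  }
}
```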
I’ve used a simple example of a date table here, but the same applies to any dimension table describing an entity. Our examples include dim_city, dim_car, dim_driver etc.
Notice that we want the descriptions to be unified, so we bring those along from dbt to Looker. Although I’ve used plain descriptions in the sample code above, we use doc blocks to define the descriptions in dbt. Again, the details make up a separate story.
Fact tables
Now, let’s look at how we use fact tables in Looker. Here’s a typical fact table about orders, in our case ride-hailing orders, defined as a dbt model. A simplified sketch, selecting from a hypothetical staging relation:
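```sql
-- fct_order.sql: one row per ride-hailing order (illustrative sketch)
select
    order_key,
    order_city_id,
    order_created_ts,
    user_id,
    campaign_code,
    order_value_eur,
    order_discount_eur
from {{ ref('stg_order') }}  -- hypothetical staging model
```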
On the dbt side, it’s again a table with columns, but there are two kinds of columns. There are “dimensions”, i.e. columns that help us identify and describe the row (order_key, order_city_id, order_created_ts, user_id and campaign_code). And there are numeric columns that provide us with measurable information about our business (order_value_eur, order_discount_eur).
Before we see how this would look on the Looker side, let’s introduce metrics. We want to track metrics like total revenue before discounts, total discounts, and total revenue after discounts. We might want to calculate these metrics per city, per user, or per campaign code, so we define these in our semantic layer.
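In dbt’s semantic layer syntax, this could be sketched as a semantic model with measures, plus two simple metrics (entity and dimension names are illustrative):

```yaml
semantic_models:
  - name: fct_order
    model: ref('fct_order')
    defaults:
      agg_time_dimension: order_created_date
    entities:
      - name: order
        type: primary
        expr: order_key
      - name: city
        type: foreign
        expr: order_city_id
      - name: user
        type: foreign
        expr: user_id
    dimensions:
      - name: order_created_date
        type: time
        type_params:
          time_granularity: day
        expr: date(order_created_ts)
      - name: campaign_code
        type: categorical
    measures:
      - name: order_value_eur
        agg: sum
      - name: order_discount_eur
        agg: sum

metrics:
  - name: total_order_value_eur
    label: "Total order value (EUR)"
    description: "Total value of orders, before discounts."
    type: simple
    type_params:
      measure: order_value_eur
  - name: total_order_discount_eur
    label: "Total order discounts (EUR)"
    description: "Total discounts applied to orders."
    type: simple
    type_params:
      measure: order_discount_eur
```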
We also define a derived metric to calculate the revenue after discounts:
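```yaml
# A sketch of the derived metric (names follow the examples above)
metrics:
  - name: total_net_order_value_eur
    label: "Total net order value (EUR)"
    description: "Total value of orders, after discounts."
    type: derived
    type_params:
      expr: total_order_value_eur - coalesce(total_order_discount_eur, 0)
      metrics:
        - name: total_order_value_eur
        - name: total_order_discount_eur
```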
In fact, defining the metrics involves more details that I’m omitting; these have been described here previously. The COALESCE is there for better null handling, something we figured out after that article (see the discussion in metricflow issue #764).
Back to Looker now. If we treated this fact table the same way we treated the dimension table, we’d end up with all the columns becoming dimensions in Looker. Sure, we could filter orders by the specific amount of revenue the order generated for us, but that’s not what we want in most cases. Instead, we want Looker to understand the difference between dimensions and measures. We want the information already provided in the semantic definitions to be reflected in Looker.
We really want a view in Looker where the dimension columns are defined as dimensions and the numeric columns are defined as measures. We also want the derived metric definition to be available as a measure. Sketched in LookML (simplified, and not our actual generated code):
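```lookml
view: fct_order {
  sql_table_name: core.fct_order ;;

  # Dimension columns stay dimensions
  dimension: order_key {
    primary_key: yes
    type: string
    sql: ${TABLE}.order_key ;;
  }

  dimension: order_city_id {
    type: number
    sql: ${TABLE}.order_city_id ;;
  }

  dimension: campaign_code {
    type: string
    sql: ${TABLE}.campaign_code ;;
  }

  dimension_group: order_created {
    type: time
    timeframes: [date, week, month, year]
    sql: ${TABLE}.order_created_ts ;;
  }

  # Numeric columns become measures, named after the metrics
  measure: total_order_value_eur {
    type: sum
    description: "Total value of orders, before discounts."
    sql: ${TABLE}.order_value_eur ;;
  }

  measure: total_order_discount_eur {
    type: sum
    description: "Total discounts applied to orders."
    sql: ${TABLE}.order_discount_eur ;;
  }

  # The derived metric becomes a measure built from the two above
  measure: total_net_order_value_eur {
    type: number
    description: "Total value of orders, after discounts."
    sql: ${total_order_value_eur} - coalesce(${total_order_discount_eur}, 0) ;;
  }
}
```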
We’re still figuring out how to do derived metrics that depend on different upstream tables, but we’re sure it’s not a huge issue.
Data marts
A typical data mart is an aggregation on a certain granularity, optionally with additional attributes that usually belong to the entities that make up the specific granularity. To unpack this complicated sentence, let’s look at our city + date grain data mart. The definition of this data mart is stored in a saved query, sketched here with illustrative grouping keys and export name:
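```yaml
saved_queries:
  - name: order_metrics_by_city_date
    description: "Order metrics on the city + date grain."
    query_params:
      metrics:
        - total_order_value_eur
        - total_order_discount_eur
        - total_net_order_value_eur
      group_by:
        - "TimeDimension('metric_time', 'day')"
        - "Entity('city')"
        # extra attributes for easy filtering (assumes a city semantic model)
        - "Dimension('city__city_name')"
        - "TimeDimension('metric_time', 'year')"
    exports:
      - name: mart_order_city_date
        config:
          export_as: table
```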
In human language, this reads as “calculate the revenue and discount metrics per city and date, also adding the city name and calendar year for easy filtering”. In SQL, this reads as something like the following sketch (join keys are illustrative):
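```sql
select
    date(o.order_created_ts) as order_created_date,
    o.order_city_id,
    c.city_name,
    year(o.order_created_ts) as calendar_year,
    sum(o.order_value_eur) as total_order_value_eur,
    sum(o.order_discount_eur) as total_order_discount_eur,
    sum(o.order_value_eur) - coalesce(sum(o.order_discount_eur), 0)
        as total_net_order_value_eur
from fct_order as o
left join dim_city as c
    on o.order_city_id = c.city_id
group by 1, 2, 3, 4
```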
What about Looker, then? Since we’ve already pre-calculated these metrics on the city + date grain, do we need to calculate them again in Looker? Not necessarily. We can expose the pre-calculated data mart to Looker.
From a dbt perspective, the data mart is still a table with columns, so we could take the same approach as with dimension tables and let all the columns become dimensions in Looker. But we want to make the data mart even more convenient for Looker users. Yes, we can get the metric values on the city + date grain, but we can just as well use this data mart to aggregate further, for example to the monthly level. There are fewer rows to read than in the fact table, so the query will be faster.
We can do further aggregation because we have additive metrics in this specific data mart (as opposed to non-additive metrics like averages or distinct counts, which I can discuss in yet another separate article). We want the metric columns to become measures in the Looker view:
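```lookml
# A sketch of the mart view, reusing names from the earlier examples
view: mart_order_city_date {
  sql_table_name: core.mart_order_city_date ;;

  dimension: order_city_id {
    type: number
    sql: ${TABLE}.order_city_id ;;
  }

  dimension: city_name {
    type: string
    sql: ${TABLE}.city_name ;;
  }

  dimension_group: order_created {
    type: time
    timeframes: [date, month, year]
    datatype: date
    sql: ${TABLE}.order_created_date ;;
  }

  # Metric columns become measures that re-aggregate the pre-aggregated values
  measure: total_order_value_eur {
    type: sum
    description: "Total value of orders, before discounts."
    sql: ${TABLE}.total_order_value_eur ;;
  }

  measure: total_order_discount_eur {
    type: sum
    description: "Total discounts applied to orders."
    sql: ${TABLE}.total_order_discount_eur ;;
  }

  measure: total_net_order_value_eur {
    type: sum
    description: "Total value of orders, after discounts."
    sql: ${TABLE}.total_net_order_value_eur ;;
  }
}
```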
There are some interesting things to note here:
We keep the naming convention for the metrics/measures;
We reuse the metric descriptions;
The measures’ actual definition is different from what we previously had in the fact table: we’re simply summing the column from the pre-aggregated data mart, not from the original fact table.
There are different ways to achieve this outcome in Looker, based on what we define in the dbt code base. We can start by manually tagging columns that we want to become measures instead of dimensions. We can specify the kind of aggregation we want (we’ve used sum here, but in some cases, we might prefer something else). We can use the saved query definition to catch the columns we want to become measures. We can make use of the metric definitions or their naming conventions.
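For illustration, such tagging could live in the column’s meta config in dbt; the looker key and its contents below are invented, since the actual rules live in the sync configuration:

```yaml
models:
  - name: mart_order_city_date
    columns:
      - name: total_order_value_eur
        meta:
          looker:          # hypothetical key, read by the sync process
            measure: sum   # emit a sum measure instead of a dimension
```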
There are also ways to make it convenient and easy for Looker users to switch between metrics stored in pre-aggregated data marts and metrics exposed directly on top of the fact tables. We’ll keep working on finding what’s convenient for users.
Explores and models
The previous section described the kinds of views we could create using our dbt models. Sure, they make sense, but do they really expose the data for Looker users? Not yet.
There are two more steps needed for Looker users to be able to query the data from Looker. We need to use the views in explores, and those explores need to be used in models.
Single table explores
The simplest way to expose a table to users is to create a “blank” exploration.
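In LookML, a “blank” explore can be as small as this (the include path follows the synced-folder layout described later):

```lookml
include: "/dbt_sync/dim_date.view.lkml"

explore: dim_date {}
```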
With this, we’ve taken what we have in the dim_date view file and let people explore data based on the view definition. We’ve kept the naming 1:1 on purpose to make it evident that we’re exposing a specific table from dbt.
Curated explores
With many tables, we’d like to share a bit more context. The owners of dbt models can curate the explorations based on their models. They can add appropriate joins, which is especially relevant for fact tables that we often want to join to dimension tables. We can ensure efficient data reading by adding mandatory filters or making sure partition keys are used in joins. We can also add valuable filters, parameters, etc. Here’s a sketch of what a curated explore of our ride-hailing order model could look like, with illustrative join keys and filter values:
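```lookml
include: "/dbt_sync/fct_order.view.lkml"
include: "/dbt_sync/dim_city.view.lkml"

explore: fct_order {
  join: dim_city {
    type: left_outer
    relationship: many_to_one
    sql_on: ${fct_order.order_city_id} = ${dim_city.city_id} ;;
  }

  # Force a date filter so that only the needed partitions are read
  always_filter: {
    filters: [fct_order.order_created_date: "last 30 days"]
  }
}
```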
Models
We try to keep our explorations in separate files. As a final step, we include them in a model to expose them to users.
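A sketch of such a model file (the connection name and include paths are illustrative):

```lookml
connection: "presto"

# Pull in the synced views and the curated explore files
include: "/dbt_sync/*.view.lkml"
include: "/explores/*.explore.lkml"
```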
We operate in a hub-and-spoke mode, so in general, it should be enough to include the core datasets in the hub project, and all the other projects would immediately get access as well. However, it’s perfectly fine to include the same explorations in multiple different models or curate the specific list each model should include.
As a future improvement, we want to figure out how to combine the available tables for convenient use. We might have a detailed fact model, a daily aggregation, and a monthly aggregation, all of which would allow us to calculate the same metric. The Looker user doesn’t necessarily want to know which one their data comes from or which one is the best.
Automation and refinements
After establishing the principles of exposing dbt models to Looker, it becomes evident that the process can be automated. We want our Looker repository to always reflect the latest state of our dbt repository.
At the moment we have a pilot integration with euno.ai that helps us generate those LookML views. Eventually we also want to automate the creation of (initially empty) refinement files and those single table explores.
We configure the sync process from the dbt repository by setting up synchronisation rules, such as which models to sync, where to sync, and how to sync. Every update to the synced models triggers a pull request to the Looker repository, reflecting the changes. We keep a human in the loop to approve the pull requests, but we might eventually drop that step.
Together with the Euno team, we have established a process for how these dbt models appear on the Looker side:
The sync process fully manages the destination (dbt_sync) folder with views, i.e., it will attempt to delete any extra files and overwrite any manual changes made to the view files. This way, we ensure a “clean” representation of what has been defined on the dbt side.
To add any business logic on the Looker side, we refine the views. This can include Looker-specific additions like parameters, but it can also include additional business logic (more about that in the next section).
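For example, a refinement file could sit outside the synced folder and add a Looker-specific dimension on top of the generated view (the is_weekend dimension is an invented example):

```lookml
include: "/dbt_sync/dim_date.view.lkml"

# Refine the synced view without touching the generated file
view: +dim_date {
  dimension: is_weekend {
    type: yesno
    sql: day_of_week(${TABLE}.calendar_date) in (6, 7) ;;
  }
}
```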
We expose every table as an explore to ensure it’s available and not affected by any additional logic added in Looker (e.g., fan-out joins that can create confusion).
We allow table owners to curate their explores (either directly or via refinements; this is up for discussion). As described above, curation enforces best practices that are known to the table owners but might be more difficult for users to get right.
We allow the explores to be included in multiple models.
We allow the explores to be further refined or extended in a specific context, e.g. in a specific model.
Sustainability
When setting long-term goals and focus areas for the analytics engineering function, I could summarise them in one word: sustainability. What sets sound analytics engineering apart is that the solutions are reliable and easy to maintain in the longer-term context of an ever-changing business.
So, how is this structure sustainable? Not everything needs to be part of our core data model, and not everything starts as an essential business concept. By allowing logic to be added on top of our core data, we give data users the freedom to experiment and move quickly in the environment they’re familiar with. When this additional business logic becomes widely used and is ready to become part of the core, we can “shift left”: move the logic from Looker to dbt. Making the shift left is relatively easy because the new logic has been built on reusable core logic.
Afterthoughts
We fully acknowledge there are probably more efficient ways to manage a company’s data. Remember that we didn’t start unifying our business logic from scratch. By the time we started worrying about the single source of truth, we were already heavy users of dbt and Looker: we had more than 1K models in dbt, more than 4K datasets in Looker and about 15K charts built on top of these. Our tooling will evolve over the years, but we currently have a combination of dbt and Looker, and systematically figuring out how we use them together will help us, at least in the near future.

Acknowledgements: I did not figure out all the concepts and details myself. I’ve had numerous brainstorming sessions within Bolt and with the team at Euno. An especially big thanks to Eyal Firstenberg, Pawel Kupidura, Larisse Pinto, Bram Verhoef, Pedro Paiva, Chirag Agarwal and Erik Heintare for all of your thoughts on the topic.
Join us!
Bolt is a place where you can grow professionally at lightning speed and create a real impact on a global scale.
Take a look at our careers page and browse through hundreds of open roles, each offering an exciting opportunity to contribute to making cities for people, not cars.
If you’re ready to work in an exciting, dynamic, fast-paced industry and are not afraid of a challenge, we’re waiting for you!