Firstly, it’s worth putting my cards on the table and admitting that I am a staunch Kimballite. Dimensional models, star schemas, Kimball method, call it what you will, it’s the data modelling approach that I cut my teeth on. I sometimes think that my brain now thinks in facts and dimensions.
As someone who has predominantly worked in the Business Intelligence space, dimensional modelling has been consistently the best fit for the use cases I have been involved in.
I think I’m less dogmatic than I used to be. Joe Reis has often talked about Mixed Model Arts and I certainly see the value in being multi-disciplinary. A dimensional model should never be the only answer and it won’t always be the best answer either.
But given the work that I do, it ends up being the approach I use most often (in fact this feels somewhat chicken and egg… if it’s a scenario where a dimensional model isn’t the right approach, it’s likely a use case that I’m not best suited for).
Aren’t Star Schemas dead?
Of course dimensional models have come in for a fair amount of hostility in their time, and have been declared dead on umpteen occasions (see exhibits one, two and three).
But for every obituary written there’s another thought piece that will tell you that dimensional models are still thriving.
Simon Whiteley penned this piece a few years back. Piethein Strengholt talks about “Kimball-style star schema” as the best practice for the Gold Layer of a lakehouse here.
Vendors like to get in on the act too. Databricks recognise star schemas as a valid approach on their platform, and of course my baby, Power BI, has guidance on why using star schema is important.
Crisis of faith
Despite the evidence presented above that dimensional models are alive and well, I think it is important sometimes to question your own beliefs. And the question did come up recently, when a colleague was asked to justify to a client why a dimensional model is a good choice.
Unfortunately just saying “because so and so says so” doth butter no parsnips.
And I have to admit, for me, using a star schema has become such a force of habit, that maybe I’ve also lost track of the rationale for it.
So this has prompted me to revisit some of the source material and to reacquaint myself with why we do dimensional modelling.
What does Kimball say?
Despite being the definitive guide to dimensional modelling, tracking down a concise reference to why you should use the approach is actually quite hard. There are sections about the purpose of Data Warehousing, such as to isolate operational and analytical environments.
There is a list of requirements for a DW/BI system:
simple and fast
consistent
adaptive to change
timely
secure
trustworthy
embraced
That’s a great list, but I’d argue that you can fulfil those requirements without a dimensional model. Likewise, a dimensional model done poorly probably won’t tick all of those boxes.
A recurring theme though keeps going back to the first bullet - simple and fast. This is best captured in the introduction of the 3rd edition of The Data Warehouse Toolkit where it describes dimensional models as having “goals of user understandability and query performance”.
Honestly, I left Kimball feeling a little underwhelmed by his justifications for his approach, often boiling down to “dimensional modeling has been broadly accepted as the dominant technique” i.e. everyone else is doing it, so you should too.
Star Schema: The Complete Reference
My other go-to recommendation for dimensional modelling guidance is Star Schema: The Complete Reference by Christopher Adamson. Alas, I found a similar outcome in that I couldn’t find a definitive why. Sure, benefits similar to Kimball’s are revisited and listed. Again, this mostly focusses on why data modelled for analytics is better than simply querying operational data.
I’m sure most people are already sold on that, but it still doesn’t really address why a star schema specifically is a good choice.
Other Sources
I don’t intend to question the value of either of those two seminal texts. They are incredibly valuable contributions and cover the how and what of dimensional modelling comprehensively. You should own and read both of them.
Likewise, you should own Agile Data Warehouse Design by Lawrence Corr - for me, this book has always been more about the requirements gathering aspect of dimensional models, as well as patterns to implement them in an agile and iterative way. Nevertheless, it does cover some fundamentals and benefits of the approach. And the emphasis echoes what I’ve already discussed - simplicity and speed of querying.
Other places I looked for answers were Fundamentals of Analytics Engineering (De Wilde et al) and Fundamentals of Data Engineering (Housley & Reis).
Both of these give comparisons of Inmon, Kimball and Data Vault approaches.
My interpretation of both comparisons is that Inmon and Data Vault both cover integration, historic data capture and flexibility well. In fact, they don’t even necessarily compete against star schemas: a recommended approach is often to have a star schema sitting on top of either of these as the reporting layer, with Inmon or Data Vault serving as the integration layer beneath it.
Again, this is because star schema is simpler and quicker to query than those underlying integration layers.
The Big Fat Elephant in the room
All of the above could lead to the conclusion that dimensional models thrive when simplicity and speed are top of mind for your data platform requirements.
Except I’d argue that there is an alternative that is even simpler and faster… that being Big Fat Tables aka One Big Table (or OBT).
Before the advent of column store formats, the dimensional approach to modelling made a lot of sense, because thin fact tables meant scanning less data for large volume tables. But column store formats negate that advantage, as the query plan only takes into account the columns required for a given query, not whole rows.
If the argument is that star schema is quicker than fully normalised data structures because it reduces the number of joins required by a query, then the ultimate optimisation is surely to eliminate joins altogether and have everything in one big table?
Not only is that likely quicker and more performant than star schema, but surely easier for end users to comprehend too?
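To make that comparison concrete, here’s a toy sketch using Python’s built-in sqlite3 module (the table and column names are my own invention, not from any particular warehouse). The same “sales by region” question needs a join against the star schema, but no join at all against the OBT:

```python
import sqlite3

# Toy data: a thin fact table plus a customer dimension (star schema),
# and the same rows flattened into one big table (OBT).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    CREATE TABLE obt_sales (name TEXT, region TEXT, amount REAL);

    INSERT INTO dim_customer VALUES (1, 'Alice', 'North'), (2, 'Bob', 'South');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
    INSERT INTO obt_sales VALUES
        ('Alice', 'North', 100.0), ('Alice', 'North', 50.0), ('Bob', 'South', 75.0);
""")

# Star schema: sales by region needs a join to the dimension...
star = con.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.region ORDER BY d.region
""").fetchall()

# ...while the OBT answers the same question with zero joins.
obt = con.execute("""
    SELECT region, SUM(amount) FROM obt_sales GROUP BY region ORDER BY region
""").fetchall()

print(star)  # [('North', 150.0), ('South', 75.0)]
print(obt)   # identical result, one table, no joins
```

Same answer both ways; the OBT query is simply one table scan.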
So does OBT eliminate the advantages of star schema?
The Goldilocks Effect
Hold your horses!
One thing we haven’t really discussed is maintenance and reusability. If OBT becomes your modelling approach of choice, you may well end up with model sprawl - a term largely associated with the proliferation of dbt!
Each new business question potentially becomes a new OBT. Before you know it you have hundreds if not thousands of OBTs scattered around your estate. And what happens when there is a change in business logic for a particular measure or attribute? Or a new descriptive field or characteristic needs adding to a certain concept? Now you have all those OBTs that need updating to include customer shoe size…
In a dimensional model, it’s just one update to your customer dimension that needs to be made, and then every single fact table that joins to your customer dimension inherits a new attribute that allows you to slice and dice your data.
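As a sketch of that propagation (again a toy sqlite3 example with invented names), adding shoe size is a single change to the conformed customer dimension, and every fact table that joins to it picks the attribute up without being touched:

```python
import sqlite3

# Two fact tables sharing one conformed customer dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    CREATE TABLE fact_returns (customer_key INTEGER, qty INTEGER);
    INSERT INTO dim_customer VALUES (1, 'Alice');
    INSERT INTO fact_sales VALUES (1, 100.0);
    INSERT INTO fact_returns VALUES (1, 2);
""")

# One change, made once, to the dimension only.
con.execute("ALTER TABLE dim_customer ADD COLUMN shoe_size INTEGER")
con.execute("UPDATE dim_customer SET shoe_size = 9 WHERE customer_key = 1")

# Both fact tables can now slice by the new attribute, unchanged.
sales = con.execute("""
    SELECT d.shoe_size, SUM(f.amount) FROM fact_sales f
    JOIN dim_customer d USING (customer_key) GROUP BY d.shoe_size
""").fetchall()
returns = con.execute("""
    SELECT d.shoe_size, SUM(f.qty) FROM fact_returns f
    JOIN dim_customer d USING (customer_key) GROUP BY d.shoe_size
""").fetchall()

print(sales)    # [(9, 100.0)]
print(returns)  # [(9, 2)]
```

With an OBT per business question, that same change would be repeated once per table.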
Still, star schema isn’t optimal here either. A subject close to home for me might be an address change and the redrawing of county boundaries. I grew up in South Humberside, but before I left home, the local authority changed to be North East Lincolnshire. Reflecting that change in my data warehouse would still mean updating thousands of rows, one for every address stored, so star schemas still carry some element of data redundancy.
If a normalised structure was used instead, there would only be one county record to update. But then that re-introduces the daisy chain of table relationships that will hit query performance.
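That trade-off can be sketched the same way (another toy sqlite3 example with invented names): renaming the county touches every row of a denormalised address dimension, but only a single row when the county lives in its own normalised table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Denormalised: the county name is repeated on every address row.
    CREATE TABLE dim_address (address_key INTEGER PRIMARY KEY, street TEXT, county TEXT);
    -- Normalised: the county is held once and referenced by key.
    CREATE TABLE county (county_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (address_key INTEGER PRIMARY KEY, street TEXT, county_key INTEGER);
""")
con.executemany("INSERT INTO dim_address VALUES (?, ?, 'South Humberside')",
                [(i, f"{i} High St") for i in range(1000)])
con.execute("INSERT INTO county VALUES (1, 'South Humberside')")
con.executemany("INSERT INTO address VALUES (?, ?, 1)",
                [(i, f"{i} High St") for i in range(1000)])

# Renaming the county rewrites every denormalised address row...
denorm_rows = con.execute(
    "UPDATE dim_address SET county = 'North East Lincolnshire'").rowcount

# ...but only one row in the normalised design.
norm_rows = con.execute(
    "UPDATE county SET name = 'North East Lincolnshire'").rowcount

print(denorm_rows, norm_rows)  # 1000 1
```

The price of that single-row update, of course, is the extra join back through the county table at query time.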
All of the above doesn’t even consider the BAU compute costs of processing so many standalone tables.
So for me, star schema represents the goldilocks option. It gives me the right balance of simplicity, performance, flexibility and maintenance overhead versus other approaches.
It’s a solid all-rounder. It may not be best in class in any of those categories, but it performs comparatively well across all of them. Someone once described Excel as being the second best tool for any job, and that’s why it’s so well adopted. Maybe star schemas occupy a similar space.
Tools for the job
The other factor that I think plays a part is your tooling. This kind of feels a bit chicken and egg too… are BI tools built to support star schemas because they are the dominant approach, or do we build star schemas because that’s what BI tools expect? Maybe we’ll never know, but certainly the aforementioned Power BI works best with star schema. But see also Thoughtspot and Databricks Genie rooms, which also recommend star schema as the optimal data structure to sit under these.
If a job needs doing…
So, having trawled through my catalogue of dimensional modelling related literature, I’m not sure that I came up with a great answer for why we dimensional model.
But I did have to present a slide on this exact question recently. And so this is what I came up with:
Why do we dimensional model?
To simplify complex data: dimensional models restructure operational data into easily understandable formats, using facts (measurable data) and dimensions (descriptive data). This can enhance user accessibility. Business users can understand and navigate the model more easily because it's designed around how they think about data (e.g., sales by region and time).
To improve query performance: Dimensional models are denormalised, which reduces the number of joins and speeds up queries. Yet they still maintain flexibility and maintainability by not being too denormalised. There should still be one version of each business concept.
Because tooling demands it: To make best use of your technology choices, some applications will expect data in a star schema format.
Final Thoughts
So, yes, I’m still a dimensional model fan - it will continue to be my go-to, and if you want any help with dimensionally modelling your data, you know where I am.
Is it the be all and end all? No - depending on what you’re looking to optimise for, other approaches may suit you better and indeed it’s possible that you may even use a combination of techniques at different layers in your stack.
I still see dimensional models having a long term future.