Data Modeling Fundamentals
Data modeling is the practice of organizing data for storage and access. The fundamentals — fact tables, dimensions, normalization, denormalization — have been stable for decades. The implementations evolve; the concepts don't.
This page covers the core concepts and the choices for different workloads.
OLTP vs. OLAP
The fundamental distinction:
OLTP (Online Transaction Processing)
- Operational systems: orders, users, sessions
- Many small reads and writes
- Strong consistency
- Highly normalized: minimize redundancy
The order is created once; read once or twice; updated occasionally. Data lives in a relational database (PostgreSQL, MySQL).
OLAP (Online Analytical Processing)
- Analytics: revenue analysis, user cohorts, reporting
- Few large reads
- Aggregations across millions of rows
- Highly denormalized: optimize for read performance
The data is read many times in different aggregations. Data lives in a warehouse (Snowflake, BigQuery, Redshift).
The two have different shapes. OLTP normalizes; OLAP denormalizes.
Normalization
Splitting data across tables to eliminate redundancy.
1NF, 2NF, 3NF
Database normalization rules:
- **1NF**: atomic values (no arrays in cells)
- **2NF**: 1NF + non-key columns depend on the whole key
- **3NF**: 2NF + non-key columns depend only on the key
Most OLTP databases are in 3NF (or close to it).
Why normalize OLTP
- **Integrity**: update once; consistent everywhere
- **Storage**: less redundancy
- **Updates**: clean
Why denormalize OLAP
- **Performance**: aggregations across large tables benefit from co-located data
- **Simplicity**: fewer joins for common queries
- **Read-only**: don't need update integrity
Star schema
The classic OLAP pattern:
Fact tables
Numerical measurements. Each row is an event or transaction.
```
fct_orders
├── order_id (key)
├── customer_id (FK to dim_customer)
├── product_id (FK to dim_product)
├── date_id (FK to dim_date)
├── amount (measure)
└── quantity (measure)
```
Columns are foreign keys to dimensions plus measures (the numbers).
Dimension tables
Descriptive context.
```
dim_customer
├── customer_id (key)
├── name
├── signup_date
├── tier
└── city
dim_product
├── product_id (key)
├── name
├── category
├── brand
└── price
```
The "star" comes from the diagram: fact in center, dimensions around it.
Why it works
- Joins are simple (always fact → dimensions)
- Aggregations are fast
- Dimensions are denormalized (one row per customer with all attributes)
- Fact tables can be enormous; dimensions stay manageable
Snowflake schema
Like star, but dimensions are normalized further.
```
dim_customer → dim_city → dim_country
```
Pros: less storage; cleaner.
Cons: more joins.
For modern warehouses (cheap storage, fast joins), star usually wins. Snowflake schemas show up in older OLAP designs.
Slowly changing dimensions (SCD)
Dimensions change over time. A customer's tier upgrades. Their address changes.
How to handle the change:
Type 1: overwrite
Just update. No history.
Type 2: add new row
Old row marked end-of-life; new row created with new values. History preserved.
```
customer_id | name | tier | valid_from | valid_to | is_current
1 | Alice | bronze | 2020-01-01 | 2022-06-01 | false
1 | Alice | gold | 2022-06-01 | (null) | true
```
For analytics that care about historical state, Type 2 is essential. "What tier was Alice when she made this purchase?"
Type 3: separate column for previous value
Limited history (only the previous state).
For most warehouses, Type 2 is the default for dimensions where history matters.
Wide vs. tall
A choice for fact tables:
Wide (one column per measurement)
```
date | revenue | costs | profit | margin
```
Easy queries; many columns; harder to add new measurements.
Tall (one row per measurement)
```
date | metric | value
```
Easy to extend; harder for some queries (need pivots).
For operational metrics, tall is flexible. For business KPIs with stable structure, wide is clearer.
Modern adjustments
Cloud warehouses change some traditional advice:
Storage is cheap
Denormalize aggressively. Don't pre-aggregate to save space; let warehouses store the detail.
Joins are fast
Normalization concerns matter less. Star schemas in modern warehouses join quickly.
Semi-structured data
JSON columns in PostgreSQL, BigQuery, Snowflake. Sometimes the right answer; often a sign of underbaked modeling.
dbt-style layers
The staging-intermediate-marts pattern. Each layer applies more business logic; all transformations are SQL.
When to model up front vs. iteratively
Up front
For mature business domains where you understand the entities. Designing OLTP schemas for a known domain.
Iteratively
For analytics where requirements emerge. Start with raw data; build dimensions and facts as needed.
For warehouses, iterative usually wins. The transformation layer in dbt makes restructuring relatively cheap.
Common failure patterns
- **OLTP schema for OLAP queries.** Slow aggregations.
- **Pre-aggregating in OLTP.** Loses detail; doesn't scale to new questions.
- **JSON for everything.** Hard to query; inefficient at scale.
- **No SCD strategy.** Lose history; can't answer "what was the value at this time."
- **One big wide table.** Hard to update; query patterns get complex.
- **No documentation.** Future engineers don't know what tables represent.
Further Reading
- [EtlVsElt](EtlVsElt) — How transformations land in models
- [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering) — Tool for modeling
- [DataPipelineDesign](DataPipelineDesign) — Pipelines that build models
- [JpaAndHibernatePatterns](JpaAndHibernatePatterns) — OLTP modeling perspective
- [DataEngineering Hub](DataEngineeringHub) — Cluster index