Data Catalog Tools

A data catalog is a searchable inventory of metadata about data: what datasets exist, what's in them, who owns them, how they're used, and what they connect to. For organizations with dozens of data systems and many users, catalogs are essential. For smaller setups, they are often overkill.

This page covers what catalogs actually do and when they're worth deploying.

What a catalog provides

Discovery

"Does this data exist somewhere?" "Who else has worked with customer churn data?" Without a catalog, finding existing data depends on tribal knowledge.

Lineage

"Where does the customer_revenue table come from?" Trace upstream to sources. Trace downstream to dashboards and reports.

For debugging ("why did this number change?") and impact analysis ("if I change this, what breaks?"), lineage is critical.
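As a rough illustration of the upstream and downstream tracing described above, lineage can be modeled as a directed graph and walked in either direction. This is a minimal sketch, not how any particular catalog stores lineage; every dataset name other than customer_revenue is hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets in its list.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["customer_revenue"],
    "staging.customers": ["customer_revenue"],
    "customer_revenue": ["dashboard.revenue_by_region", "report.quarterly_revenue"],
}


def invert(edges):
    """Build the upstream view (child -> parents) from downstream edges."""
    upstream = {}
    for parent, children in edges.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM = invert(DOWNSTREAM)

# Debugging: where does customer_revenue come from?
sources = reachable("customer_revenue", UPSTREAM)

# Impact analysis: what breaks if staging.orders changes?
impacted = reachable("staging.orders", DOWNSTREAM)
```

The same traversal answers both questions; only the direction of the edges differs.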

Documentation

What does each column mean? What's the granularity? What's the freshness? Catalogs centralize this.
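One way to picture a centralized documentation record is a small structure holding column meanings, granularity, and freshness together. The class and field names below are assumptions made for illustration, not the schema of any specific catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDoc:
    name: str
    description: str = ""


@dataclass
class DatasetDoc:
    name: str
    grain: str       # what one row represents
    freshness: str   # how often the data is refreshed
    columns: list[ColumnDoc] = field(default_factory=list)

    def undocumented_columns(self) -> list[str]:
        """Names of columns that still have no description."""
        return [c.name for c in self.columns if not c.description.strip()]


doc = DatasetDoc(
    name="customer_revenue",
    grain="one row per customer per month",
    freshness="daily",
    columns=[
        ColumnDoc("customer_id", "Stable internal customer identifier"),
        ColumnDoc("month", "First day of the calendar month"),
        ColumnDoc("revenue"),  # not yet documented
    ],
)
gaps = doc.undocumented_columns()
```

A check like `undocumented_columns` makes documentation gaps visible instead of leaving them to be discovered by a confused reader.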

Ownership

Who owns this dataset? Who do I ask if it's broken? Without explicit ownership, data is everyone's problem and nobody's.

Usage

Who's querying this dataset? How often? Useful for understanding what to deprecate, what to optimize.
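A sketch of how usage metadata might be derived, assuming query logs have already been parsed into (user, dataset) pairs. The log rows and dataset names are invented for the example.

```python
from collections import Counter

# Hypothetical parsed query-log rows: (user, dataset queried)
query_log = [
    ("ana", "customer_revenue"),
    ("ben", "customer_revenue"),
    ("ana", "customer_revenue"),
    ("ana", "orders_daily"),
]
known_datasets = {"customer_revenue", "orders_daily", "legacy_signups"}

# How often is each dataset queried?
query_counts = Counter(dataset for _, dataset in query_log)

# How many distinct people query each dataset?
distinct_users = {
    d: len({user for user, ds in query_log if ds == d}) for d in known_datasets
}

# Datasets nobody queried are candidates for deprecation.
unused = sorted(known_datasets - set(query_counts))
```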

Governance

Tags for sensitivity (PII, confidential), retention rules, access controls. Often integrated with security systems.
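To make the tagging idea concrete, here is a minimal sketch of sensitivity tags driving an access decision. The columns, roles, and clearance model are assumptions; real deployments usually delegate enforcement to the warehouse or a security system.

```python
# Hypothetical sensitivity tags per column
COLUMN_TAGS = {
    "customers.email": {"PII"},
    "customers.signup_date": set(),
    "contracts.discount_rate": {"confidential"},
}

# Hypothetical clearances: the tags each role is allowed to see
ROLE_CLEARANCE = {
    "analyst": set(),
    "finance": {"confidential"},
    "privacy_officer": {"PII", "confidential"},
}


def can_read(role, column):
    """A role may read a column only if it is cleared for every tag on it."""
    return COLUMN_TAGS[column] <= ROLE_CLEARANCE[role]


def columns_tagged(tag):
    """List every column carrying the given tag, e.g. for a PII audit."""
    return sorted(col for col, tags in COLUMN_TAGS.items() if tag in tags)
```

For example, `can_read("analyst", "customers.email")` is false because the analyst role has no PII clearance, while `columns_tagged("PII")` gives the list an auditor would start from.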

The major tools

DataHub (LinkedIn)

Open-source. Comprehensive: lineage, documentation, ownership, usage. Active community.

Strengths: full-featured; flexible.

Weaknesses: complex to deploy and operate.

Amundsen (Lyft)

Open-source. Search-first design. Lighter than DataHub.

Strengths: simpler; easier to deploy.

Weaknesses: less feature-rich.

Atlan

Commercial SaaS. Modern UX; strong on collaboration. Popular with analytics teams.

Alation

Commercial. Enterprise-focused. Strong governance features.

Collibra

Commercial. Heavy governance focus. Common in regulated industries.

Apache Atlas

Open-source. Hadoop-era origins; still used in some stacks.

dbt Cloud / dbt docs

Lightweight catalog scoped to dbt models. Sufficient for dbt-centric teams.

When a catalog is essential

- 50+ datasets across multiple systems

- Many users (analysts, data scientists, engineers) consuming data

- Compliance requirements (data sensitivity tracking)

- Multi-team data ownership

- Lineage tracking for impact analysis

When it's overkill

- Small data team (<5 people) with shared knowledge

- Single warehouse with stable structure

- dbt's built-in docs are sufficient

- Adopting a catalog would take more effort than the value it provides

The honest reality: many data catalogs become shelfware. They are deployed, then neither maintained nor used. The metadata goes stale, users stop trusting it, and the catalog adds nothing.

What makes catalogs work in practice

Automated metadata extraction

Manual catalog maintenance fails. The catalog must pull metadata from sources automatically:

- Schema from databases

- Lineage from dbt, Airflow

- Usage from query logs

- Ownership from tags or org structure
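The first item in the list above, pulling schema from a database, can be sketched with SQLite from the Python standard library. This is only an illustration of the idea: production catalogs use their own connectors, and the table created here is a stand-in.

```python
import sqlite3


def extract_schema(conn):
    """Return {table: [(column, declared_type), ...]} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in cols]
    return schema


# Stand-in database with one table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_revenue (customer_id INTEGER, month TEXT, revenue REAL)"
)
schema = extract_schema(conn)
```

Run on a schedule, a function like this keeps the catalog's view of the schema current without anyone typing it in.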

Integrated, not separate

The catalog should integrate with daily tools:

- IDE plugins for SQL editors

- Slack notifications for schema changes

- Dashboard tool integration

A catalog that requires people to leave their tools is rarely used.
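A schema-change notification starts with a diff between two schema snapshots. The sketch below builds the message text only; actually posting it to Slack is left out, and the column names are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and describe what changed."""
    changes = []
    for col in sorted(old.keys() - new.keys()):
        changes.append(f"removed column {col}")
    for col in sorted(new.keys() - old.keys()):
        changes.append(f"added column {col} ({new[col]})")
    for col in sorted(old.keys() & new.keys()):
        if old[col] != new[col]:
            changes.append(f"changed {col}: {old[col]} -> {new[col]}")
    return changes


# Hypothetical snapshots taken before and after a deployment
old = {"customer_id": "INTEGER", "revenue": "REAL", "region": "TEXT"}
new = {"customer_id": "INTEGER", "revenue": "NUMERIC", "segment": "TEXT"}

changes = diff_schema(old, new)
message = "customer_revenue schema changed:\n" + "\n".join(f"- {c}" for c in changes)
```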

Active stewardship

Datasets need owners who maintain documentation. The catalog tracks accountability; the work is human.

Search that actually works

Most users reach the catalog through search. If search is bad, the catalog goes unused.
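A toy ranking function shows the kind of behavior users expect from catalog search: matches on the dataset name should outrank matches buried in a description. The entries and the weighting are assumptions; real catalogs typically sit on a dedicated search engine.

```python
import re

# Hypothetical catalog entries
ENTRIES = [
    {"name": "customer_churn_monthly",
     "description": "Monthly churn rate per customer segment"},
    {"name": "customer_revenue",
     "description": "Revenue per customer per month"},
    {"name": "support_tickets",
     "description": "Tickets raised by customers, used in churn models"},
]


def tokens(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, entries):
    """Rank entries by matched query terms; a name match counts double."""
    terms = tokens(query)
    scored = []
    for entry in entries:
        score = 2 * len(terms & tokens(entry["name"]))
        score += len(terms & tokens(entry["description"]))
        if score:
            scored.append((score, entry["name"]))
    # Highest score first; ties broken alphabetically for stable output
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


results = search("customer churn", ENTRIES)
```

Note the sketch does no stemming, so "customers" does not match "customer". Gaps like that are exactly what makes search feel broken to users.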

Trust

Users trust the catalog when its information is current and accurate. Trust takes time to build, and a single wrong entry can lose it.

Adoption patterns

For organizations adopting a catalog:

Start with one team

Pilot with the analytics team. Get feedback; iterate. Don't roll out company-wide before the tool works for one team.

Automate from the start

Don't ask people to manually populate the catalog. They won't. Automate metadata extraction.

Define ownership

Every dataset gets an owner before it's catalogued. Without ownership, the catalog has no maintainer.
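The ownership rule can be enforced as a simple gate before ingestion. A minimal sketch, assuming dataset metadata arrives as dictionaries with an optional `owner` key (a made-up shape for illustration):

```python
def missing_owners(datasets):
    """Return names of datasets whose owner is absent or blank."""
    return sorted(
        name
        for name, meta in datasets.items()
        if not (meta.get("owner") or "").strip()
    )


# Hypothetical metadata awaiting ingestion
datasets = {
    "customer_revenue": {"owner": "finance-analytics"},
    "orders_daily": {"owner": ""},
    "legacy_signups": {},
}

blocked = missing_owners(datasets)
```

Anything in `blocked` stays out of the catalog until someone claims it.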

Measure usage

Are people actually using it? Track search counts, page views, and edits. Low usage suggests the catalog isn't delivering value; investigate why.
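The metrics named above can be rolled up per week from an event log. This sketch assumes the catalog exposes events as (day, user, action) tuples, which is an invented shape; the users and dates are also made up.

```python
from collections import Counter
from datetime import date

# Hypothetical catalog events: (day, user, action)
events = [
    (date(2024, 3, 4), "ana", "search"),
    (date(2024, 3, 4), "ana", "page_view"),
    (date(2024, 3, 5), "ben", "search"),
    (date(2024, 3, 6), "ana", "edit"),
    (date(2024, 3, 12), "ana", "search"),
]


def weekly_summary(events):
    """Group events by ISO week: count each action and collect active users."""
    summary = {}
    for day, user, action in events:
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        bucket = summary.setdefault(week, {"actions": Counter(), "users": set()})
        bucket["actions"][action] += 1
        bucket["users"].add(user)
    return summary


summary = weekly_summary(events)
```

A falling count of distinct weekly users is usually a clearer warning sign than raw page views, which a handful of heavy users can inflate.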

Iterate

Catalog tools evolve. The catalog itself should evolve. Add features as needed; remove things nobody uses.

Common failure patterns

- **Catalog without automation.** Manual maintenance fails.

- **Catalog without ownership.** Nobody maintains it.

- **Catalog without search.** Nobody can find anything.

- **Catalog as project.** Deploy and walk away. Becomes shelfware.

- **Buying an enterprise platform when dbt docs would suffice.** Over-engineered.

- **Underfunded catalog initiative.** Half-deployed; never finished.

A reasonable approach

For most organizations:

1. Determine if you actually need a catalog (size, complexity)

2. Start with what you have (dbt docs?) and see if it's enough

3. If a real catalog is needed, prefer open-source (DataHub) or modern SaaS (Atlan)

4. Invest in adoption; don't just deploy

5. Measure usage; cut scope if it's not used

Further Reading

- [DataModelingFundamentals](DataModelingFundamentals) — What you're cataloging

- [MasterDataManagement](MasterDataManagement) — Adjacent governance

- [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering) — dbt's lightweight catalog

- [DataEngineering Hub](DataEngineeringHub) — Cluster index