Data Catalog Tools
A data catalog is an inventory of metadata about an organization's data: what datasets exist, what's in them, who owns them, how they're used, and what they connect to. For organizations with dozens of data systems and many users, a catalog is essential. For smaller setups, it is often overkill.
This page covers what catalogs actually do and when they're worth deploying.
What a catalog provides
Discovery
"Does this data exist somewhere?" "Who else has worked with customer churn data?" Without a catalog, finding existing data is tribal knowledge.
Lineage
"Where does the customer_revenue table come from?" Trace upstream to sources. Trace downstream to dashboards and reports.
For debugging ("why did this number change?") and impact analysis ("if I change this, what breaks?"), lineage is critical.
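Lineage is a directed graph, so both questions reduce to a graph walk. The following is a minimal sketch, assuming lineage is stored as a mapping from each dataset to the datasets it reads from. The edge data and every name other than customer_revenue are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key reads from the datasets in its list.
UPSTREAM = {
    "customer_revenue": ["orders", "customers"],
    "orders": ["raw_orders"],
    "customers": ["raw_customers"],
    "revenue_dashboard": ["customer_revenue"],
    "churn_report": ["customer_revenue", "customers"],
}


def invert(edges):
    """Build the downstream view (source -> consumers) from the upstream view."""
    inverted = {}
    for node, parents in edges.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(node)
    return inverted


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def upstream_of(dataset):
    """Debugging view: everything this dataset is derived from."""
    return reachable(dataset, UPSTREAM)


def downstream_of(dataset):
    """Impact view: everything that could break if this dataset changes."""
    return reachable(dataset, invert(UPSTREAM))
```

Calling `upstream_of("customer_revenue")` answers "where does this come from?", and `downstream_of("customer_revenue")` lists the dashboards and reports to check before changing it.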
Documentation
What does each column mean? What's the granularity? What's the freshness? Catalogs centralize this.
Ownership
Who owns this dataset? Who do I ask if it's broken? Without explicit ownership, data is everyone's problem and nobody's responsibility.
Usage
Who's querying this dataset? How often? Usage data shows which datasets to deprecate and which to optimize.
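As an illustration, usage can be summarized from query-log records. This is a sketch under assumptions: the log format (one record per query with a dataset name and a date), the 90-day window, and the sample data are all hypothetical, and real warehouses expose query history in their own formats.

```python
from collections import Counter
from datetime import date, timedelta


def usage_summary(query_log, known_datasets, today, window_days=90):
    """Count recent queries per dataset and list datasets with none."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        entry["dataset"] for entry in query_log if entry["queried_on"] >= cutoff
    )
    # Datasets with zero recent queries are candidates for deprecation review.
    unused = sorted(name for name in known_datasets if counts[name] == 0)
    return counts, unused


# Hypothetical query-log records.
QUERY_LOG = [
    {"dataset": "customer_revenue", "user": "ana", "queried_on": date(2024, 5, 20)},
    {"dataset": "customer_revenue", "user": "raj", "queried_on": date(2024, 5, 28)},
    {"dataset": "orders", "user": "ana", "queried_on": date(2024, 5, 30)},
    {"dataset": "legacy_export", "user": "raj", "queried_on": date(2023, 11, 2)},
]

counts, unused = usage_summary(
    QUERY_LOG,
    ["customer_revenue", "orders", "legacy_export"],
    today=date(2024, 6, 1),
)
```

A dataset appearing in `unused` is only a prompt to ask its owner, not proof that it can be dropped: some tables are read quarterly or by systems that bypass the logged path.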
Governance
Tags for sensitivity (PII, confidential), retention rules, access controls. Often integrated with security systems.
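One way to picture governance metadata is as tags attached to columns, with access decisions derived from them. The sketch below assumes a simple model in which a reader needs a clearance for every restricted tag present in a dataset. The class names, tag labels, and clearance model are illustrative and not taken from any particular catalog.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Tags treated as restricted in this sketch (assumed label names).
RESTRICTED = {"pii", "confidential"}


@dataclass
class Column:
    name: str
    tags: Set[str] = field(default_factory=set)


@dataclass
class Dataset:
    name: str
    columns: List[Column]
    retention_days: Optional[int] = None  # None means no retention rule recorded


def restricted_tags(dataset):
    """Collect the restricted tags that appear on any column of the dataset."""
    found = set()
    for column in dataset.columns:
        found |= column.tags & RESTRICTED
    return found


def can_read(dataset, clearances):
    """Allow access only if the reader holds every restricted tag present."""
    return restricted_tags(dataset) <= set(clearances)


# Hypothetical dataset: one PII column, kept for a year.
customers = Dataset(
    "customers",
    [Column("id"), Column("email", {"pii"}), Column("segment")],
    retention_days=365,
)
```

In practice the enforcement usually lives in the warehouse or security system. The catalog's role is to hold the tags that those systems act on.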
The major tools
DataHub (LinkedIn)
Open-source. Comprehensive: lineage, documentation, ownership, usage. Active community.
Strengths: full-featured; flexible.
Weaknesses: complex to deploy and operate.
Amundsen (Lyft)
Open-source. Search-first design. Lighter than DataHub.
Strengths: simpler; easier to deploy.
Weaknesses: less feature-rich.
Atlan
Commercial SaaS. Modern UX; strong on collaboration. Popular with analytics teams.
Alation
Commercial. Enterprise-focused. Strong governance features.
Collibra
Commercial. Heavy governance focus. Common in regulated industries.
Apache Atlas
Open-source. Hadoop-era origins; still used in some stacks.
dbt Cloud / dbt docs
Lightweight catalog scoped to dbt models. Sufficient for dbt-centric teams.
When a catalog is essential
- 50+ datasets across multiple systems
- Many users (analysts, data scientists, engineers) consuming data
- Compliance requirements (data sensitivity tracking)
- Multi-team data ownership
- Lineage tracking needed for impact analysis
When it's overkill
- Small data team (<5 people) with shared knowledge
- Single warehouse with stable structure
- dbt's built-in docs are sufficient
- Adopting a catalog would take more effort than the value it returns
The honest reality: many data catalogs become shelfware. Deployed; not maintained; not used. The metadata is stale; users don't trust it; the catalog adds nothing.
What makes catalogs work in practice
Automated metadata extraction
Manual catalog maintenance fails. The catalog must pull metadata from sources automatically:
- Schema from databases
- Lineage from dbt, Airflow
- Usage from query logs
- Ownership from tags or org structure
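The first item, schema from databases, can be sketched with nothing but the standard library. The example below uses an in-memory SQLite database as a stand-in for a real source, and the table names are hypothetical. Production catalogs ship their own connectors, so treat this as an illustration of the idea rather than how any specific tool does it.

```python
import sqlite3


def extract_schema(conn):
    """Return {table: [(column, declared_type), ...]} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# Stand-in source database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

schema = extract_schema(conn)
```

Run on a schedule, a job like this keeps column lists current without anyone editing the catalog by hand. Comparing two successive snapshots also yields the schema changes worth notifying people about.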
Integrated, not separate
The catalog should integrate with daily tools:
- IDE plugins for SQL editors
- Slack notifications for schema changes
- Dashboard tool integration
A catalog that requires people to leave their tools is rarely used.
Active stewardship
Datasets need owners who maintain documentation. The catalog tracks accountability; the work is human.
Search that actually works
Most users reach the catalog through search. If search is bad, the catalog goes unused.
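To make "search that works" concrete, here is a deliberately small keyword ranker. It is a sketch, not how any catalog implements search: the scoring rule (a name match counts double a description match) and the sample entries are assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(entries, query):
    """Rank entries by matched query terms; name hits weigh more than description hits."""
    terms = tokenize(query)
    scored = []
    for entry in entries:
        name_tokens = set(tokenize(entry["name"]))
        desc_tokens = set(tokenize(entry["description"]))
        score = sum(2 * (t in name_tokens) + (t in desc_tokens) for t in terms)
        if score:
            scored.append((score, entry["name"]))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


# Hypothetical catalog entries.
ENTRIES = [
    {"name": "customer_churn", "description": "Monthly churn flags per customer"},
    {"name": "customer_revenue", "description": "Revenue per customer per month"},
    {"name": "web_events", "description": "Raw clickstream events"},
]

results = search(ENTRIES, "customer churn")
```

Even this toy version shows why documentation matters for discovery: `web_events` is only findable by the word "clickstream" because someone wrote a description.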
Trust
Users trust the catalog when its information is current and accurate. Trust takes time to build; one wrong piece of information loses it.
Adoption patterns
For organizations adopting a catalog:
Start with one team
Pilot with the analytics team. Get feedback; iterate. Don't roll out company-wide before the tool works for one team.
Automate from the start
Don't ask people to manually populate the catalog. They won't. Automate metadata extraction.
Define ownership
Every dataset gets an owner before it's catalogued. Without ownership, the catalog has no maintainer.
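The rule can be enforced at registration time rather than left as a convention. A minimal sketch, assuming a hypothetical in-house `Catalog` class (not the API of any real tool):

```python
class Catalog:
    """Toy catalog that refuses entries without a named owner."""

    def __init__(self):
        self._entries = {}

    def register(self, name, owner):
        # Reject missing or whitespace-only owners before anything is stored.
        if not owner or not owner.strip():
            raise ValueError(f"dataset {name!r} cannot be catalogued without an owner")
        self._entries[name] = {"owner": owner.strip()}

    def owner_of(self, name):
        return self._entries[name]["owner"]


catalog = Catalog()
catalog.register("customer_revenue", "analytics-team")
```

Attempting `catalog.register("orphan_table", "")` raises `ValueError`, so an unowned dataset never appears in the catalog in the first place.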
Measure usage
Are people actually using it? Track search counts, page views, and edits. Low usage suggests the catalog isn't delivering value; investigate why.
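These signals can be rolled up from the catalog's own event log. The sketch below assumes a hypothetical event format with a `type` and a `user` per event; real tools expose usage analytics in their own ways.

```python
from collections import Counter


def adoption_metrics(events):
    """Summarize catalog activity: counts per event type plus distinct users."""
    by_type = Counter(event["type"] for event in events)
    active_users = {event["user"] for event in events}
    return {
        "searches": by_type["search"],
        "page_views": by_type["page_view"],
        "edits": by_type["edit"],
        "active_users": len(active_users),
    }


# Hypothetical events from one week of catalog activity.
EVENTS = [
    {"type": "search", "user": "ana"},
    {"type": "page_view", "user": "ana"},
    {"type": "search", "user": "raj"},
    {"type": "edit", "user": "raj"},
    {"type": "page_view", "user": "raj"},
]

metrics = adoption_metrics(EVENTS)
```

Edits deserve particular attention: searches and views show that people consume the catalog, while edits show that someone is keeping it current.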
Iterate
Catalog tools evolve. The catalog itself should evolve. Add features as needed; remove things nobody uses.
Common failure patterns
- **Catalog without automation.** Manual maintenance fails.
- **Catalog without ownership.** Nobody maintains it.
- **Catalog without search.** Nobody can find anything.
- **Catalog as a one-off project.** Deployed, then abandoned. Becomes shelfware.
- **Buying an enterprise platform when dbt docs would suffice.** Over-engineered.
- **Underfunded catalog initiative.** Half-deployed; never finished.
A reasonable approach
For most organizations:
1. Determine if you actually need a catalog (size, complexity)
2. Start with what you have (dbt docs?) and see if it's enough
3. If a real catalog is needed, prefer open-source (DataHub) or modern SaaS (Atlan)
4. Invest in adoption; don't just deploy
5. Measure usage; cut scope if it's not used
Further Reading
- [DataModelingFundamentals](DataModelingFundamentals) — What you're cataloging
- [MasterDataManagement](MasterDataManagement) — Adjacent governance
- [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering) — dbt's lightweight catalog
- [DataEngineering Hub](DataEngineeringHub) — Cluster index