mdm-and-federated-data-governance


Apply Master Data Management (MDM) styles (Consolidation, Registry, Centralized, Coexistence), federated governance via data contracts and policy automation, data catalog + metalake architecture, knowledge graphs for metadata, semantic layers, and access control models (ACL, RBAC, ABAC + PEP/PDP/PIP/PAP). Use when scoping MDM, choosing an MDM style, designing a data catalog, building governance automation, defining data contracts, or implementing fine-grained access control on data products. Triggers: "MDM strategy", "consolidation vs registry vs centralized vs coexistence", "data contract", "data catalog", "knowledge graph for metadata", "ABAC for data", "semantic layer for governance", "metalake". Produces a chosen MDM style + governance architecture with policy automation.

By AlexYedi. Updated 5/19/2026


MDM and Federated Data Governance

You apply Strengholt's catalog of Master Data Management styles, federated governance patterns, and metadata architectures. The discipline: maintain authoritative master records, automate policy enforcement at the data product level, and surface data through a catalog + semantic layer that consumers can trust.


When to Use This Skill

  • Scoping Master Data Management (MDM) for a multi-domain enterprise
  • Choosing among the four MDM styles
  • Designing a data catalog (or evaluating one)
  • Building automated, federated governance for a Data Mesh
  • Defining data contracts between providers and consumers
  • Implementing fine-grained access control on data products
  • Considering a knowledge graph for metadata (metalake)

The Four MDM Styles

LIGHT TOUCH                                                     HEAVY TOUCH
───────────                                                     ───────────

Registry        Consolidation       Coexistence         Centralized
Style           Style                Style               Style

Cross-          Hub aggregates       Improvements        Transactional
reference       for analytics;       flow back to        hub; cleansed +
table only      no modification      sources             republished
                of sources           bidirectionally     two-way sync

Lowest impact;  Easy analytics;      Authoritative       Most control;
Lowest control  Lowest source        across systems;     Highest impact
                impact               Some source impact  on sources

1. Registry Style

Mechanics: A master ID table maps records across systems. Sources unchanged.

When: Quick visibility into duplicates / inconsistencies. Discovery phase. No appetite for system changes.

Pros: Lowest impact. Fastest to deploy. Cons: Doesn't fix data quality at source. Cross-system consistency stays manual.
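
A minimal sketch of the cross-reference table, assuming an in-memory registry. The class name, the source-system names (`crm`, `billing`), and the IDs are hypothetical; a real registry would sit behind a matching engine and a database.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MasterRegistry:
    """Cross-reference table: (source system, local ID) -> master ID."""
    xref: dict[tuple[str, str], str] = field(default_factory=dict)

    def register(self, system: str, local_id: str, master_id: Optional[str] = None) -> str:
        """Link a source record to a master ID, minting a new one if none is given."""
        key = (system, local_id)
        if key in self.xref:
            return self.xref[key]  # an issued mapping is never overwritten
        master_id = master_id or str(uuid.uuid4())
        self.xref[key] = master_id
        return master_id

    def locals_for(self, master_id: str) -> list[tuple[str, str]]:
        """All source records that map to one master ID."""
        return sorted(k for k, v in self.xref.items() if v == master_id)


registry = MasterRegistry()
mid = registry.register("crm", "C-1001")
registry.register("billing", "88231", master_id=mid)  # same customer, matched elsewhere
```

The sources are never written to; the registry only records which local records refer to the same entity.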

2. Consolidation Style

Mechanics: MDM hub consolidates master data from sources for analytics / reporting. Read-only relative to sources — no propagation back.

When: Single source of truth for downstream analytics. Sources too critical / fragile to modify.

Pros: Low source impact. Clean analytics. Cons: Operational systems still inconsistent. Drift between sources accumulates.
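
One way to picture the hub is as a function that builds a "golden record" for analytics from already-matched source records. The survivorship rule used here (most recently updated non-empty value wins) is an assumption for illustration, as are the record fields.

```python
from datetime import date

# Hypothetical source records for one customer, already matched to a master ID.
source_records = [
    {"system": "crm", "updated": date(2024, 3, 1), "email": "a@example.com", "phone": None},
    {"system": "billing", "updated": date(2024, 5, 9), "email": None, "phone": "+1-555-0100"},
    {"system": "support", "updated": date(2023, 11, 2), "email": "old@example.com", "phone": None},
]


def golden_record(records: list[dict], attributes: list[str]) -> dict:
    """Per attribute, keep the most recently updated non-empty value.

    The hub only reads; nothing is written back to the sources.
    """
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for attr in attributes:
        golden[attr] = next((r[attr] for r in newest_first if r[attr]), None)
    return golden


customer = golden_record(source_records, ["email", "phone"])
```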

3. Coexistence Style

Mechanics: Improvements found in MDM hub flow back to sources via complex integration. Sources become consistent over time.

When: Authoritative records needed across operational systems. Willing to invest in bidirectional integration.

Pros: Operational consistency. Cons: Complex; sources must accept updates from MDM.
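
The flow-back step can be sketched as a diff between the hub's record and each source's copy. This is only an illustration: the record shapes are hypothetical, and the actual delivery of corrections (APIs, events, batch files) is the complex integration the text refers to.

```python
def corrections_for_sources(golden: dict, records: list[dict]) -> dict[str, dict]:
    """For each source system, list the attributes that differ from the hub's record."""
    corrections = {}
    for rec in records:
        diff = {k: v for k, v in golden.items() if rec.get(k) != v}
        if diff:
            corrections[rec["system"]] = diff
    return corrections


golden = {"email": "a@example.com", "country": "DE"}
records = [
    {"system": "crm", "email": "a@example.com", "country": "Germany"},
    {"system": "billing", "email": "a@example.com", "country": "DE"},
]
updates = corrections_for_sources(golden, records)  # only crm needs a correction
```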

4. Centralized Style

Mechanics: Transactional hub. Master data is created and modified in MDM, then published to source systems via two-way sync.

When: Strict regulatory requirement for single source. Greenfield rebuild possible.

Pros: Highest control. Single point of truth. Cons: Highest impact. Often impractical to retrofit.


Choosing an MDM Style — Pragmatic Path

Most enterprises follow this maturation:

Start ──► Registry ──► Consolidation ──► (sometimes) Coexistence ──► Centralized
          (visibility) (analytics-       (operational                (only if
                        clean)            consistency)                greenfield or
                                                                      regulatory)

Strengholt's heuristic: Don't try to unify all enterprise data. Select only the stable, critical, broadly-shared data elements for MDM scope. Customer, Product, Account — these are typical. Transaction-level data is usually not MDM scope.

Practical Tactics

  • Master identifier centrally. Issue unique, immutable IDs. Map to source-system local IDs.
  • Bake data quality in. MDM hub validates on ingest. Reject or flag bad records.
  • Define ownership early. Every master record has a domain owner. Modifications go through them.
  • Start narrow. 5 critical master entities. Expand as you learn.
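
Validating on ingest might look like the sketch below. The rules and field names are assumptions chosen for illustration: a record that cannot be cross-referenced is rejected, while other defects are flagged for stewardship.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_customer(record: dict) -> tuple[str, list[str]]:
    """Return ("accept" | "flag" | "reject", reasons) for an incoming record."""
    if not record.get("source_id"):
        return "reject", ["missing source_id"]  # cannot be mapped to a master ID
    reasons = []
    if not record.get("name"):
        reasons.append("missing name")
    email = record.get("email")
    if email and not EMAIL_RE.match(email):
        reasons.append("malformed email")
    return ("flag" if reasons else "accept"), reasons
```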

The Data Catalog and Metalake Architecture

Data Catalog

An inventory of data products with metadata:

  • Business terms (glossary)
  • Owners
  • Origins / lineage
  • Classifications (PII, SOX, GDPR-restricted)
  • SLAs (freshness, quality, availability)
  • Schema + sample
  • Access requests

Tools: DataHub (open-source), Atlan, Collibra, Alation, Apache Atlas.
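
As a rough illustration, a catalog entry can be modeled as a record with the fields listed above plus a completeness check. The class and field names are hypothetical and do not reflect the data model of any of the tools named.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogEntry:
    name: str
    owner: str = ""
    glossary_terms: list[str] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)         # lineage
    classifications: list[str] = field(default_factory=list)  # e.g. "PII"
    freshness_sla_hours: Optional[int] = None
    schema: dict[str, str] = field(default_factory=dict)      # column -> type

    def missing_metadata(self) -> list[str]:
        """Names of required fields that are still empty."""
        required = ["owner", "glossary_terms", "classifications", "freshness_sla_hours", "schema"]
        return [name for name in required if getattr(self, name) in ("", None, [], {})]


entry = CatalogEntry(name="customer_360", owner="customer-domain", schema={"customer_id": "string"})
gaps = entry.missing_metadata()
```

A check like `missing_metadata` is the kind of gate that keeps entries from being published without an owner or classification.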

Metalake Architecture

                      [Marketplace Layer]
                       (visualization +
                        consumption UI)
                              ▲
                              │
                       [Knowledge Graph]
                       (semantic relationships
                        across all metadata)
                              ▲
                              │
                      [Processed Zone]
                      (cleansed, integrated metadata)
                              ▲
                              │
                      [Landing Zone]
                      (raw metadata from
                       all sources)

The idea: Treat metadata itself as data. Apply lakehouse architecture (Bronze / Silver / Gold) to metadata. Use a knowledge graph as the unified semantic layer.

When valuable: Large enterprise with hundreds of data products and complex relationships. Heavy compliance load.

Caveat: Heavy. Most companies don't need it. A single catalog tool with manual relationships is enough until it isn't.
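
To make the zones concrete, here is a small sketch of metadata moving from landing to processed to graph edges. The two emitting tools and their field names are invented for the example.

```python
# Landing zone: raw metadata as each (hypothetical) tool emits it.
landing = [
    {"tool": "warehouse", "TABLE_NAME": "CUSTOMER_360", "OWNER_TEAM": "Customer"},
    {"tool": "scheduler", "dataset": "customer_360", "reads_from": ["crm_export"]},
]


def to_processed(raw: dict) -> dict:
    """Processed zone: one normalized shape regardless of the emitting tool."""
    if raw["tool"] == "warehouse":
        return {"asset": raw["TABLE_NAME"].lower(), "owner": raw["OWNER_TEAM"].lower(), "upstream": []}
    return {"asset": raw["dataset"].lower(), "owner": None, "upstream": list(raw["reads_from"])}


def to_triples(processed: list[dict]) -> set[tuple[str, str, str]]:
    """Knowledge-graph layer: (subject, predicate, object) edges."""
    triples = set()
    for p in processed:
        if p["owner"]:
            triples.add((p["asset"], "owned_by", p["owner"]))
        for up in p["upstream"]:
            triples.add((p["asset"], "derived_from", up))
    return triples


triples = to_triples([to_processed(r) for r in landing])
```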


Knowledge Graphs for Metadata

A knowledge graph models data products and their relationships as nodes and edges in a semantic graph: either RDF/OWL triples queried with SPARQL, or a property-graph database.

[Customer Data Product]
    │
    ├─owned_by──► [Customer Domain]
    ├─complies_with──► [GDPR]
    ├─derived_from──► [CRM Source System]
    ├─produces──► [Customer 360 View]
    └─used_by──► [Marketing Campaign Service]

Why use it:

  • Cross-cuts traditional table/document silos
  • Powers "show me everything that depends on this" queries
  • Enables federated semantic search
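
The "show me everything that depends on this" query is a graph traversal. The sketch below runs it over the edges from the diagram above using plain Python; treating `produces` and `used_by` as downstream edges and `derived_from` as an upstream edge is an assumption about how the predicates should be read.

```python
from collections import deque

# (subject, predicate, object) edges taken from the diagram above.
EDGES = [
    ("Customer Data Product", "owned_by", "Customer Domain"),
    ("Customer Data Product", "complies_with", "GDPR"),
    ("Customer Data Product", "derived_from", "CRM Source System"),
    ("Customer Data Product", "produces", "Customer 360 View"),
    ("Customer Data Product", "used_by", "Marketing Campaign Service"),
]

DOWNSTREAM = {"produces", "used_by"}  # object depends on subject
UPSTREAM = {"derived_from"}           # subject depends on object


def dependents(node: str, edges=EDGES) -> set[str]:
    """Everything that transitively depends on `node` (breadth-first search)."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for subj, pred, obj in edges:
            if subj == current and pred in DOWNSTREAM:
                nxt = obj
            elif obj == current and pred in UPSTREAM:
                nxt = subj
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


impacted = dependents("CRM Source System")
```

In a graph database the same question would be a single path query; the point here is only the shape of the traversal.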

Implementation options:

  • RDF/SPARQL: GraphDB, Stardog. Pure semantic web; standards-aligned.
  • Property graph: Neo4j, Amazon Neptune. Easier to operate.
  • Both via gateway: Some platforms support both RDF/SPARQL and property-graph query languages.

When to bother: Heavy metadata complexity (regulatory, multi-source). Otherwise, a relational catalog is sufficient.


Semantic Layer (Beyond BI Metrics)

The semantic layer is usually introduced as a metric-definition layer for BI. For governance it extends to:

  • Business glossary: "Customer" defined once; consistent across all surfaces
  • Data products: Logical entity in metamodel; multiple physical implementations OK
  • Lineage: Semantic, not just column-level
  • Access policies: Bound to semantic roles, not physical tables

Strengholt's framing: Define data products as logical entities in a metamodel, linked to glossary terms and technical attributes. Provides flexibility — physical implementation can change without metadata churn.
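
A small sketch of that idea, with hypothetical class names and locations: consumers resolve a logical product by name, and the physical location can be swapped without touching the glossary links.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalImplementation:
    platform: str  # e.g. "snowflake", "kafka"
    location: str  # table name, topic, or path


@dataclass
class LogicalDataProduct:
    name: str
    glossary_terms: list[str]
    implementations: list[PhysicalImplementation] = field(default_factory=list)

    def resolve(self, platform: str) -> str:
        """Look up where this product physically lives on one platform."""
        for impl in self.implementations:
            if impl.platform == platform:
                return impl.location
        raise LookupError(f"{self.name} has no implementation on {platform}")


product = LogicalDataProduct(
    name="customer",
    glossary_terms=["Customer"],
    implementations=[
        PhysicalImplementation("snowflake", "analytics.customer_v2"),
        PhysicalImplementation("kafka", "customer.events"),
    ],
)
```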


Federated Computational Governance

Three components, working together:

1. Data Contracts

Provider commits to:

  • Schema (versioned)
  • SLA (freshness, quality, availability)
  • Classifications (PII, sensitivity)

Consumer commits to:

  • Acceptable usage
  • Notification of breaking-change needs
  • Compliance with classifications

Tooling: Bitol Project's Open Data Contract Standard (ODCS), Data Contract CLI, custom YAML in dbt, etc.

Practical: Treat data contract YAML files like API contracts. Version control. PRs. Breaking changes follow deprecation cycle.
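
A breaking-change check of the kind a PR pipeline could run might look like this. It compares two schema versions expressed as plain column-to-type mappings, which is a simplification and not the ODCS format; the rule that additions are safe while removals and type changes are breaking is an assumption.

```python
def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """Compare two schema versions (column -> type) and list breaking changes.

    Assumed rule: removing a column or changing its type breaks consumers;
    adding a column does not.
    """
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type change: {column} {old_type} -> {new_schema[column]}")
    return problems


v1 = {"customer_id": "string", "email": "string", "signup_date": "date"}
v2 = {"customer_id": "string", "email": "string", "signup_date": "timestamp", "tier": "string"}
issues = breaking_changes(v1, v2)
```

A non-empty result would fail the check and push the change into the deprecation cycle.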

2. Policy Automation (ABAC / OPA)

ACCESS CONTROL EVOLUTION:

ACL (Access Control Lists)
   ↓
RBAC (Role-Based Access Control)
   ↓
ABAC (Attribute-Based Access Control)

ABAC architecture:

[Subject] ──request──► [Policy Enforcement Point (PEP)]
                              │
                              ▼
                       [Policy Decision Point (PDP)]
                              │
                              ├──queries──► [Policy Information Point (PIP)]
                              │
                              ├──reads──► [Policy Administration Point (PAP)]
                              │
                              ▼
                       [Allow / Deny]

Components:

  • PEP: Where access is enforced. Often the data plane (warehouse, API gateway).
  • PDP: Evaluates policies. Open Policy Agent (OPA) is the canonical OSS implementation.
  • PIP: Provides data for decision (e.g., user attributes, data classifications).
  • PAP: Where policies are authored and registered. Often a UI or Git repo.

Practical: OPA + Rego policies. Domain-team-authorable. Version-controlled. Auto-applied at every data touch.
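
The four roles can be illustrated in a few lines of plain Python. This is a stand-in, not OPA: in practice the `decide` function would be a call to the policy engine and the policies would be written in Rego. All users, datasets, and policy rules below are hypothetical.

```python
from typing import Callable

# PIP: attribute lookups (hypothetical users and datasets).
USER_ATTRS = {
    "ana": {"dept": "marketing", "region": "EU"},
    "bob": {"dept": "marketing", "region": "US"},
}
DATA_ATTRS = {"customer_emails": {"classification": "PII", "domain": "marketing"}}

Policy = Callable[[dict, dict, str], bool]


# PAP: where policies are registered. Here it is just a list in code.
def same_domain(user: dict, data: dict, action: str) -> bool:
    return user["dept"] == data["domain"]


def pii_only_in_eu(user: dict, data: dict, action: str) -> bool:
    return data["classification"] != "PII" or user["region"] == "EU"


POLICIES: list[Policy] = [same_domain, pii_only_in_eu]


def decide(user_id: str, dataset: str, action: str) -> bool:
    """PDP: allow only if every registered policy allows."""
    user, data = USER_ATTRS[user_id], DATA_ATTRS[dataset]
    return all(policy(user, data, action) for policy in POLICIES)


def read_dataset(user_id: str, dataset: str) -> str:
    """PEP: sits in front of the data and enforces the PDP's decision."""
    if not decide(user_id, dataset, "read"):
        raise PermissionError(f"{user_id} may not read {dataset}")
    return f"rows of {dataset}"
```

Keeping the decision separate from the enforcement point is what lets the same policies apply at the warehouse, the gateway, and anywhere else data is touched.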

3. Data Contract Application (DCA)

A standalone application that acts as the PAP + PIP for data contracts:

  • Domains register their products + contracts
  • Consumers register usage agreements
  • DCA exposes them to PEPs (warehouses, gateways) for enforcement

Benefit: Self-serve governance. Domains don't ask a central committee; they register a contract.
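
A minimal sketch of such an application, assuming an in-memory store. The class and method names are hypothetical; a real DCA would persist contracts and expose this lookup over an API.

```python
from dataclasses import dataclass, field


@dataclass
class DataContractApp:
    contracts: dict[str, dict] = field(default_factory=dict)       # product -> contract
    agreements: set[tuple[str, str]] = field(default_factory=set)  # (consumer, product)

    def register_product(self, product: str, owner: str, classifications: list[str]) -> None:
        """Called by the providing domain."""
        self.contracts[product] = {"owner": owner, "classifications": classifications}

    def register_usage(self, consumer: str, product: str) -> None:
        """Called by the consuming team; requires an existing contract."""
        if product not in self.contracts:
            raise KeyError(f"no contract registered for {product}")
        self.agreements.add((consumer, product))

    def lookup(self, consumer: str, product: str) -> dict:
        """What an enforcement point would ask before serving data."""
        return {
            "has_agreement": (consumer, product) in self.agreements,
            "classifications": self.contracts.get(product, {}).get("classifications", []),
        }


dca = DataContractApp()
dca.register_product("customer_360", owner="customer-domain", classifications=["PII"])
dca.register_usage("marketing", "customer_360")
```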


Domain Data Stores (DDS)

Distinct from data products: DDS is a consumer-side store that ingests, transforms, and stores data for a specific use case (a department's reporting model, a feature store for an ML team).

Key distinction:

  • Data Product: Stable, owned by source domain, contract-bound
  • Domain Data Store: Specific to a consumer, may be transient, owned by consumer

Both are valid. The mistake is treating one as the other.


Principles

  • Define ownership before the workflow. A new dataset without an owner is a future fire.
  • Master data is narrow by design. Don't try to MDM everything. 5-10 entities.
  • Bake data quality in at the producer. Pushing it downstream multiplies effort.
  • Data contracts are API contracts. Version, deprecate, communicate.
  • Federated governance via automation. Manual approvals don't scale.
  • Glossary first; tooling second. A shared business glossary is the foundation of every other catalog feature.
  • Semantic > syntactic. Metadata should describe meaning, not just structure.
  • Logical data product > physical table. Allows underlying technology to change.

Anti-Patterns

MDM as Big Bang

Looks like: "Let's MDM everything." 18-month project. Scope creep. Eventually canceled.

Why it fails: MDM scope explodes when undisciplined.

The fix: Start with Registry style. 5 entities. 90 days to first value. Expand from there.

Catalog Without Owners

Looks like: Adopting DataHub/Atlan/Collibra. Loading metadata. Nobody owns the entries. They go stale.

Why it fails: A catalog is a maintenance commitment. Without owners, it's a graveyard.

The fix: Every entry has an owner. Owners review quarterly. Stale entries archived.

RBAC When ABAC is Needed

Looks like: "Marketing role can read all marketing data" — but compliance requires PII to be tokenized for non-EU teams.

Why it fails: Role-based can't express attribute-based constraints.

The fix: ABAC for fine-grained policies. PEP at data plane. OPA for policy logic.
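
The scenario above can be sketched as follows: the role decides whether access is granted at all, and a user attribute decides whether the value comes back raw or tokenized. The tokenizer is a stand-in (a truncated SHA-256 digest), not a production tokenization scheme, and the user shape is hypothetical.

```python
import hashlib


def tokenize(value: str) -> str:
    """Stand-in tokenizer: a truncated SHA-256 digest."""
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:12]


def read_email(user: dict, email: str) -> str:
    """Role grants access; the region attribute decides raw vs. tokenized."""
    if "marketing" not in user["roles"]:
        raise PermissionError("marketing role required")
    return email if user["region"] == "EU" else tokenize(email)


eu_view = read_email({"roles": ["marketing"], "region": "EU"}, "a@example.com")
us_view = read_email({"roles": ["marketing"], "region": "US"}, "a@example.com")
```

Expressing this in pure RBAC would require a separate role per region and classification, which is the combinatorial growth ABAC avoids.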

Data Contracts as Documentation Only

Looks like: YAML data contract files. Nobody enforces. Contracts drift from reality.

Why it fails: Contracts must be enforced or they're theater.

The fix: CI checks. Schema validation at ingestion. Breaking changes blocked at PR.
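
Schema validation at ingestion can be as simple as the sketch below. The contract fragment and its column names are hypothetical, and the contract is expressed as Python types for brevity rather than in a contract file format.

```python
# Hypothetical contract fragment: column -> (python type, required?)
CONTRACT_SCHEMA = {
    "customer_id": (str, True),
    "email": (str, False),
    "lifetime_value": (float, True),
}


def violations(row: dict, schema=CONTRACT_SCHEMA) -> list[str]:
    """Check one incoming row against the contract's schema."""
    found = []
    for column, (expected_type, required) in schema.items():
        value = row.get(column)
        if value is None:
            if required:
                found.append(f"{column}: missing")
        elif not isinstance(value, expected_type):
            found.append(f"{column}: expected {expected_type.__name__}, got {type(value).__name__}")
    for column in sorted(row.keys() - schema.keys()):
        found.append(f"{column}: not in contract")
    return found
```

Rows with violations would be rejected or quarantined, so the contract and the data cannot silently drift apart.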

Semantic Layer Without Business Buy-in

Looks like: Engineering builds a semantic layer. Business teams keep their own metric definitions in spreadsheets.

Why it fails: Two sources of truth. Engineering's semantic layer is rejected.

The fix: Co-author semantic definitions with business. Migrate spreadsheet metrics through co-design, not over-the-wall.

Knowledge Graph as First Move

Looks like: Adopt RDF + SPARQL + GraphDB before having a working catalog.

Why it fails: Heavy stack with steep learning curve. Premature optimization.

The fix: Start with relational catalog. Adopt knowledge graph when complexity genuinely warrants.

Treating Domain Data Stores as Products

Looks like: Marketing builds a "marketing data store" for its reporting. Other teams query it as if it's a stable data product.

Why it fails: DDS isn't a contract-bound product. Schema changes silently. Consumers break.

The fix: Distinguish DDS from data products explicitly. Products have contracts; DDS doesn't.


Decision Rules

| Situation | Action |
| --- | --- |
| First MDM effort | Registry style. 5 entities. 90-day target. |
| MDM goal: clean analytics | Consolidation style. |
| MDM goal: cross-system operational consistency | Coexistence (if sources accept) or Centralized (greenfield). |
| Greenfield with strict regulatory requirements | Centralized MDM possible if app-layer cooperation exists. |
| Scoping MDM | Pick stable, broadly-shared, critical entities. Reject transactional. |
| Data catalog adoption | Pick one tool (DataHub if OSS-leaning). Establish ownership process before loading metadata. |
| Multi-team metric drift | Adopt semantic layer (LookML / dbt Semantic Layer / Cube). Migrate spreadsheet metrics via co-design. |
| Fine-grained access requirement | ABAC. OPA + Rego. PEP at data plane. |
| Cross-system metadata complexity | Knowledge graph; otherwise stick with relational catalog. |
| Provider/consumer drift complaints | Adopt formal data contracts. Enforce in CI. |
| New cross-domain data product | Define owner first. Then schema. Then contract. Then implement. |
| Existing PII spread | Tokenize at ingest. ABAC policy: only specific roles see raw. |
| Compliance audit incoming | Catalog must show: data inventory, owners, classifications, lineage, access logs. |
| Domain wants its own store | DDS, but be explicit that it is not a data product (no cross-domain contract). |

Worked Example: Federated Governance for a Pharma's Data Mesh

Context: Global pharma, regulatory-heavy. Multiple business units (Research, Clinical Trials, Commercial, Supply Chain). 30 data products envisioned.

Architecture:

| Component | Choice |
| --- | --- |
| MDM style for "Customer" / "Product" / "Site" | Coexistence: research and commercial both modify; both must converge |
| MDM style for transactional data | Not in MDM scope |
| Catalog | DataHub OSS: extensible, REST API for integration |
| Knowledge graph for metadata | Yes; pharma's regulatory complexity warrants it. Stardog over Neptune (RDF/SPARQL native). |
| Data Contract format | Open Data Contract Standard (ODCS). Stored in Git. |
| Policy enforcement | OPA at the warehouse boundary (Snowflake row + column policies driven by OPA). |
| PEP locations | Warehouse, API gateway, BI tool query layer. |
| PAP | DataHub (catalog) + GitOps repo (Rego policies). |
| PIP | Identity provider (Okta) for user attributes; DataHub for data classifications. |
| Semantic layer | dbt Semantic Layer for metrics; pharma-specific glossary in DataHub. |
| Lineage | OpenLineage emitted from Airflow/dbt; ingested into DataHub + knowledge graph. |
| Access flow | User requests access → DataHub portal → policy auto-evaluates → JIT grant or human review → audit. |

First quarter scope: 5 master entities. 8 highest-value data products. Full ABAC-enforced.
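
The access flow in this example (auto-evaluate, then a just-in-time grant or human review, then audit) could be sketched as below. The compliance rule, the `pii_training` attribute, and the 8-hour grant window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

AUDIT_LOG: list[dict] = []


def handle_access_request(user: dict, dataset: dict, now: datetime) -> dict:
    """Auto-grant compliant requests for a limited time; route the rest to a human."""
    compliant = dataset["classification"] != "PII" or user.get("pii_training") is True
    if compliant:
        outcome = {"decision": "granted", "expires": now + timedelta(hours=8)}
    else:
        outcome = {"decision": "human_review", "expires": None}
    # Every request is audited, whatever the outcome.
    AUDIT_LOG.append({"user": user["id"], "dataset": dataset["name"], "at": now, **outcome})
    return outcome


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
sites = {"name": "trial_sites", "classification": "PII"}
granted = handle_access_request({"id": "ana", "pii_training": True}, sites, now)
queued = handle_access_request({"id": "bob"}, sites, now)
```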

Why it works: Heavy regulatory load justifies the metalake + knowledge graph investment. Without that load, this would be over-engineering.

Lesson: Metalake / knowledge graph approach matches regulated industries where metadata complexity dominates. Lighter approach for less-regulated contexts.


Gotchas

  • MDM is a multi-year journey, not a project. Plan for ongoing investment.
  • Data contracts only work with cultural buy-in. YAML files alone won't change behavior.
  • OPA performance matters. Per-query policy evaluation can add latency. Cache decisions; batch policies.
  • Knowledge graphs require ongoing curation. Outdated nodes / relationships become misleading.
  • Catalog adoption is the hardest part. Tooling is the easy part. Owner enrollment is the hard part.
  • Semantic layer + LookML lock-in: moving off Looker is non-trivial if all metrics are LookML.
  • DCA + ABAC can become governance bottlenecks if approval workflows are heavy. Default to auto-grant for compliant requests.
  • MDM "single source of truth" is nuanced. "Single source of definition" is the goal; physical implementations may vary by use case.
  • Consolidation MDM still leaves operational drift. Decide if that's acceptable; if not, you need Coexistence or Centralized.
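
For the policy-latency gotcha, one common mitigation is a short-lived decision cache in front of the policy engine. The sketch below is a generic TTL cache, not an OPA feature; `evaluate` stands for whatever call reaches the policy engine, and the 60-second default is arbitrary.

```python
import time


class DecisionCache:
    """Cache allow/deny decisions for a short TTL to avoid re-evaluating per query."""

    def __init__(self, evaluate, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._evaluate = evaluate  # the (slow) call to the policy engine
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: dict[tuple, tuple[float, bool]] = {}

    def allowed(self, user: str, dataset: str, action: str) -> bool:
        key = (user, dataset, action)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        decision = self._evaluate(user, dataset, action)
        self._entries[key] = (now, decision)
        return decision
```

The trade-off is staleness: a revoked permission or changed attribute can remain in effect until the entry expires, so the TTL should be chosen with that in mind.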

Further Reading

  • Data Management at Scale (Strengholt), Chapters 5-7
  • Building the Data Lakehouse by Inmon, Levins, Srivastava — for metalake / catalog approaches
  • Open Policy Agent (OPA) documentation — Rego language and integration patterns
  • Driving Data Quality with Data Contracts by Andrew Jones (Packt, 2023): a widely used data contracts reference
  • Building Knowledge Graphs by Barrasa & Webber (O'Reilly, 2023): knowledge graph foundations
  • See architecture/data-mesh-topologies for the topology context where this governance operates
  • See architecture/integration-patterns for the integration patterns governance constrains
  • See governance-and-quality/data-quality-auditor for the audit-side companion — when policy is set here, that skill is what you run to confirm the data actually meets the contract.

Source: Data Management at Scale (Strengholt), Chapters 5-7 (MDM, Governance, Metadata).

Install via CLI
npx skills add https://github.com/AlexYedi/alex-agents-skills --skill mdm-and-federated-data-governance