Knowledge Graph Medallion Architecture DCAT · PROV-O Open Standards

The Semantic Medallion

Building a Knowledge Graph-Powered Data Catalog — how to transform raw data sources into a unified knowledge graph in four lines of Python.

🏗️ Architecture Layers

The medallion layers are reframed as semantic enrichment stages: Bronze (no semantics), Silver (local semantics via identifiers), Gold (global semantics via shared vocabulary) — and the Platinum Layer extends this further into a full Semantic Web.

Bronze
No Semantics
Standard raw data ingestion via orchestration tools. Data is ingested in its original format with no semantic transformation. Linked to raw source datasets via DCAT.
Silver
Local Semantics
Beyond cleaning and typing: stable IRIs are minted for every entity — creating join keys that work across all systems. This is the critical implementation step.
Gold
Global Semantics
Silver DataFrames mapped to a shared ontology via OTTR templates and maplib. Published as RDF. The relationships are in the data, not in join logic.

💠 Platinum Layer — Semantic Web

Platinum
Linked Data Semantics
The RDF-based Knowledge Graph is deployed using Linked Data principles — leveraging the power of hyperlinks for entity and relationship naming. IRIs become dereferenceable, globally unique identifiers, and the knowledge graph becomes a node in the wider Semantic Web.
"And the Platinum Layer is when the RDF based Knowledge Graph is deployed using Linked Data principles i.e., knowledge construction that leverages the power of hyperlinks for entity and relationship naming that manifests a Semantic Web."

💡 Core Concepts

🕸️ Semantic Medallion

Extension of Bronze/Silver/Gold where the Gold layer becomes a connected knowledge graph. Relationships embedded in data, not in external join logic.

🔖 IRI Minting

Creating stable, globally unique identifiers for every entity in the Silver layer — replacing database auto-increment IDs with cross-system identifiers.
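
As a rough illustration of what minting could look like, the sketch below derives a deterministic IRI from an entity type and a stable business key. The namespace https://data.example.com/id/ and the function name mint_iri are assumptions made for this example, not something the article prescribes.

```python
from urllib.parse import quote

# Hypothetical namespace; replace with a domain your organisation controls.
BASE = "https://data.example.com/id/"


def mint_iri(entity_type: str, natural_key: str) -> str:
    """Build a deterministic IRI from an entity type and a stable business key."""
    # Normalise the key so the same entity always yields the same IRI,
    # regardless of which source system supplied it.
    key = natural_key.strip().upper()
    # Percent-encode anything that is unsafe inside a single path segment.
    return f"{BASE}{quote(entity_type.lower(), safe='')}/{quote(key, safe='')}"


print(mint_iri("Customer", " c-001 "))
```

Because the IRI depends only on the type and the normalised key, re-running the Silver job produces the same identifier every time, which is what makes it usable as a cross-system join key.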

🔗 Entity Resolution

Using owl:sameAs to link entities across CRM, billing, and external registries — a unified view of every record across all systems.

📋 Data Provenance

Lineage tracked via PROV-O as queryable RDF triples — no separate lineage tool needed.
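
A minimal sketch of lineage expressed as triples, assuming the prov:wasDerivedFrom and prov:hadPrimarySource chain used in the lineage query later on this page. The example IRIs and the helper name derivation_triples are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"


def derivation_triples(entity: str, source: str, dataset: str) -> list[str]:
    """Return N-Triples lines linking an entity to its source record and raw dataset."""
    return [
        # Gold entity was derived from a Silver source record.
        f"<{entity}> <{PROV}wasDerivedFrom> <{source}> .",
        # That source record traces back to a Bronze dataset.
        f"<{source}> <{PROV}hadPrimarySource> <{dataset}> .",
    ]


# Hypothetical IRIs, for illustration only.
lines = derivation_triples(
    "https://data.example.com/id/customer/C-001",
    "https://data.example.com/id/source/crm/C-001",
    "https://data.example.com/id/dataset/crm-raw",
)
print("\n".join(lines))
```

Lines in this form can be loaded alongside the instance data, so lineage is answered by the same store that answers business questions.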

📦 Semantic Data Catalog

Metadata and data unified in the same graph. The catalog is part of the knowledge graph — not a separate system pointing at data.
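
To make the "same graph" idea concrete, here is a toy sketch in which instance data and DCAT catalog metadata sit in one set of triples and are traversed with the same lookup function. The IRIs are hypothetical, the in-memory set stands in for a real triplestore, and the customer is linked directly to its dataset for brevity.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT = "http://www.w3.org/ns/dcat#"
PROV = "http://www.w3.org/ns/prov#"
SCHEMA = "http://schema.org/"
EX = "https://data.example.com/id/"  # hypothetical namespace

CUSTOMER = EX + "customer/C-001"
DATASET = EX + "dataset/crm-customers"
DISTRIBUTION = EX + "distribution/crm-customers-parquet"

# One graph holds both the business data and the catalog metadata.
GRAPH = {
    (CUSTOMER, SCHEMA + "name", "Acme Corporation"),   # instance data
    (DATASET, RDF_TYPE, DCAT + "Dataset"),             # catalog metadata
    (DATASET, DCAT + "distribution", DISTRIBUTION),    # catalog metadata
    (CUSTOMER, PROV + "hadPrimarySource", DATASET),    # link between the two
}


def objects(subject: str, predicate: str) -> list[str]:
    """Return all objects for a given subject and predicate."""
    return sorted(o for s, p, o in GRAPH if s == subject and p == predicate)


def distributions_for(entity: str) -> list[str]:
    """Walk from an entity to the distributions of its primary source datasets."""
    found = []
    for dataset in objects(entity, PROV + "hadPrimarySource"):
        found.extend(objects(dataset, DCAT + "distribution"))
    return found


print(distributions_for(CUSTOMER))
```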

🐍 DataFrame-to-RDF

Map Silver DataFrames to RDF using OTTR templates and maplib. Complete Silver-to-Gold transformation in 4 lines of Python.
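
The sketch below is not maplib and not stOTTR syntax. It is a plain-Python stand-in, under assumed names, for what applying a declarative template to tabular rows conceptually does: each row becomes a small set of triples. The template layout, the Customer class IRI, and the expand function are all hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"

# Hypothetical declarative template: which column holds the subject IRI,
# which class each row instantiates, and which column maps to which property.
CUSTOMER_TEMPLATE = {
    "subject": "iri",
    "type": "https://data.example.com/ontology/Customer",
    "properties": {"name": SCHEMA + "name", "email": SCHEMA + "email"},
}


def expand(template: dict, rows: list[dict]) -> list[tuple]:
    """Turn each row into (subject, predicate, object) triples per the template."""
    triples = []
    for row in rows:
        subject = row[template["subject"]]
        triples.append((subject, RDF_TYPE, template["type"]))
        for column, predicate in template["properties"].items():
            value = row.get(column)
            if value is not None:  # skip missing cells rather than emit empty facts
                triples.append((subject, predicate, value))
    return triples


# A list of dicts stands in for a Silver DataFrame.
silver_rows = [
    {"iri": "https://data.example.com/id/customer/C-001",
     "name": "Acme Corporation", "email": None},
]
for triple in expand(CUSTOMER_TEMPLATE, silver_rows):
    print(triple)
```

The point of the declarative shape is that adding a column to the graph means editing the template, not the loop.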

Semantic Catalog Capabilities

Entity Resolution Across Sources

Using owl:sameAs to link entities in CRM, billing, and external registries. Query: "Show me everything we know about this record."
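
One way to picture what a set of owl:sameAs links gives you is to group identifiers into identity clusters. The sketch below does this with a small union-find; the prefixed identifiers are made up for illustration, and a real triplestore would typically handle this through reasoning or query rewriting rather than application code.

```python
def same_as_clusters(pairs):
    """Group identifiers connected by sameAs links into identity clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Hypothetical sameAs assertions between Gold, CRM, and billing identifiers.
links = [
    ("gold:C-001", "crm:0017"),
    ("gold:C-001", "billing:cus_9"),
    ("gold:C-002", "crm:0042"),
]
for cluster in same_as_clusters(links):
    print(sorted(cluster))
```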

Semantic Search via Subclass Hierarchies

Ontology encodes subclass relationships — searching for ComplianceRelated automatically returns ComplianceOfficer, AuditRecord, and RegulatoryFiling subtypes.
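
The expansion from a parent class to its subtypes is a transitive walk over subclass links. The sketch below shows that walk in plain Python using the class names from the paragraph above; QuarterlyFiling is a hypothetical extra level added only to show that the walk is transitive.

```python
from collections import defaultdict

# (child, parent) pairs standing in for rdfs:subClassOf triples.
SUBCLASS_OF = [
    ("ComplianceOfficer", "ComplianceRelated"),
    ("AuditRecord", "ComplianceRelated"),
    ("RegulatoryFiling", "ComplianceRelated"),
    ("QuarterlyFiling", "RegulatoryFiling"),  # hypothetical, for illustration
]


def subclasses(root, edges):
    """Return every direct and indirect subclass of root."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)

    seen, stack = set(), [root]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                stack.append(child)
    return seen


print(sorted(subclasses("ComplianceRelated", SUBCLASS_OF)))
```

In SPARQL 1.1 the same expansion is usually written with a property path such as rdfs:subClassOf*, so the application does not need to walk the hierarchy itself.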

Impact Analysis

Trace graph relationships to find downstream dependencies when schemas change — every dataset, report, and application affected by a source modification.
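
Impact analysis of this kind amounts to a breadth-first traversal over "depends on" edges. The following sketch assumes a hypothetical edge list of (upstream, dependent) pairs and reports each affected asset with its distance from the changed source.

```python
from collections import deque


def downstream(start, edges):
    """Return {asset: hops} for everything that depends on start, directly or not."""
    dependents = {}
    for upstream, dependent in edges:
        dependents.setdefault(upstream, []).append(dependent)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in distance:  # first visit is the shortest path in BFS
                distance[nxt] = distance[current] + 1
                queue.append(nxt)

    del distance[start]  # the changed source itself is not "downstream"
    return distance


# Hypothetical dependency edges, for illustration only.
EDGES = [
    ("bronze/crm_raw", "silver/customers"),
    ("silver/customers", "gold/customer_graph"),
    ("gold/customer_graph", "report/churn"),
    ("gold/customer_graph", "app/support_portal"),
    ("bronze/billing_raw", "silver/invoices"),
]
print(downstream("bronze/crm_raw", EDGES))
```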

🔍 SPARQL Query Examples

All queries below are executable against the companion ontology and instance data in semantic-medallion-ontology-instance-data.ttl.

Retrieve every fact about customer C-001 (Acme Corporation) from every data source, unified under one identifier. Uses the DESCRIBE form to return all triples: name, email, phone, address, revenue, contracts, billing issues, account manager, and owl:sameAs links to CRM and billing representations.

    PREFIX : <https://moderndata101.substack.com/p/the-semantic-medallion#>
    PREFIX schema: <http://schema.org/>
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX dcat: <http://www.w3.org/ns/dcat#>

    DESCRIBE :C-001
    FROM <https://linkeddata.uriburner.com/DAV/demos/daas/the-semantic-medallion.ttl>

Returns all triples about Acme Corporation (C-001) from the instance data.

Trace which sources contribute to customer records by following the provenance path from Gold entities back through Silver to Bronze datasets. Uses prov:wasDerivedFrom and prov:hadPrimarySource chains with DCAT distributions.

    PREFIX : <https://moderndata101.substack.com/p/the-semantic-medallion#>
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX dcat: <http://www.w3.org/ns/dcat#>

    SELECT ?source ?dataset ?distribution
    FROM <https://linkeddata.uriburner.com/DAV/demos/daas/the-semantic-medallion.ttl>
    WHERE {
      :C-001 prov:wasDerivedFrom ?source .
      ?source prov:hadPrimarySource ?dataset .
      ?dataset dcat:distribution ?distribution .
    }

Returns: CRM-Salesforce and Billing-Stripe sources, Bronze raw datasets, and their distributions.

Find all customers with open billing issues who also have active contracts, navigating relationships directly across systems without manual joins or table hunting. In a traditional Gold layer this would require SQL joins across multiple Parquet tables.

    PREFIX : <https://moderndata101.substack.com/p/the-semantic-medallion#>
    PREFIX schema: <http://schema.org/>

    SELECT ?customer ?name ?billingIssue ?contract
    FROM <https://linkeddata.uriburner.com/DAV/demos/daas/the-semantic-medallion.ttl>
    WHERE {
      ?customer a schema:Person ;
                schema:name ?name ;
                :hasBillingIssue ?billingIssue ;
                :hasActiveContract ?contract .
      ?billingIssue schema:status "Open" .
      ?contract schema:status "Active" .
    }

Returns: Acme Corporation (C-001) and Beta Industries (C-002), both of which match the pattern.

Inspect the Semantic Medallion ontology by listing all defined classes with their labels and descriptions. Run this against the main KG Turtle file to enumerate the full class hierarchy.

    PREFIX : <https://moderndata101.substack.com/p/the-semantic-medallion#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?class ?label ?comment
    FROM <https://linkeddata.uriburner.com/DAV/demos/daas/the-semantic-medallion.ttl>
    WHERE {
      ?class a rdfs:Class ;
             rdfs:label ?label ;
             rdfs:comment ?comment .
      FILTER(STRSTARTS(STR(?class), "https://moderndata101.substack.com/p/the-semantic-medallion#"))
    }
    ORDER BY ?label

Returns 9 custom classes: BronzeLayer, DataCatalogConnector, DataManagementIndustry, GoldLayer, Industry, KnowledgeGraphConsultingIndustry, MedallionLayer, SemanticEnrichmentProcess, SilverLayer.

📋 How to Build a Semantic Medallion Data Catalog

1. Ingest Raw Data into Bronze

Set up orchestration tools to ingest raw data from all source systems. Preserve original formats without transformation. Link datasets to sources via DCAT.

2. Clean Data and Mint IRIs in Silver

Clean and type the data. Critically, replace database auto-increment IDs with stable, globally unique IRIs for every entity. Define a consistent IRI naming scheme using your organisation's domain namespace.

3. Design the Shared Ontology

Map business concepts to RDF classes and properties. Start with existing data models. Use standard vocabularies (DCAT, PROV-O, schema.org) where possible. Iterate and refine as understanding deepens.

4. Define OTTR Templates

Create OTTR templates in stOTTR syntax that map DataFrame columns to ontology properties. Each template defines how a row in your DataFrame becomes a set of RDF triples.

5. Transform to RDF with maplib

Use the maplib Python library to apply OTTR templates to Silver DataFrames. This transformation produces RDF triples in as few as four lines of Python.

6. Publish to Gold as a Knowledge Graph

Write resulting RDF to the Gold layer — as Turtle files on a data lake, or directly to a triplestore with a SPARQL endpoint. Register in a DCAT Catalog with a SPARQL endpoint distribution.

7. Query and Validate

Use SPARQL to validate: run Customer 360 queries to verify entity resolution, lineage queries to trace provenance, cross-system queries to confirm relationships. Iterate on ontology and templates based on results.
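
As one possible way to script such a validation check, the sketch below builds a SPARQL Protocol GET URL for a query that counts owl:sameAs links on customer C-001. The endpoint path https://linkeddata.uriburner.com/sparql is an assumption based on the host used in the queries above, and the helper name sparql_get_url is hypothetical. The code only constructs the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumed endpoint location; adjust to wherever your Gold graph is served.
ENDPOINT = "https://linkeddata.uriburner.com/sparql"

# Count the owl:sameAs links attached to customer C-001.
VALIDATION_QUERY = """
PREFIX : <https://moderndata101.substack.com/p/the-semantic-medallion#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (COUNT(?other) AS ?links)
FROM <https://linkeddata.uriburner.com/DAV/demos/daas/the-semantic-medallion.ttl>
WHERE { :C-001 owl:sameAs ?other . }
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL 1.1 Protocol GET request URL."""
    # The protocol carries the query text in the 'query' parameter.
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, VALIDATION_QUERY)
print(url)
```

When the URL is fetched, the result format is normally chosen with an Accept header such as application/sparql-results+json. A count of zero for a customer known to exist in several systems would suggest that entity resolution did not produce the expected links.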

🛠️ Technologies & Standards

Technology | Role | Standard
maplib | DataFrame-to-RDF transformation library | Python / PyPI
OTTR / stOTTR | Declarative templates mapping columns to ontology properties | Reasonable Ontology Templates
DCAT | Data Catalog Vocabulary: catalog, datasets, distributions, services | W3C Recommendation
PROV-O | Provenance ontology: data lineage as queryable triples | W3C Recommendation
SPARQL 1.1 | Query language for RDF knowledge graphs | W3C Recommendation
RDF | Resource Description Framework: subject-predicate-object triples | W3C Recommendation

FAQ

What is the Semantic Medallion?
The Semantic Medallion is an extension of the traditional Bronze/Silver/Gold data lakehouse architecture where the Gold layer is transformed into a connected knowledge graph rather than isolated clean tables. Relationships are embedded in the data itself, not in external join logic, enabling entity resolution, cross-system queries, and semantic search.

What happens in the Silver layer?
In the Silver layer, beyond cleaning and typing data, stable IRIs are minted for every entity. These globally unique identifiers replace database auto-increment IDs and GUIDs, creating join keys that work across all systems rather than being scoped to a single database.

What happens in the Gold layer?
The Gold layer maps Silver DataFrames to a shared ontology using OTTR templates and the maplib Python library, then publishes the result as RDF. Rather than creating separate Parquet/Delta tables requiring manual join logic, the data becomes a connected knowledge graph where relationships are stored as triples.

Can the Silver-to-Gold transformation really be done in four lines of Python?
Yes. The article demonstrates the complete Silver-to-Gold transformation in four lines: (1) load the DataFrame, (2) define an OTTR template mapping columns to ontology properties, (3) apply the template using maplib, and (4) publish the resulting RDF to the Gold layer. The power comes from declarative templates rather than imperative code.

How does entity resolution work?
Entity resolution uses owl:sameAs assertions to link entities appearing in CRM, billing, and external registries. A customer in both the sales system and the billing system is linked via an owl:sameAs triple, creating a unified view. SPARQL queries then navigate all facts about that customer regardless of which source system contributed them.

How is data provenance tracked?
PROV-O expresses data provenance as RDF triples, recording which entities were derived from which sources, which activities generated them, and which agents were responsible. Lineage lives in the knowledge graph itself, queryable with the same SPARQL endpoint as the data.

What is the biggest implementation challenge?
The biggest challenge is not the RDF conversion itself, but establishing stable identifiers (IRIs) in the Silver layer. This requires upfront design of IRI naming conventions and careful governance to ensure identifiers remain consistent as new data sources are added.

How does the Semantic Gold layer differ from a traditional Gold layer?
A traditional Gold layer consists of separate Parquet or Delta tables requiring manual join logic in SQL or dbt scripts. The Semantic Gold layer is RDF where relationships are embedded in the data as triples, eliminating table hunting, manual JOINs, and fragile transformation pipelines.

📖 Glossary

Semantic Medallion: Architecture pattern extending Bronze/Silver/Gold with knowledge graph semantics, transforming the Gold layer into connected RDF.
IRI: Internationalized Resource Identifier, a globally unique, stable identifier replacing database auto-increment IDs. Works across all systems.
DCAT: W3C Data Catalog Vocabulary for describing data catalogs, datasets, distributions, and data services. Used by data.gov and the European Data Portal.
PROV-O: W3C Provenance Ontology for representing provenance information as queryable RDF triples within the knowledge graph.
OTTR: Reasonable Ontology Templates, a declarative approach (written in stOTTR syntax) for mapping structured data columns to ontology properties, giving concise DataFrame-to-RDF transformations.
maplib: Python library applying OTTR templates to DataFrames to produce RDF, enabling the complete Silver-to-Gold transformation in as few as four lines.
RDF: Resource Description Framework, the W3C standard for representing linked data as subject-predicate-object triples. Foundation of the Semantic Gold layer.
SPARQL: W3C query language for RDF knowledge graphs, used to navigate entity relationships, trace lineage, and execute cross-system queries in the Semantic Gold layer.
Knowledge Graph: A graph-structured data model where entities are connected by typed relationships, enabling navigation, reasoning, and semantic search across unified data.
Medallion Architecture: A data lakehouse design pattern with Bronze (raw ingestion), Silver (cleaned and typed with IRIs), and Gold (semantic RDF knowledge graph) layers.

🏭 Industry Context

NAICS 518210
Data Processing, Hosting & Related Services
Labor TAM: ~$200B
Automation Readiness: High
NAICS 541511
Custom Computer Programming Services
Labor TAM: ~$80B
Automation Readiness: High