Building a Knowledge Graph-Powered Data Catalog — how to transform raw data sources into a unified knowledge graph in four lines of Python.
The medallion layers are reframed as semantic enrichment stages: Bronze (no semantics), Silver (local semantics via identifiers), Gold (global semantics via shared vocabulary) — and the Platinum Layer extends this further into a full Semantic Web.
Extension of Bronze/Silver/Gold where the Gold layer becomes a connected knowledge graph. Relationships embedded in data, not in external join logic.
Creating stable, globally unique identifiers for every entity in the Silver layer — replacing database auto-increment IDs with cross-system identifiers.
Using owl:sameAs to link entities across CRM, billing, and external registries — a unified view of every record across all systems.
Lineage tracked via PROV-O as queryable RDF triples — no separate lineage tool needed.
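As a rough illustration of lineage stored as ordinary triples, the sketch below describes one pipeline run using PROV-O terms and then answers a lineage question by filtering those same triples. It is plain Python with tuples standing in for RDF triples. The `https://data.example.com/` namespace, the dataset and run IRIs, and the `lineage_triples` helper are assumptions made for the example, not part of the companion data.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://data.example.com/"  # hypothetical namespace, illustration only


def lineage_triples(output, inputs, activity):
    """Describe one pipeline run as PROV-O triples (subject, predicate, object)."""
    triples = [
        (activity, RDF_TYPE, PROV + "Activity"),
        (output, RDF_TYPE, PROV + "Entity"),
        (output, PROV + "wasGeneratedBy", activity),
    ]
    for source in inputs:
        triples.append((source, RDF_TYPE, PROV + "Entity"))
        triples.append((activity, PROV + "used", source))
        triples.append((output, PROV + "wasDerivedFrom", source))
    return triples


# Hypothetical Bronze and Silver dataset IRIs.
silver = EX + "dataset/silver/customers"
graph = lineage_triples(
    output=silver,
    inputs=[EX + "dataset/bronze/crm_export", EX + "dataset/bronze/billing_export"],
    activity=EX + "run/clean-customers-001",
)

# Lineage is now ordinary data: filter the triples to see what Silver was derived from.
sources = sorted(
    o for s, p, o in graph if s == silver and p == PROV + "wasDerivedFrom"
)
```

In a triplestore the same question would be a SPARQL pattern over `prov:wasDerivedFrom` rather than a list comprehension.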
Metadata and data unified in the same graph. The catalog is part of the knowledge graph — not a separate system pointing at data.
Map Silver DataFrames to RDF using OTTR templates and maplib. Complete Silver-to-Gold transformation in 4 lines of Python.
Using owl:sameAs to link entities in CRM, billing, and external registries. Query: "Show me everything we know about this record."
Ontology encodes subclass relationships — searching for ComplianceRelated automatically returns ComplianceOfficer, AuditRecord, and RegulatoryFiling subtypes.
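The subclass behaviour described above can be sketched in plain Python: collect every class that is a direct or indirect subclass of the searched class, then return instances typed with any of them. The class names come from the text; the instance identifiers and the `descendants` and `search` helpers are hypothetical. A reasoner or a SPARQL `rdfs:subClassOf*` property path would normally do this work.

```python
from collections import defaultdict

# (subclass, superclass) pairs, as the ontology would state them.
subclass_of = [
    ("ComplianceOfficer", "ComplianceRelated"),
    ("AuditRecord", "ComplianceRelated"),
    ("RegulatoryFiling", "ComplianceRelated"),
]

# Hypothetical instances and their asserted types.
instance_types = {
    "emp-17": "ComplianceOfficer",
    "rec-204": "AuditRecord",
    "doc-9": "RegulatoryFiling",
    "cust-1": "Customer",
}


def descendants(cls, pairs):
    """Return all direct and indirect subclasses of cls."""
    children = defaultdict(set)
    for child, parent in pairs:
        children[parent].add(child)
    found, stack = set(), [cls]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(cls, pairs, types):
    """Return instances typed as cls or as any of its subclasses."""
    wanted = {cls} | descendants(cls, pairs)
    return sorted(i for i, t in types.items() if t in wanted)


results = search("ComplianceRelated", subclass_of, instance_types)
```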
Trace graph relationships to find downstream dependencies when schemas change — every dataset, report, and application affected by a source modification.
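A minimal sketch of that impact analysis, assuming dependencies are available as (upstream, downstream) pairs: a breadth-first walk from the changed source lists everything that depends on it, nearest first. The asset names and the `impacted_by` helper are hypothetical.

```python
from collections import deque

# Hypothetical (upstream, downstream) dependency edges.
feeds = [
    ("crm.customers", "silver.customers"),
    ("silver.customers", "gold.customer_graph"),
    ("gold.customer_graph", "report.churn_dashboard"),
    ("gold.customer_graph", "app.support_portal"),
    ("billing.invoices", "silver.invoices"),
]


def impacted_by(source, edges):
    """Return every asset downstream of source, in breadth-first order."""
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)
    order, seen, queue = [], {source}, deque([source])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


affected = impacted_by("crm.customers", feeds)
```

Here a change to `crm.customers` reaches the Silver table, the Gold graph, and both consumers, while the billing branch is untouched.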
All queries below are executable against the companion ontology and instance data in semantic-medallion-ontology-instance-data.ttl.
C-001 (Acme Corporation) from every data source, unified under one identifier. Uses the DESCRIBE form to return all triples: name, email, phone, address, revenue, contracts, billing issues, account manager, and owl:sameAs links to CRM and billing representations.
prov:wasDerivedFrom and prov:hadPrimarySource chains with DCAT distributions.
Set up orchestration tools to ingest raw data from all source systems. Preserve original formats without transformation. Link datasets to sources via DCAT.
Clean and type the data. Critically, replace database auto-increment IDs with stable, globally unique IRIs for every entity. Define a consistent IRI naming scheme using your organisation's domain namespace.
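One possible IRI naming scheme, sketched under stated assumptions: the IRI is built from the entity type and a business key rather than a row number, so it stays the same across reloads and systems. The base `https://data.example.com/id/`, the lowercasing rule, and the `mint_iri` helper are all illustrative choices, not a prescribed convention.

```python
from urllib.parse import quote

BASE = "https://data.example.com/id/"  # hypothetical organisation namespace


def mint_iri(entity_type, business_key):
    """Build a stable IRI from an entity type and a business key (not a row ID)."""
    key = str(business_key).strip().lower()  # normalisation rule is an assumption
    if not key:
        raise ValueError("business key must not be empty")
    # Percent-encode both segments so keys containing '/' or spaces stay one path segment.
    return f"{BASE}{quote(entity_type.lower(), safe='')}/{quote(key, safe='')}"


iri = mint_iri("Customer", " C-001 ")
```

Because the function is deterministic, two pipelines that see the same business key mint the same IRI without coordinating.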
Map business concepts to RDF classes and properties. Start with existing data models. Use standard vocabularies (DCAT, PROV-O, schema.org) where possible. Iterate and refine as understanding deepens.
Create OTTR templates in stOTTR syntax that map DataFrame columns to ontology properties. Each template defines how a row in your DataFrame becomes a set of RDF triples.
Use the maplib Python library to apply OTTR templates to Silver DataFrames. This transformation produces RDF triples in as few as four lines of Python.
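To make the row-to-triples idea concrete, here is a plain-Python sketch of what a template expansion does conceptually. It is not maplib's API and not stOTTR syntax: the dictionary-based `CUSTOMER_TEMPLATE`, the `expand` function, the ontology namespace, and the column names are all hypothetical stand-ins.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://data.example.com/ontology/"  # hypothetical ontology namespace

# Hypothetical "template": one column supplies the subject IRI,
# the remaining columns map to ontology properties.
CUSTOMER_TEMPLATE = {
    "subject": "iri",
    "class": EX + "Customer",
    "properties": {"name": EX + "name", "email": EX + "email"},
}


def expand(template, rows):
    """Turn each row (a dict) into a type triple plus one triple per mapped column."""
    triples = []
    for row in rows:
        subject = row[template["subject"]]
        triples.append((subject, RDF_TYPE, template["class"]))
        for column, prop in template["properties"].items():
            value = row.get(column)
            if value is not None:  # missing values produce no triple
                triples.append((subject, prop, value))
    return triples


rows = [
    {
        "iri": "https://data.example.com/id/customer/c-001",
        "name": "Acme Corporation",
        "email": None,
    },
]
triples = expand(CUSTOMER_TEMPLATE, rows)
```

A real OTTR template adds typed parameters and nested templates on top of this basic pattern.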
Write resulting RDF to the Gold layer — as Turtle files on a data lake, or directly to a triplestore with a SPARQL endpoint. Register in a DCAT Catalog with a SPARQL endpoint distribution.
Use SPARQL to validate: run Customer 360 queries to verify entity resolution, lineage queries to trace provenance, cross-system queries to confirm relationships. Iterate on ontology and templates based on results.
| Technology | Role | Standard / Source |
|---|---|---|
| maplib | DataFrame-to-RDF transformation library | Python / PyPI |
| OTTR / stOTTR | Declarative templates mapping columns to ontology properties | Reasonable Ontology Templates |
| DCAT | Data Catalog Vocabulary — catalog, datasets, distributions, services | W3C Recommendation |
| PROV-O | Provenance ontology — data lineage as queryable triples | W3C Recommendation |
| SPARQL 1.1 | Query language for RDF knowledge graphs | W3C Recommendation |
| RDF | Resource Description Framework — subject-predicate-object triples | W3C Recommendation |
Use owl:sameAs assertions to link entities that appear in CRM, billing, and external registries. A customer present in both the sales system and the billing system is linked via an owl:sameAs triple, creating a unified view. SPARQL queries can then navigate all facts about that customer, regardless of which source system contributed them.
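The unified view can be sketched without a triplestore. Because owl:sameAs is symmetric and transitive, linked identifiers form clusters, and "everything we know" about one identifier is the set of facts attached to any member of its cluster. The sketch below uses a small union-find; the identifiers, facts, and helper names are hypothetical.

```python
def same_as_clusters(pairs):
    """Group identifiers into clusters connected by owl:sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


def everything_about(entity, pairs, facts):
    """Return (property, value) facts for entity and all identifiers linked to it."""
    cluster = next((c for c in same_as_clusters(pairs) if entity in c), {entity})
    return sorted((p, o) for s, p, o in facts if s in cluster)


# Hypothetical sameAs links and per-system facts.
same_as = [
    ("crm:cust/42", "billing:acct/9001"),
    ("billing:acct/9001", "registry:org/ACME"),
]
facts = [
    ("crm:cust/42", "name", "Acme Corporation"),
    ("billing:acct/9001", "openInvoices", "3"),
    ("registry:org/ACME", "country", "NO"),
    ("crm:cust/77", "name", "Other Ltd"),
]

view = everything_about("crm:cust/42", same_as, facts)
```

Starting from the CRM identifier, the view also includes the billing and registry facts, while the unrelated customer is left out.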