KG-SaF: Building Complete and Curated Datasets for Machine Learning and Reasoning on Knowledge Graphs
KG-SaF provides a workflow (KG-SaF-JDeX) and curated datasets (KG-SaF-Data) for knowledge graph refinement (KGR) research. The resource includes datasets with both schema (ontologies) and ground facts, making them ready for machine learning and reasoning services.
- 🗂️ Extracts datasets from RDF-based KGs with expressive schemas (RDFS/OWL2)
- 📦 Provides datasets in OWL and TSV formats, easily loadable in both PyTorch and Protégé
- ⚡ Handles inconsistencies and leverages reasoning to infer implicit knowledge
- 🤖 Provides ML-ready tensor representations compatible with PyTorch and PyKEEN
- 🧩 Offers schema decomposition into themed partitions (modularization of ontology components)
The table below lists the currently available ontologies and their corresponding datasets included in this resource.
Note: This table will be updated as new datasets and ontologies become available.
| Ontology | Datasets | DL Fragment |
|---|---|---|
| 📚 DBpedia | DBPEDIA25-50K-C, DBPEDIA25-100K-C | |
| 📚 YAGO3 | YAGO3-39K-C, YAGO3-10-C | |
| 📚 YAGO4 | YAGO4-20-C | |
| 📚 ArCo | ARCO25-20, ARCO25-10, ARCO25-5 | |
| 📚 WHOW | WHOW25-5 | |
| 📚 ApuliaTravel | ATRAVEL | |
All datasets are provided in a standardized format following the Description Logic (DL) formalization, which separates each dataset into an ABox (instance-level data), a TBox (schema-level information), and an RBox (roles and properties). For example, `Person(alice)` is an ABox assertion, `Student ⊑ Person` a TBox axiom, and `hasMother ⊑ hasParent` an RBox axiom.
📄 Files marked with this icon are new serializations or variations of the same data already available in OWL format (e.g., TSV or JSON representations), intended for easier use in ML pipelines.
📁 abox ......................................... # Assertional Box (instance-level data)
│ ├── 📁 splits ................................. # Train/test/validation splits
│ │ ├── 🦉 train.nt ............................. # Training triples (N-Triples)
│ │ ├── 🦉 valid.nt ............................. # Validation triples (N-Triples)
│ │ ├── 🦉 test.nt .............................. # Test triples (N-Triples)
│ │ ├── 📄 train.tsv ............................ # Training triples (TSV)
│ │ ├── 📄 valid.tsv ............................ # Validation triples (TSV)
│ │ └── 📄 test.tsv ............................. # Test triples (TSV)
│ │
│ ├── 🦉 individuals.owl ........................ # Individual definitions
│ ├── 🦉 class_assertions.owl ................... # Class assertions for individuals (OWL)
│ ├── 📄 class_assertions.json .................. # Class assertions for individuals (JSON)
│ │
│ ├── 🦉 obj_prop_assertions.nt ................. # Merged object property assertions (N-Triples)
│ └── 📄 obj_prop_assertions.tsv ................ # Merged object property assertions (TSV)
📁 rbox ......................................... # Role Box (relations and properties)
│ ├── 🦉 roles.owl .............................. # Role definitions
│ ├── 📄 roles_domain_range.json ................ # Domain and range of roles (JSON)
│ └── 📄 roles_hierarchy.json ................... # Role hierarchy (JSON)
📁 tbox ......................................... # Terminological Box (schema-level info)
│ ├── 🦉 classes.owl ............................ # Non-taxonomic class axioms
│ ├── 🦉 taxonomy.owl ........................... # Hierarchical taxonomy
│ └── 📄 taxonomy.json .......................... # Hierarchical taxonomy (JSON)
🦉 knowledge_graph.owl .......................... # Full merged TBox + RBox + ABox
🦉 ontology.owl ................................. # Core modularized schema
📁 mappings ..................................... # Mappings to IDs
│ ├── 🧾 class_to_id.json ....................... # Map ontology classes to IDs
│ ├── 🧾 individual_to_id.json .................. # Map entities/instances to IDs
│ └── 🧾 object_property_to_id.json ............. # Map object properties to IDs
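For example, the fully merged `knowledge_graph.owl` (available after unpacking, see below) can be opened in Protégé or inspected programmatically. Here is a minimal sketch using owlready2 (our choice of library, not one prescribed by the resource; the dataset path is a placeholder):

```python
from owlready2 import get_ontology

# Load the fully merged knowledge graph; the folder name is a placeholder.
onto = get_ontology("file://unpack/YAGO3-10-C/knowledge_graph.owl").load()

# Inspect the schema and instance level of the loaded ontology.
print(len(list(onto.classes())), "classes")
print(len(list(onto.object_properties())), "object properties")
print(len(list(onto.individuals())), "individuals")
```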
Before using the datasets, you must run the provided dataset unpacking notebook. This step is required because, due to storage limitations, some secondary files were removed from the distributed datasets. The notebook automates the following tasks:
- Unpacking all compressed datasets and ontologies into an `unpack` folder.
- Re-merging the object property assertion files for each dataset.
- Merging the full knowledge graph (TBox, RBox, and ABox) using a reasoner (ROBOT, the OBO tool).
- Converting N-Triples files to TSV format, making them ready for use with ML libraries such as PyKEEN (a minimal sketch of this step follows the list).
- Converting schema files to JSON (e.g., class assertions, taxonomy, role hierarchies) for easier loading and manipulation in Python.
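As an illustration of the N-Triples-to-TSV conversion, here is a minimal sketch using rdflib (not the notebook's actual code; paths are placeholders):

```python
from rdflib import Graph

def nt_to_tsv(nt_path: str, tsv_path: str) -> None:
    """Parse an N-Triples file and write head/relation/tail columns as TSV."""
    g = Graph()
    g.parse(nt_path, format="nt")
    with open(tsv_path, "w", encoding="utf-8") as out:
        for s, p, o in g:
            out.write(f"{s}\t{p}\t{o}\n")

nt_to_tsv("abox/splits/train.nt", "abox/splits/train.tsv")
```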
Open the notebook and run all cells sequentially. After execution, each dataset folder will contain:
- Fully merged knowledge graph (`knowledge_graph.owl`)
- Object property assertions (`obj_prop_assertions.nt` and `obj_prop_assertions.tsv`)
- Training, test, and validation splits in TSV format (`train.tsv`, `test.tsv`, `valid.tsv`)
- Taxonomy, roles, and class assertions in JSON format (`taxonomy.json`, `roles_domain_range.json`, `roles_hierarchy.json`, `class_assertions.json`)
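A quick way to verify that unpacking succeeded is to check that these files exist; a minimal sketch, with a hypothetical dataset folder name:

```python
from pathlib import Path

dataset = Path("unpack/YAGO3-10-C")  # hypothetical dataset folder name
expected = [
    "knowledge_graph.owl",
    "abox/obj_prop_assertions.tsv",
    "abox/splits/train.tsv",
    "abox/splits/test.tsv",
    "abox/splits/valid.tsv",
    "tbox/taxonomy.json",
]
# Report any expected file that the notebook failed to produce.
missing = [p for p in expected if not (dataset / p).exists()]
print("missing files:", missing or "none")
```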
In the tutorial folder, we provide example notebooks demonstrating how to use KG-SaF datasets and tools.
- **Loading a PyTorch dataset using the custom `KnowledgeGraph` class**
  - File: `tutorial/dataset_loader.ipynb`
  - Description: Shows how to load a dataset from KG-SaF into PyTorch tensors using the `KnowledgeGraph` class, including train/test/validation splits and schema-aware representations (see the sketch below).
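The notebook relies on the repository's `KnowledgeGraph` class, whose API we do not reproduce here. As a minimal sketch of the underlying idea, assuming the TSV splits and ID mappings described above (paths are placeholders), the splits can be indexed into `(n, 3)` tensors like this:

```python
import json
from pathlib import Path

import torch

dataset = Path("unpack/YAGO3-10-C")  # hypothetical dataset folder

# Load the IRI -> integer ID mappings shipped with each dataset.
with open(dataset / "mappings" / "individual_to_id.json") as f:
    ent2id = json.load(f)
with open(dataset / "mappings" / "object_property_to_id.json") as f:
    rel2id = json.load(f)

def load_split(name: str) -> torch.Tensor:
    """Read a head/relation/tail TSV split and index it as a (n, 3) LongTensor."""
    rows = []
    with open(dataset / "abox" / "splits" / f"{name}.tsv") as f:
        for line in f:
            h, r, t = line.rstrip("\n").split("\t")
            rows.append((ent2id[h], rel2id[r], ent2id[t]))
    return torch.tensor(rows, dtype=torch.long)

train = load_split("train")
valid = load_split("valid")
test = load_split("test")
print(train.shape)  # e.g. torch.Size([n_train, 3])
```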
- **Proof of concept: Using PyKEEN for machine learning on KG-SaF datasets**
  - File: `tutorial/kge_pykeen.ipynb`
  - Description: Demonstrates a basic pipeline for training a Knowledge Graph Embedding (KGE) model using PyKEEN on one of the KG-SaF datasets, including evaluation (see the sketch below).
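The core of such a pipeline, shown here as a minimal sketch with placeholder paths and illustrative hyperparameters (not the notebook's exact code):

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Build triples factories from the TSV splits; validation and test reuse the
# training ID mappings so that entity/relation indices stay consistent.
training = TriplesFactory.from_path("unpack/YAGO3-10-C/abox/splits/train.tsv")
validation = TriplesFactory.from_path(
    "unpack/YAGO3-10-C/abox/splits/valid.tsv",
    entity_to_id=training.entity_to_id,
    relation_to_id=training.relation_to_id,
)
testing = TriplesFactory.from_path(
    "unpack/YAGO3-10-C/abox/splits/test.tsv",
    entity_to_id=training.entity_to_id,
    relation_to_id=training.relation_to_id,
)

# Train a simple TransE model and evaluate it with rank-based metrics.
result = pipeline(
    training=training,
    validation=validation,
    testing=testing,
    model="TransE",
    training_kwargs=dict(num_epochs=50),
    random_seed=42,
)
print(result.get_metric("hits@10"))
```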