KG-SaF: Building Complete and Curated Datasets for Machine Learning and Reasoning on Knowledge Graphs
KG-SaF provides a workflow (KG-SaF-JDeX) and curated datasets (KG-SaF-Data) for knowledge graph refinement (KGR) research. The resource includes datasets with both schema (ontologies) and ground facts, making them ready for machine learning and reasoning services.
- 🗂️ Extracts datasets from RDF-based KGs with expressive schemas (RDFS/OWL2)
- 📦 Provides datasets in OWL and TSV formats, easily loadable in both PyTorch and Protégé
- ⚡ Handles inconsistencies and leverages reasoning to infer implicit knowledge
- 🤖 Provides ML-ready tensor representations compatible with PyTorch and PyKEEN
- 🧩 Offers schema decomposition into themed partitions (modularization of ontology components)
The table below lists the currently available ontologies and their corresponding datasets included in this resource.
Note: This table will be updated as new datasets and ontologies become available.
| Ontology | Datasets | DL Fragment |
|---|---|---|
| 📚 DBpedia | DBPEDIA25-50K-C, DBPEDIA25-100K-C | |
| 📚 YAGO3 | YAGO3-39K-C, YAGO3-10-C | |
| 📚 YAGO4 | YAGO4-20-C | |
| 📚 ArCo | ARCO25-20, ARCO25-10, ARCO25-5 | |
| 📚 WHOW | WHOW25-5 | |
| 📚 ApuliaTravel | ATRAVEL | |
All datasets are provided in a standardized format following the Description Logic (DL) formalization, which separates each dataset into an ABox (instance-level data), a TBox (schema-level information), and an RBox (roles and properties). For example, `Person(alice)` is an ABox assertion, `Student ⊑ Person` a TBox axiom, and `hasMother ⊑ hasParent` an RBox axiom.
📄 Files marked with this icon are new serializations or variations of the same data already available in OWL format (e.g., TSV or JSON representations), intended for easier use in ML pipelines.
📁 abox ......................................... # Assertional Box (instance-level data)
│ ├── 📁 splits ................................. # Train/test/validation splits
│ │ ├── 🦉 train.nt ............................. # Training triples (N-Triples)
│ │ ├── 🦉 valid.nt ............................. # Validation triples (N-Triples)
│ │ ├── 🦉 test.nt .............................. # Test triples (N-Triples)
│ │ ├── 📄 train.tsv ............................ # Training triples (TSV)
│ │ ├── 📄 valid.tsv ............................ # Validation triples (TSV)
│ │ └── 📄 test.tsv ............................. # Test triples (TSV)
│ │
│ ├── 🦉 individuals.owl ........................ # Individual definitions
│ ├── 🦉 class_assertions.owl ................... # Class assertions for individuals (OWL)
│ ├── 📄 class_assertions.json .................. # Class assertions for individuals (JSON)
│ │
│ ├── 🦉 obj_prop_assertions.nt ................. # Merged object property assertions (N-Triples)
│ └── 📄 obj_prop_assertions.tsv ................ # Merged object property assertions (TSV)
📁 rbox ......................................... # Role Box (relations and properties)
│ ├── 🦉 roles.owl .............................. # Role definitions
│ ├── 📄 roles_domain_range.json ................ # Domain and range of roles (JSON)
│ └── 📄 roles_hierarchy.json ................... # Role hierarchy (JSON)
📁 tbox ......................................... # Terminological Box (schema-level info)
│ ├── 🦉 classes.owl ............................ # Non-taxonomic class axioms
│ ├── 🦉 taxonomy.owl ........................... # Hierarchical taxonomy
│ └── 📄 taxonomy.json .......................... # Hierarchical taxonomy (JSON)
🦉 knowledge_graph.owl .......................... # Full merged TBox + RBox + ABox
🦉 ontology.owl ................................. # Core modularized schema
📁 mappings ..................................... # Mappings to IDs
│ ├── 🧾 class_to_id.json ....................... # Map ontology classes to IDs
│ ├── 🧾 individual_to_id.json .................. # Map entities/instances to IDs
│ └── 🧾 object_property_to_id.json ............. # Map object properties to IDs
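For example, the fully merged `knowledge_graph.owl` (available after unpacking, see below) can be opened in Protégé or inspected programmatically. Here is a minimal sketch using owlready2 (our choice of library, not one prescribed by the resource; the dataset path is a placeholder):

```python
from owlready2 import get_ontology

# Load the fully merged knowledge graph; the folder name is a placeholder.
onto = get_ontology("file://unpack/YAGO3-10-C/knowledge_graph.owl").load()

# Inspect the schema and instance level of the loaded ontology.
print(len(list(onto.classes())), "classes")
print(len(list(onto.object_properties())), "object properties")
print(len(list(onto.individuals())), "individuals")
```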
Before using the datasets, you must run the provided dataset unpacking notebook. This step is required because, due to storage limitations, some secondary files were removed from the distributed datasets. The notebook automates the following tasks:
- Unpacking all compressed datasets and ontologies into an `unpack` folder.
- Re-merging the object property assertion files for each dataset.
- Merging the full knowledge graph (TBox, RBox, and ABox) using a reasoner (ROBOT, the OBO tool).
- Converting N-Triples files to TSV format, making them ready for use with ML libraries such as PyKEEN (a minimal sketch of this step follows the list).
- Converting schema files to JSON (e.g., class assertions, taxonomy, role hierarchies) for easier loading and manipulation in Python.
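As an illustration of the N-Triples-to-TSV conversion, here is a minimal sketch using rdflib (not the notebook's actual code; paths are placeholders):

```python
from rdflib import Graph

def nt_to_tsv(nt_path: str, tsv_path: str) -> None:
    """Parse an N-Triples file and write head/relation/tail columns as TSV."""
    g = Graph()
    g.parse(nt_path, format="nt")
    with open(tsv_path, "w", encoding="utf-8") as out:
        for s, p, o in g:
            out.write(f"{s}\t{p}\t{o}\n")

nt_to_tsv("abox/splits/train.nt", "abox/splits/train.tsv")
```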
Open the notebook and run all cells sequentially. After execution, each dataset folder will contain:
- Fully merged knowledge graph (`knowledge_graph.owl`)
- Object property assertions (`obj_prop_assertions.nt` and `obj_prop_assertions.tsv`)
- Training, test, and validation splits in TSV format (`train.tsv`, `test.tsv`, `valid.tsv`)
- Taxonomy, roles, and class assertions in JSON format (`taxonomy.json`, `roles_domain_range.json`, `roles_hierarchy.json`, `class_assertions.json`)
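A quick way to verify that unpacking succeeded is to check that these files exist; a minimal sketch, with a hypothetical dataset folder name:

```python
from pathlib import Path

dataset = Path("unpack/YAGO3-10-C")  # hypothetical dataset folder name
expected = [
    "knowledge_graph.owl",
    "abox/obj_prop_assertions.tsv",
    "abox/splits/train.tsv",
    "abox/splits/test.tsv",
    "abox/splits/valid.tsv",
    "tbox/taxonomy.json",
]
# Report any expected file that the notebook failed to produce.
missing = [p for p in expected if not (dataset / p).exists()]
print("missing files:", missing or "none")
```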
In the tutorial folder, we provide example notebooks demonstrating how to use KG-SaF datasets and tools.
- **Loading a PyTorch dataset using the custom `KnowledgeGraph` class**
  - File: `tutorial/dataset_loader.ipynb`
  - Description: Shows how to load a dataset from KG-SaF into PyTorch tensors using the `KnowledgeGraph` class, including train/test/validation splits and schema-aware representations (see the sketch below).
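The notebook relies on the repository's `KnowledgeGraph` class, whose API we do not reproduce here. As a minimal sketch of the underlying idea, assuming the TSV splits and ID mappings described above (paths are placeholders), the splits can be indexed into `(n, 3)` tensors like this:

```python
import json
from pathlib import Path

import torch

dataset = Path("unpack/YAGO3-10-C")  # hypothetical dataset folder

# Load the IRI -> integer ID mappings shipped with each dataset.
with open(dataset / "mappings" / "individual_to_id.json") as f:
    ent2id = json.load(f)
with open(dataset / "mappings" / "object_property_to_id.json") as f:
    rel2id = json.load(f)

def load_split(name: str) -> torch.Tensor:
    """Read a head/relation/tail TSV split and index it as a (n, 3) LongTensor."""
    rows = []
    with open(dataset / "abox" / "splits" / f"{name}.tsv") as f:
        for line in f:
            h, r, t = line.rstrip("\n").split("\t")
            rows.append((ent2id[h], rel2id[r], ent2id[t]))
    return torch.tensor(rows, dtype=torch.long)

train = load_split("train")
valid = load_split("valid")
test = load_split("test")
print(train.shape)  # e.g. torch.Size([n_train, 3])
```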
- **Proof of concept: Using PyKEEN for machine learning on KG-SaF datasets**
  - File: `tutorial/kge_pykeen.ipynb`
  - Description: Demonstrates a basic pipeline for training a Knowledge Graph Embedding (KGE) model using PyKEEN on one of the KG-SaF datasets, including evaluation (see the sketch below).
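The core of such a pipeline, shown here as a minimal sketch with placeholder paths and illustrative hyperparameters (not the notebook's exact code):

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Build triples factories from the TSV splits; validation and test reuse the
# training ID mappings so that entity/relation indices stay consistent.
training = TriplesFactory.from_path("unpack/YAGO3-10-C/abox/splits/train.tsv")
validation = TriplesFactory.from_path(
    "unpack/YAGO3-10-C/abox/splits/valid.tsv",
    entity_to_id=training.entity_to_id,
    relation_to_id=training.relation_to_id,
)
testing = TriplesFactory.from_path(
    "unpack/YAGO3-10-C/abox/splits/test.tsv",
    entity_to_id=training.entity_to_id,
    relation_to_id=training.relation_to_id,
)

# Train a simple TransE model and evaluate it with rank-based metrics.
result = pipeline(
    training=training,
    validation=validation,
    testing=testing,
    model="TransE",
    training_kwargs=dict(num_epochs=50),
    random_seed=42,
)
print(result.get_metric("hits@10"))
```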