Build a probabilistic schema for a subset of a knowledge graph. The subset is defined by a SPARQL graph pattern, and the schema characterises the properties and values of the matching entities: occurrence distributions, value type breakdowns, numeric statistics, and top frequent values. It can be used on local RDF files (CLI only) and on SPARQL endpoints.
Disclaimer: This tool relies on heavy SPARQL queries and was mostly tested on local SPARQL endpoints (i.e. endpoints of locally deployed triplestores). It will take a long time on files, and very likely also on remote SPARQL endpoints, where it might not even complete. Since online SPARQL endpoints often have availability issues, we discourage using the tool with those unless you are sure the endpoint you want to inspect can handle the load.
```
pip install rdflib pandas SPARQLWrapper flask
```

```
python build_pschema.py <source> "<pattern>" [--hops N] <output_file>
```
| Argument | Description |
|---|---|
| `source` | Path to a local RDF file or URL of a SPARQL endpoint |
| `pattern` | SPARQL graph pattern using `?x` as the main variable |
| `--hops N` | Number of hops to explore (default: 2) |
| `output_file` | Path for the output JSON file |
Local RDF file, entities of type schema:Person:

```
python build_pschema.py data.ttl "?x a <http://schema.org/Person>" --hops 2 schema.json
```

Remote SPARQL endpoint, Wikidata humans:

```
python build_pschema.py https://query.wikidata.org/sparql \
    "?x wdt:P31 wd:Q5" --hops 1 schema.json
```

The output is a JSON file produced by `summarizeSchema`. For each property and inverse property of the matching entities it includes:
- `occurrences` — frequency distribution of the number of values per entity (e.g. `{"0": 0.05, "1": 0.88, "2": 0.07}`), or descriptive statistics (avg, std, median, min, max) when cardinality is highly variable
- `types` — frequency breakdown of value types (RDF class or XSD datatype)
- `values` — one of:
  - `{"type": "numeric", "avg": …, "std": …, "median": …, "min": …, "max": …}`
  - `{"type": "categorical", "top10": {"value": frequency, …}}`
  - `{"type": "high_cardinality"}` — no grouping possible (most values are near-unique)
- `subschema` — recursive schema for the neighbouring entities (up to `--hops` levels)
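The output can be consumed with the standard `json` module. A minimal sketch (the property IRIs and all numbers below are invented for illustration; only the field names follow the structure described above):

```python
import json

# Hypothetical excerpt of a produced schema file (all values invented).
schema_json = """
{
  "http://schema.org/birthDate": {
    "occurrences": {"0": 0.12, "1": 0.88},
    "types": {"http://www.w3.org/2001/XMLSchema#date": 1.0},
    "values": {"type": "high_cardinality"}
  },
  "http://schema.org/gender": {
    "occurrences": {"0": 0.05, "1": 0.95},
    "types": {"http://www.w3.org/2001/XMLSchema#string": 1.0},
    "values": {"type": "categorical", "top10": {"male": 0.52, "female": 0.48}}
  }
}
"""

schema = json.loads(schema_json)

# Properties that almost every matching entity has exactly once.
frequent = [
    prop
    for prop, info in schema.items()
    if isinstance(info["occurrences"], dict)
    and info["occurrences"].get("1", 0) >= 0.85
]
print(frequent)
```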
Start the server:

```
python app.py
```

Then open http://localhost:5000 in a browser.
Step 1 — Connect to a SPARQL endpoint by entering its URL and clicking Connect. The property list is fetched automatically. `rdf:type` is always included.
Step 2 — Define a pattern by selecting a property and a value. Both fields support free-text entry with autocomplete. The pattern used is `?x <property> <value>` (or the appropriate literal form).
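For a rough idea of how such a pattern is assembled (a hypothetical helper, not the app's actual code; the app may serialize values differently):

```python
import json


def make_pattern(prop_iri: str, value: str, is_iri: bool) -> str:
    """Build a ?x <property> <value> graph pattern.

    Hypothetical sketch: IRI values are wrapped in angle brackets,
    string literals are double-quoted.
    """
    obj = f"<{value}>" if is_iri else json.dumps(value)
    return f"?x <{prop_iri}> {obj}"


print(make_pattern("http://schema.org/name", "Alice", False))
# → ?x <http://schema.org/name> "Alice"
```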
Step 3 — Build Schema runs the analysis (2 hops) and displays the summarized schema. Two views are available:
- JSON — raw schema as formatted JSON, with an Export JSON button.
- Diagram — UML-inspired graph where nodes represent entity sets and edges represent properties (solid arrows for outgoing, reverse arrows for incoming). Each node shows entity count, type distribution, and value statistics. Each edge shows the property name and occurrence distribution. Supports drag to pan and Ctrl+scroll to zoom. An Export SVG button downloads the diagram for use in documents.
Schema for entities of type `Dataset` in the DSKG knowledge graph.
JSON view
Diagram view

