mdaquin/KGProbSchema

KG Probabilistic Schema

Build a probabilistic schema for a subset of a knowledge graph. The subset is defined by a SPARQL graph pattern, and the schema characterises the properties and values of the matching entities: occurrence distributions, value-type breakdowns, numeric statistics, and top frequent values. It can be used on local RDF files (CLI only) and on SPARQL endpoints.

Disclaimer: This tool issues heavy SPARQL queries and was mostly tested against local SPARQL endpoints (i.e. endpoints of locally deployed triplestores). It will be very slow on files and, very likely, on remote SPARQL endpoints, where it might not complete at all. Since public SPARQL endpoints often have availability constraints, we discourage using it against them unless you are sure the endpoint you want to inspect can handle the load.

Installation

pip install rdflib pandas SPARQLWrapper flask

CLI

python build_pschema.py <source> "<pattern>" [--hops N] <output_file>
Argument        Description
<source>        Path to a local RDF file or URL of a SPARQL endpoint
"<pattern>"     SPARQL graph pattern using ?x as the main variable
--hops N        Number of hops to explore (default: 2)
<output_file>   Path for the output JSON file

Examples

Local RDF file, entities of type schema:Person:

python build_pschema.py data.ttl "?x a <http://schema.org/Person>" --hops 2 schema.json

Remote SPARQL endpoint, Wikidata humans:

python build_pschema.py https://query.wikidata.org/sparql \
    "?x wdt:P31 wd:Q5" --hops 1 schema.json
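
Given the disclaimer above, it can be worth checking how many entities a pattern matches before launching the full (expensive) analysis. The helper below is a hypothetical sketch, not part of the tool; it simply wraps the same graph pattern in a COUNT query that you could run against the endpoint first.

```python
# Hypothetical helper (not part of build_pschema.py): wrap a graph
# pattern in a COUNT query to gauge how many entities ?x matches.
def count_query(pattern: str) -> str:
    return f"SELECT (COUNT(DISTINCT ?x) AS ?n) WHERE {{ {pattern} }}"

q = count_query("?x wdt:P31 wd:Q5")
print(q)

# To execute it (network required), something along these lines:
# from SPARQLWrapper import SPARQLWrapper, JSON
# sw = SPARQLWrapper("https://query.wikidata.org/sparql")
# sw.setQuery(q)
# sw.setReturnFormat(JSON)
# n = sw.query().convert()["results"]["bindings"][0]["n"]["value"]
```

On endpoints such as Wikidata, prefixes like wdt: and wd: are predefined server-side; on other endpoints you may need to add PREFIX declarations to the pattern.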

Output format

The output is a JSON file produced by summarizeSchema. For each property and inverse property of the matching entities, it includes:

  • occurrences — frequency distribution of the number of values per entity (e.g. {"0": 0.05, "1": 0.88, "2": 0.07}), or descriptive statistics (avg, std, median, min, max) when cardinality is highly variable
  • types — frequency breakdown of value types (RDF class or XSD datatype)
  • values — one of:
    • {"type": "numeric", "avg": …, "std": …, "median": …, "min": …, "max": …}
    • {"type": "categorical", "top10": {"value": frequency, …}}
    • {"type": "high_cardinality"} — no grouping possible (most values are near-unique)
  • subschema — recursive schema for the neighbouring entities (up to --hops levels)
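
As an illustration of how this structure can be consumed, the sketch below walks a schema fragment and derives, per property, the fraction of entities that have at least one value. The fragment's keys and layout are assumptions based on the description above, not the tool's verbatim output.

```python
# Illustrative schema fragment following the structure described above;
# exact keys are assumptions based on this README, not verbatim output.
schema = {
    "http://schema.org/name": {
        "occurrences": {"0": 0.05, "1": 0.88, "2": 0.07},
        "types": {"xsd:string": 1.0},
        "values": {"type": "categorical", "top10": {"Alice": 0.02}},
    }
}

for prop, info in schema.items():
    # Entities with zero values have key "0"; everything else has >= 1.
    coverage = 1.0 - info["occurrences"].get("0", 0.0)
    print(f"{prop}: coverage={coverage:.2f}, values={info['values']['type']}")
```

In a real run you would load the file first, e.g. `schema = json.load(open("schema.json"))`, and recurse into each property's subschema as needed.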

Web interface

Start the server:

python app.py

Then open http://localhost:5000 in a browser.

Usage

Step 1 — Connect to a SPARQL endpoint by entering its URL and clicking Connect. The property list is fetched automatically. rdf:type is always included.

Step 2 — Define a pattern by selecting a property and a value. Both fields support free-text entry with autocomplete. The pattern used is ?x <property> <value> (or the appropriate literal form).
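
The mapping from the two Step 2 fields to a graph pattern can be sketched as follows. This is a hypothetical reconstruction of the behaviour described above (IRI values wrapped in angle brackets, other values treated as quoted literals); the app's actual logic may differ.

```python
import json

# Hypothetical sketch: turn a selected property and value into the
# "?x <property> <value>" pattern described above.
def make_pattern(prop: str, value: str) -> str:
    if value.startswith("http://") or value.startswith("https://"):
        obj = f"<{value}>"       # IRI value
    else:
        obj = json.dumps(value)  # plain literal, quoted and escaped
    return f"?x <{prop}> {obj}"

print(make_pattern("http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                   "http://schema.org/Person"))
# ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person>
```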

Step 3 — Build Schema runs the analysis (2 hops) and displays the summarized schema. Two views are available:

  • JSON — raw schema as formatted JSON, with an Export JSON button.
  • Diagram — UML-inspired graph where nodes represent entity sets and edges represent properties (solid arrows for outgoing, reverse arrows for incoming). Each node shows entity count, type distribution, and value statistics. Each edge shows the property name and occurrence distribution. Supports drag to pan and Ctrl+scroll to zoom. An Export SVG button downloads the diagram for use in documents.

Screenshots

Schema for entities of type Dataset in the DSKG knowledge graph.

JSON view

Diagram view
