
Vadalog Semantic Grouping: Structured Predicate Taxonomy for Knowledge Graphs

How Barnyard normalizes inconsistent LLM-extracted predicates into an ontology of 30+ canonical predicates across 13 semantic groups, preventing knowledge graph fragmentation.

Dawson Bauer

Overview

Raw relation extraction from LLMs produces inconsistent predicate labels. The same relationship may be expressed as "works at", "is employed by", "is on staff at" — all meaning the same thing. Without normalization, the knowledge graph becomes fragmented and vector search misses related facts.

Barnyard's Vadalog-inspired semantic grouping system:

  1. Canonicalizes raw predicates to a fixed ontology via synonym matching and fuzzy scoring
  2. Groups canonical predicates into 13 named semantic categories (aligned with TACRED)
  3. Enriches RelationStar summaries with category labels so retrieval queries find semantically related relations even without exact predicate matching

Predicate Canonicalization Pipeline

Three-Tier Resolution

Each raw predicate runs through three matching tiers, with a fallback if all of them miss. First, an exact synonym lookup maps known phrasings straight to a canonical predicate. If that misses, a fuzzy match (using rapidfuzz, with a default threshold of 85) catches close variants: “took a controlling stake in” scores highly against “owns a stake in” and resolves to OWNS. Next, a bidirectional substring check handles partial overlaps, where either the raw phrase contains a known synonym or a known synonym contains the raw phrase. Finally, anything still unmatched falls back to an all-caps version of the raw phrase.

The fuzzy tier (tier 2.5) is the key innovation: it catches novel LLM phrasings that don't appear in the synonym list but are close in wording to known synonyms.
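
The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not Barnyard's actual code: the `SYNONYMS` table, the `canonicalize` function, and the underscore-joined fallback format are all assumptions made for the example. To keep the sketch runnable without third-party packages, the default scorer uses Python's standard-library `difflib` as a stand-in, so its scores will differ from rapidfuzz's. A rapidfuzz scorer such as `rapidfuzz.fuzz.WRatio` could be passed in through the `scorer` parameter instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional

# Hypothetical synonym table; the real list is not shown in the article.
SYNONYMS: Dict[str, str] = {
    "works at": "WORKS_AT",
    "is employed by": "WORKS_AT",
    "is on staff at": "WORKS_AT",
    "owns a stake in": "OWNS",
    "holds stake in": "SHAREHOLDER_OF",
}

Scorer = Callable[[str, str], float]


def stdlib_score(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (not rapidfuzz's WRatio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def canonicalize(raw: str, scorer: Scorer = stdlib_score,
                 threshold: float = 85.0) -> str:
    # Normalise case and whitespace before any lookup.
    phrase = " ".join(raw.lower().split())

    # Exact tier: direct synonym lookup.
    if phrase in SYNONYMS:
        return SYNONYMS[phrase]

    # Fuzzy tier: take the best-scoring synonym if it clears the threshold.
    best_score: float = 0.0
    best_canonical: Optional[str] = None
    for synonym, canonical in SYNONYMS.items():
        score = scorer(phrase, synonym)
        if score > best_score:
            best_score, best_canonical = score, canonical
    if best_canonical is not None and best_score >= threshold:
        return best_canonical

    # Substring tier: either string contains the other.
    if phrase:
        for synonym, canonical in SYNONYMS.items():
            if synonym in phrase or phrase in synonym:
                return canonical

    # Fallback: upper-case the raw phrase (joining with "_" is an assumption).
    return phrase.upper().replace(" ", "_")
```

With the stand-in scorer, a misspelling such as "is employd by" resolves through the fuzzy tier, while a longer phrase such as "currently works at the firm" scores too low for fuzzy matching and is caught by the substring check instead.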


Canonical Predicate Ontology

The ontology covers 30+ canonical predicates across four domains:

Person relations (TACRED per:*)

WORKS_AT, LED_BY, LEADS, MARRIED_TO, PARENT_OF, CHILD_OF, SIBLING_OF, EDUCATED_AT, BORN_IN, DIED_IN, NATIONALITY, CHARGED_WITH

Organisation relations (TACRED org:*)

OWNS, ACQUIRED, ACQUIRED_BY, EMPLOYS, SHAREHOLDER_OF, AFFILIATED_WITH, HEADQUARTERED_IN, PART_OF, PARTNERS_WITH, COMPETES_WITH

Creation & Investment

CREATED_BY, PRODUCES, FUNDED_BY, INVESTS_IN, SUCCEEDS, SUCCEEDED_BY

Causality & Dependency (ConceptNet/SemEval)

USES, CAUSED_BY, CAUSES, REGULATED_BY, LOCATED_IN, RELATED_TO


Semantic Groups (TACRED-Aligned Taxonomy)

Group         Predicates
EMPLOYMENT    WORKS_AT, EMPLOYS, LEADS, LED_BY
FAMILY        MARRIED_TO, PARENT_OF, CHILD_OF, SIBLING_OF
MEMBERSHIP    PART_OF, AFFILIATED_WITH
CONTROL       OWNS, ACQUIRED, ACQUIRED_BY, SHAREHOLDER_OF, SUCCEEDED_BY, SUCCEEDS
LOCATION      LOCATED_IN, HEADQUARTERED_IN, BORN_IN, DIED_IN, NATIONALITY
CREATION      CREATED_BY, PRODUCES
EDUCATION     EDUCATED_AT
LEGAL         REGULATED_BY, CHARGED_WITH
FUNDING       FUNDED_BY, INVESTS_IN
PARTNERSHIP   PARTNERS_WITH
COMPETITION   COMPETES_WITH
DEPENDENCY    USES, CAUSED_BY, CAUSES
RELATED       RELATED_TO (catch-all)
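
One plausible way to hold this taxonomy in code is a group-to-predicates mapping plus a derived reverse lookup. The names `SEMANTIC_GROUPS`, `PREDICATE_TO_GROUP`, and `group_of` are hypothetical, and sending unknown predicates to RELATED is an assumption based on that group being described as the catch-all.

```python
from typing import Dict, List

# Group -> canonical predicates, copied from the table above.
SEMANTIC_GROUPS: Dict[str, List[str]] = {
    "EMPLOYMENT": ["WORKS_AT", "EMPLOYS", "LEADS", "LED_BY"],
    "FAMILY": ["MARRIED_TO", "PARENT_OF", "CHILD_OF", "SIBLING_OF"],
    "MEMBERSHIP": ["PART_OF", "AFFILIATED_WITH"],
    "CONTROL": ["OWNS", "ACQUIRED", "ACQUIRED_BY", "SHAREHOLDER_OF",
                "SUCCEEDED_BY", "SUCCEEDS"],
    "LOCATION": ["LOCATED_IN", "HEADQUARTERED_IN", "BORN_IN", "DIED_IN",
                 "NATIONALITY"],
    "CREATION": ["CREATED_BY", "PRODUCES"],
    "EDUCATION": ["EDUCATED_AT"],
    "LEGAL": ["REGULATED_BY", "CHARGED_WITH"],
    "FUNDING": ["FUNDED_BY", "INVESTS_IN"],
    "PARTNERSHIP": ["PARTNERS_WITH"],
    "COMPETITION": ["COMPETES_WITH"],
    "DEPENDENCY": ["USES", "CAUSED_BY", "CAUSES"],
    "RELATED": ["RELATED_TO"],
}

# Reverse index: canonical predicate -> group.
PREDICATE_TO_GROUP: Dict[str, str] = {
    predicate: group
    for group, predicates in SEMANTIC_GROUPS.items()
    for predicate in predicates
}


def group_of(predicate: str) -> str:
    """Return the semantic group, defaulting to RELATED (an assumption)."""
    return PREDICATE_TO_GROUP.get(predicate, "RELATED")
```

Keeping the table as the single source and deriving the reverse index from it means each predicate is assigned to exactly one group, which matches the table as written.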

Impact on RelationStar Summaries

Without semantic grouping, Apple’s summary is a flat list: led by Tim Cook, acquired Beats and Shazam, located in Cupertino, produces iPhone and Mac.

With semantic grouping, those same facts are organised under category labels: CONTROL (acquired Beats and Shazam), EMPLOYMENT (led by Tim Cook), LOCATION (located in Cupertino), and CREATION (produces iPhone and Mac).

The category labels appear in the stored star_summary string, which is vectorized in Qdrant. A query like "who controls Apple?" is now more likely to retrieve the RelationStar, because "CONTROL" appears prominently in the text that was embedded.
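
As an illustration of how a grouped summary string might be assembled, here is a small sketch. The `build_star_summary` function, the `GROUP_OF` excerpt, and the exact string layout (pipe separators, alphabetical group order) are assumptions; the article does not show the real star_summary format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Small excerpt of the predicate -> group mapping, enough for the example.
GROUP_OF: Dict[str, str] = {
    "ACQUIRED": "CONTROL",
    "LED_BY": "EMPLOYMENT",
    "LOCATED_IN": "LOCATION",
    "PRODUCES": "CREATION",
}


def build_star_summary(entity: str,
                       relations: Iterable[Tuple[str, str]]) -> str:
    """Group (predicate, target) pairs under their category labels."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for predicate, target in relations:
        group = GROUP_OF.get(predicate, "RELATED")
        # Render LED_BY as "led by" so the summary reads naturally.
        verb = predicate.lower().replace("_", " ")
        grouped[group].append(f"{verb} {target}")
    # Sort by group name so the output is deterministic.
    parts = [f"{group}: {', '.join(facts)}"
             for group, facts in sorted(grouped.items())]
    return f"{entity} | " + " | ".join(parts)


summary = build_star_summary("Apple", [
    ("LED_BY", "Tim Cook"),
    ("ACQUIRED", "Beats"),
    ("ACQUIRED", "Shazam"),
    ("LOCATED_IN", "Cupertino"),
    ("PRODUCES", "iPhone"),
])
```

For the sample input, `summary` is `Apple | CONTROL: acquired Beats, acquired Shazam | CREATION: produces iPhone | EMPLOYMENT: led by Tim Cook | LOCATION: located in Cupertino`. Because the group labels are part of the string itself, they are carried into whatever text is sent to the embedding model.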


Cross-Document Entity Deduplication

Vadalog-style grouping also drives cross-document entity deduplication in memify_graph_task:

  1. Load all Entity nodes for the user's space
  2. Group by GLiNER label (e.g. ORGANIZATION)
  3. Within each label group, compute pairwise rapidfuzz.WRatio
  4. Entities with score ≥ threshold (default: 88) are merged: HAS_ENTITY and RELATES_TO edges are redirected to the canonical entity (longest name wins), and the alias is deleted

This prevents "Apple" and "Apple Inc." from appearing as separate nodes after two documents are ingested.
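
The grouping and merge-selection steps can be sketched as below. This only plans which aliases fold into which canonical names; redirecting HAS_ENTITY and RELATES_TO edges and deleting the alias node are left out because they depend on the graph store. The `plan_merges` function and its return shape are hypothetical. The default `stand_in_score` is a crude standard-library approximation, included only so the sketch runs without rapidfuzz; passing `rapidfuzz.fuzz.WRatio` as `scorer` would be closer to what the article describes.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Tuple

Scorer = Callable[[str, str], float]


def stand_in_score(a: str, b: str) -> float:
    """Crude 0-100 stand-in for a WRatio-style scorer (an assumption)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 100.0
    if a and b and (a.startswith(b) or b.startswith(a)):
        return 90.0  # treat a shared prefix as a strong partial match
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def plan_merges(entities: Iterable[Tuple[str, str]],
                scorer: Scorer = stand_in_score,
                threshold: float = 88.0) -> Dict[Tuple[str, str], str]:
    """Map (label, alias) -> canonical name for entities to be merged."""
    # Group entity names by label so only same-label pairs are compared.
    by_label: Dict[str, List[str]] = defaultdict(list)
    for name, label in entities:
        if name not in by_label[label]:
            by_label[label].append(name)

    merges: Dict[Tuple[str, str], str] = {}
    for label, names in by_label.items():
        parent = {name: name for name in names}

        def find(name: str) -> str:
            # Union-find root lookup with path halving.
            while parent[name] != name:
                parent[name] = parent[parent[name]]
                name = parent[name]
            return name

        # Pairwise comparison within the label group.
        for a, b in combinations(names, 2):
            if scorer(a, b) >= threshold:
                parent[find(a)] = find(b)

        clusters: Dict[str, List[str]] = defaultdict(list)
        for name in names:
            clusters[find(name)].append(name)
        for members in clusters.values():
            canonical = max(members, key=len)  # longest name wins
            for alias in members:
                if alias != canonical:
                    merges[(label, alias)] = canonical
    return merges
```

Using union-find here means that if A matches B and B matches C, all three end up in one cluster even when A and C score below the threshold. Whether the real task merges transitively like this is not stated in the article.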


Configuration

A single fuzzy-threshold setting controls how aggressive the matching is — a lower value merges more phrasings together, while a higher value only collapses near-identical ones.

At 85, "took a stake in" (WRatio ≈ 87 vs "holds stake in") maps to SHAREHOLDER_OF. At 95, only near-identical phrasings match.
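
The effect of the threshold can be seen with a small, self-contained check. The `score` and `matches` helpers are hypothetical, and the scorer is again Python's standard-library `difflib` rather than rapidfuzz, so the numbers are not comparable to the WRatio figures quoted above. The point is only that one and the same pair can pass at 85 and fail at 95.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (not rapidfuzz's WRatio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def matches(raw: str, synonym: str, threshold: float) -> bool:
    """True when the pair is similar enough to be collapsed."""
    return score(raw, synonym) >= threshold


raw, synonym = "is employed at", "is employed by"

# The two phrases share 12 of 14 characters in sequence, so the stand-in
# score lands between the two thresholds being compared.
lenient = matches(raw, synonym, threshold=85)  # accepted
strict = matches(raw, synonym, threshold=95)   # rejected
```

A lower threshold therefore trades precision for recall: more novel phrasings collapse onto known predicates, at the risk of merging phrases that only look alike.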