Vadalog Semantic Grouping: Structured Predicate Taxonomy for Knowledge Graphs
How Barnyard normalizes inconsistent LLM-extracted predicates into an ontology of 30+ canonical predicates across 13 semantic groups, preventing knowledge graph fragmentation.
Overview
Raw relation extraction from LLMs produces inconsistent predicate labels. The same relationship may be expressed as "works at", "is employed by", "is on staff at" — all meaning the same thing. Without normalization, the knowledge graph becomes fragmented and vector search misses related facts.
Barnyard's Vadalog-inspired semantic grouping system:
- Canonicalizes raw predicates to a fixed ontology via synonym matching and fuzzy scoring
- Groups canonical predicates into 13 named semantic categories (aligned with TACRED)
- Enriches RelationStar summaries with category labels so retrieval queries find semantically related relations even without exact predicate matching
Predicate Canonicalization Pipeline
Three-Tier Resolution
Each raw predicate runs through a three-tier resolution with an additional fuzzy step (tier 2.5), stopping at the first step that produces a match. First, an exact synonym lookup maps known phrasings straight to a canonical predicate. If that misses, a fuzzy match (using rapidfuzz, with a default threshold of 85) catches close variants: "took a controlling stake in" scores highly against "owns a stake in" and resolves to OWNS. Next, a bidirectional substring check handles partial overlaps, where either the raw phrase contains a known synonym or a known synonym contains the raw phrase. Finally, anything still unmatched falls back to an all-caps version of the raw phrase.
The fuzzy tier (2.5) is the key innovation: it catches novel LLM phrasings that don't appear in the synonym list but are lexically close to known synonyms.
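The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not Barnyard's implementation: the `SYNONYMS` table, the `similarity` helper, and the underscore-joined fallback are assumptions, and `difflib.SequenceMatcher` from the standard library stands in for rapidfuzz so the sketch has no third-party dependency (its scores differ from rapidfuzz's).

```python
from difflib import SequenceMatcher

# Hypothetical synonym table: the article does not list the real entries.
SYNONYMS = {
    "works at": "WORKS_AT",
    "is employed by": "WORKS_AT",
    "is on staff at": "WORKS_AT",
    "owns a stake in": "OWNS",
    "holds stake in": "SHAREHOLDER_OF",
}


def similarity(a: str, b: str) -> float:
    """Score two phrases on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def canonicalize(raw: str, threshold: float = 85.0) -> str:
    """Resolve a raw predicate phrase to a canonical predicate name."""
    phrase = " ".join(raw.lower().split())  # lower-case, collapse whitespace
    if not phrase:
        raise ValueError("empty predicate")

    # Exact synonym lookup.
    if phrase in SYNONYMS:
        return SYNONYMS[phrase]

    # Fuzzy match: take the best-scoring synonym, accept it above the threshold.
    best = max(SYNONYMS, key=lambda synonym: similarity(phrase, synonym))
    if similarity(phrase, best) >= threshold:
        return SYNONYMS[best]

    # Bidirectional substring check.
    for synonym, canonical in SYNONYMS.items():
        if synonym in phrase or phrase in synonym:
            return canonical

    # Fallback: upper-case the raw phrase (joining with "_" is an assumption).
    return phrase.upper().replace(" ", "_")
```

With this table, a misspelled phrase such as "is employd by" is caught by the fuzzy step, "currently works at the firm" is caught by the substring check, and an unknown phrase such as "plays golf with" falls through to the upper-case fallback.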
Canonical Predicate Ontology
The ontology covers 30+ canonical predicates across four domains:
Person relations (TACRED per:*)
WORKS_AT, LED_BY, LEADS, MARRIED_TO, PARENT_OF, CHILD_OF, SIBLING_OF, EDUCATED_AT, BORN_IN, DIED_IN, NATIONALITY, CHARGED_WITH
Organisation relations (TACRED org:*)
OWNS, ACQUIRED, ACQUIRED_BY, EMPLOYS, SHAREHOLDER_OF, AFFILIATED_WITH, HEADQUARTERED_IN, PART_OF, PARTNERS_WITH, COMPETES_WITH
Creation & Investment
CREATED_BY, PRODUCES, FUNDED_BY, INVESTS_IN, SUCCEEDS, SUCCEEDED_BY
Causality & Dependency (ConceptNet/SemEval)
USES, CAUSED_BY, CAUSES, REGULATED_BY, LOCATED_IN, RELATED_TO
Semantic Groups (TACRED-Aligned Taxonomy)
| Group | Predicates |
|---|---|
| EMPLOYMENT | WORKS_AT, EMPLOYS, LEADS, LED_BY |
| FAMILY | MARRIED_TO, PARENT_OF, CHILD_OF, SIBLING_OF |
| MEMBERSHIP | PART_OF, AFFILIATED_WITH |
| CONTROL | OWNS, ACQUIRED, ACQUIRED_BY, SHAREHOLDER_OF, SUCCEEDED_BY, SUCCEEDS |
| LOCATION | LOCATED_IN, HEADQUARTERED_IN, BORN_IN, DIED_IN, NATIONALITY |
| CREATION | CREATED_BY, PRODUCES |
| EDUCATION | EDUCATED_AT |
| LEGAL | REGULATED_BY, CHARGED_WITH |
| FUNDING | FUNDED_BY, INVESTS_IN |
| PARTNERSHIP | PARTNERS_WITH |
| COMPETITION | COMPETES_WITH |
| DEPENDENCY | USES, CAUSED_BY, CAUSES |
| RELATED | RELATED_TO (catch-all) |
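One way to represent the table above in code is a group-to-predicates mapping plus an inverted lookup. The names `SEMANTIC_GROUPS`, `PREDICATE_TO_GROUP`, and `group_of` are illustrative, and sending unknown predicates to RELATED is an assumption based on the "catch-all" note rather than confirmed behaviour.

```python
# Group -> canonical predicates, transcribed from the taxonomy table.
SEMANTIC_GROUPS = {
    "EMPLOYMENT": ["WORKS_AT", "EMPLOYS", "LEADS", "LED_BY"],
    "FAMILY": ["MARRIED_TO", "PARENT_OF", "CHILD_OF", "SIBLING_OF"],
    "MEMBERSHIP": ["PART_OF", "AFFILIATED_WITH"],
    "CONTROL": ["OWNS", "ACQUIRED", "ACQUIRED_BY", "SHAREHOLDER_OF",
                "SUCCEEDED_BY", "SUCCEEDS"],
    "LOCATION": ["LOCATED_IN", "HEADQUARTERED_IN", "BORN_IN", "DIED_IN",
                 "NATIONALITY"],
    "CREATION": ["CREATED_BY", "PRODUCES"],
    "EDUCATION": ["EDUCATED_AT"],
    "LEGAL": ["REGULATED_BY", "CHARGED_WITH"],
    "FUNDING": ["FUNDED_BY", "INVESTS_IN"],
    "PARTNERSHIP": ["PARTNERS_WITH"],
    "COMPETITION": ["COMPETES_WITH"],
    "DEPENDENCY": ["USES", "CAUSED_BY", "CAUSES"],
    "RELATED": ["RELATED_TO"],
}

# Inverted index: canonical predicate -> group name.
PREDICATE_TO_GROUP = {
    predicate: group
    for group, predicates in SEMANTIC_GROUPS.items()
    for predicate in predicates
}


def group_of(predicate: str, default: str = "RELATED") -> str:
    """Return the semantic group for a predicate (assumed default: RELATED)."""
    return PREDICATE_TO_GROUP.get(predicate, default)
```

Building the inverted index once keeps each lookup constant-time and makes it easy to check that every predicate belongs to exactly one group.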
Impact on RelationStar Summaries
Without semantic grouping, Apple's summary is a flat list: led by Tim Cook, acquired Beats and Shazam, located in Cupertino, produces iPhone and Mac.
With semantic grouping, those same facts are organised under category labels: CONTROL (acquired Beats and Shazam), EMPLOYMENT (led by Tim Cook), LOCATION (located in Cupertino), and CREATION (produces iPhone and Mac).
The category labels appear in the stored star_summary string, which is vectorized in Qdrant. A query like "who controls Apple?" is now more likely to retrieve the RelationStar because the "CONTROL" label is part of the embedded text.
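A grouped summary string of this kind could be assembled along these lines. The output layout, the `build_star_summary` name, and the trimmed `PREDICATE_GROUP` table are assumptions for illustration; the actual star_summary format is not shown in the article.

```python
from collections import defaultdict

# Trimmed, illustrative subset of the predicate -> group mapping.
PREDICATE_GROUP = {
    "ACQUIRED": "CONTROL",
    "LED_BY": "EMPLOYMENT",
    "LOCATED_IN": "LOCATION",
    "PRODUCES": "CREATION",
}


def build_star_summary(entity: str, facts: list[tuple[str, str]]) -> str:
    """Group (predicate, object) facts under category labels in one string."""
    grouped = defaultdict(list)
    for predicate, obj in facts:
        group = PREDICATE_GROUP.get(predicate, "RELATED")
        readable = predicate.lower().replace("_", " ")
        grouped[group].append(f"{readable} {obj}")

    # Sort groups by name so the summary text is deterministic.
    parts = [f"{group}: {', '.join(items)}"
             for group, items in sorted(grouped.items())]
    return f"{entity} | " + " | ".join(parts)


facts = [
    ("LED_BY", "Tim Cook"),
    ("ACQUIRED", "Beats"),
    ("ACQUIRED", "Shazam"),
    ("LOCATED_IN", "Cupertino"),
    ("PRODUCES", "iPhone"),
    ("PRODUCES", "Mac"),
]
summary = build_star_summary("Apple", facts)
```

Because the group names are written into the text that gets embedded, a query phrased around "control" shares vocabulary with the summary even when no stored predicate uses that word.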
Cross-Document Entity Deduplication
Vadalog-style grouping also drives cross-document entity deduplication in memify_graph_task:
- Load all Entity nodes for the user's space
- Group by GLiNER label (e.g. ORGANIZATION)
- Within each label group, compute pairwise rapidfuzz.fuzz.WRatio scores
- Entities with score ≥ threshold (default: 88) are merged: HAS_ENTITY and RELATES_TO edges are redirected to the canonical entity (longest name wins), and the alias is deleted
This prevents "Apple" and "Apple Inc." from appearing as separate nodes after two documents are ingested.
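The merge-candidate selection in the steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not the memify_graph_task code: `find_merges` and `score` are hypothetical names, the scorer is a standard-library approximation that takes the larger of a full ratio and a 0.9-scaled best-window ratio (loosely imitating how WRatio treats names of different length), and the Neo4j edge redirection is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def score(a: str, b: str) -> float:
    """Approximate 0-100 name similarity (stand-in for rapidfuzz WRatio)."""
    a, b = a.lower(), b.lower()
    short, long_ = sorted((a, b), key=len)
    full = SequenceMatcher(None, a, b).ratio()
    # Slide the shorter name across the longer one and keep the best window.
    windows = [long_[i:i + len(short)]
               for i in range(len(long_) - len(short) + 1)]
    partial = max(SequenceMatcher(None, short, w).ratio() for w in windows)
    return 100.0 * max(full, 0.9 * partial)


def find_merges(entities: list[tuple[str, str]],
                threshold: float = 88.0) -> dict[tuple[str, str], str]:
    """Map (label, alias_name) -> canonical_name for entities to merge."""
    by_label = defaultdict(list)
    for name, label in entities:
        by_label[label].append(name)

    merges = {}
    for label, names in by_label.items():
        # Compare only entities that share a label.
        for a, b in combinations(names, 2):
            if score(a, b) >= threshold:
                # Longest name wins as the canonical entity.
                canonical, alias = (a, b) if len(a) >= len(b) else (b, a)
                merges[(label, alias)] = canonical
    return merges
```

Given "Apple" and "Apple Inc." under ORGANIZATION, this returns a single merge pointing "Apple" at "Apple Inc.", while an "Apple" entity under a different label is left alone. The sketch does not resolve chains (A merged into B, B merged into C); a production version would need a union-find or similar pass before redirecting edges.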
Configuration
A single fuzzy-threshold setting controls how aggressive the matching is — a lower value merges more phrasings together, while a higher value only collapses near-identical ones.
At 85, "took a stake in" (WRatio ≈ 87 vs "holds stake in") maps to SHAREHOLDER_OF. At 95, only near-identical phrasings match.
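The effect of the threshold can be illustrated with a small sketch. The environment variable name `PREDICATE_FUZZY_THRESHOLD` is hypothetical (the article does not name the setting), and `difflib.SequenceMatcher` again stands in for rapidfuzz, so the scores here will not match the WRatio values quoted above.

```python
from difflib import SequenceMatcher

# Hypothetical setting name; the real configuration key is not documented here.
ENV_VAR = "PREDICATE_FUZZY_THRESHOLD"


def load_threshold(env: dict, default: float = 85.0) -> float:
    """Read the fuzzy threshold from a mapping such as os.environ."""
    value = float(env.get(ENV_VAR, default))
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"{ENV_VAR} must be between 0 and 100, got {value}")
    return value


def similarity(a: str, b: str) -> float:
    """Score two phrases on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_fuzzy_match(phrase: str, synonym: str, threshold: float) -> bool:
    """True when the phrase is close enough to the synonym to be merged."""
    return similarity(phrase, synonym) >= threshold
```

Under this stand-in scorer, "holds a stake in" scores about 93 against "holds stake in", so it matches at a threshold of 85 but not at 95.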