Knowledge Graphs | Enterprise AI

Vadalog Semantic Grouping: Structured Predicate Taxonomy for Knowledge Graphs

How Anatypical normalizes inconsistent LLM-extracted predicates into an ontology of 30+ canonical predicates across 13 semantic groups, preventing knowledge graph fragmentation.

Dawson Bauer

May 21, 2026

Overview

Raw relation extraction from LLMs produces inconsistent predicate labels. The same relationship may be expressed as "works at", "is employed by", or "is on staff at". Without normalization, each variant is stored as a different predicate, the knowledge graph fragments, and vector search misses related facts.

Anatypical's Vadalog-inspired semantic grouping system:

Canonicalizes raw predicates to a fixed ontology via synonym matching and fuzzy scoring
Groups canonical predicates into 13 named semantic categories (aligned with TACRED)
Enriches RelationStar summaries with category labels so retrieval queries find semantically related relations even without exact predicate matching

Predicate Canonicalization Pipeline

Three-Tier Resolution

Each raw predicate runs through a tiered resolution and stops at the first step that produces a match. First, an exact synonym lookup maps known phrasings straight to a canonical predicate. If that misses, a fuzzy match (using rapidfuzz, with a default threshold of 85) catches close variants: “took a controlling stake in” scores highly against “owns a stake in” and resolves to OWNS. If the fuzzy match also misses, a bidirectional substring check handles partial overlaps. Finally, anything still unmatched falls back to an all-caps version of the raw phrase.

The fuzzy tier (2.5) is the key innovation: it catches novel LLM phrasings that don't appear in the synonym list but are lexically close to known synonyms.
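The resolution order described above can be sketched in a few lines. This is a minimal illustration, not Anatypical's implementation: the `SYNONYMS` table, the `canonicalize` and `similarity` names, and the underscore-joined fallback are assumptions, and the standard library's `difflib.SequenceMatcher` stands in for the rapidfuzz scorer so the sketch runs without extra dependencies.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real one would be far larger.
SYNONYMS = {
    "works at": "WORKS_AT",
    "is employed by": "WORKS_AT",
    "is on staff at": "WORKS_AT",
    "owns a stake in": "OWNS",
    "holds stake in": "SHAREHOLDER_OF",
    "acquired": "ACQUIRED",
}


def similarity(a: str, b: str) -> float:
    """Return a 0-100 score (stdlib stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def canonicalize(raw: str, threshold: float = 85.0) -> str:
    """Resolve a raw predicate phrase to a canonical predicate name."""
    phrase = " ".join(raw.lower().split())  # lowercase, collapse whitespace

    # Exact synonym lookup.
    if phrase in SYNONYMS:
        return SYNONYMS[phrase]

    # Fuzzy match against every known synonym; keep the best score.
    best = max(SYNONYMS, key=lambda syn: similarity(phrase, syn))
    if similarity(phrase, best) >= threshold:
        return SYNONYMS[best]

    # Bidirectional substring check.
    for syn, canonical in SYNONYMS.items():
        if syn in phrase or phrase in syn:
            return canonical

    # Fallback: all-caps raw phrase (underscore joining is an assumption).
    return phrase.upper().replace(" ", "_")


if __name__ == "__main__":
    for raw in ["is employed by", "holds a stake in", "recently acquired"]:
        print(raw, "->", canonicalize(raw))
```

In this sketch, raising the threshold to 95 sends "holds a stake in" past the fuzzy tier and into the fallback, which is the trade-off the threshold setting controls.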

Canonical Predicate Ontology

The ontology covers 30+ canonical predicates across four domains:

Person relations (TACRED per:*)

WORKS_AT, LED_BY, LEADS, MARRIED_TO, PARENT_OF, CHILD_OF, SIBLING_OF, EDUCATED_AT, BORN_IN, DIED_IN, NATIONALITY, CHARGED_WITH

Organisation relations (TACRED org:*)

OWNS, ACQUIRED, ACQUIRED_BY, EMPLOYS, SHAREHOLDER_OF, AFFILIATED_WITH, HEADQUARTERED_IN, PART_OF, PARTNERS_WITH, COMPETES_WITH

Creation & Investment

CREATED_BY, PRODUCES, FUNDED_BY, INVESTS_IN, SUCCEEDS, SUCCEEDED_BY

Causality & Dependency (ConceptNet/SemEval)

USES, CAUSED_BY, CAUSES, REGULATED_BY, LOCATED_IN, RELATED_TO

Semantic Groups (TACRED-Aligned Taxonomy)

Group	Predicates
`EMPLOYMENT`	`WORKS_AT`, `EMPLOYS`, `LEADS`, `LED_BY`
`FAMILY`	`MARRIED_TO`, `PARENT_OF`, `CHILD_OF`, `SIBLING_OF`
`MEMBERSHIP`	`PART_OF`, `AFFILIATED_WITH`
`CONTROL`	`OWNS`, `ACQUIRED`, `ACQUIRED_BY`, `SHAREHOLDER_OF`, `SUCCEEDED_BY`, `SUCCEEDS`
`LOCATION`	`LOCATED_IN`, `HEADQUARTERED_IN`, `BORN_IN`, `DIED_IN`, `NATIONALITY`
`CREATION`	`CREATED_BY`, `PRODUCES`
`EDUCATION`	`EDUCATED_AT`
`LEGAL`	`REGULATED_BY`, `CHARGED_WITH`
`FUNDING`	`FUNDED_BY`, `INVESTS_IN`
`PARTNERSHIP`	`PARTNERS_WITH`
`COMPETITION`	`COMPETES_WITH`
`DEPENDENCY`	`USES`, `CAUSED_BY`, `CAUSES`
`RELATED`	`RELATED_TO` (catch-all)
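One way to hold this taxonomy in code is a plain mapping from group to predicates, inverted once for lookups. The sketch below copies the table above; the names `SEMANTIC_GROUPS`, `PREDICATE_TO_GROUP`, and `group_of` are illustrative, and sending unknown predicates to `RELATED` is an assumption based on its catch-all role.

```python
# Group -> canonical predicates, copied from the table above.
SEMANTIC_GROUPS = {
    "EMPLOYMENT": ["WORKS_AT", "EMPLOYS", "LEADS", "LED_BY"],
    "FAMILY": ["MARRIED_TO", "PARENT_OF", "CHILD_OF", "SIBLING_OF"],
    "MEMBERSHIP": ["PART_OF", "AFFILIATED_WITH"],
    "CONTROL": ["OWNS", "ACQUIRED", "ACQUIRED_BY", "SHAREHOLDER_OF",
                "SUCCEEDED_BY", "SUCCEEDS"],
    "LOCATION": ["LOCATED_IN", "HEADQUARTERED_IN", "BORN_IN", "DIED_IN",
                 "NATIONALITY"],
    "CREATION": ["CREATED_BY", "PRODUCES"],
    "EDUCATION": ["EDUCATED_AT"],
    "LEGAL": ["REGULATED_BY", "CHARGED_WITH"],
    "FUNDING": ["FUNDED_BY", "INVESTS_IN"],
    "PARTNERSHIP": ["PARTNERS_WITH"],
    "COMPETITION": ["COMPETES_WITH"],
    "DEPENDENCY": ["USES", "CAUSED_BY", "CAUSES"],
    "RELATED": ["RELATED_TO"],
}

# Inverted index for constant-time predicate -> group lookups.
PREDICATE_TO_GROUP = {
    predicate: group
    for group, predicates in SEMANTIC_GROUPS.items()
    for predicate in predicates
}


def group_of(predicate: str) -> str:
    """Return the semantic group; unknown predicates go to RELATED (assumed)."""
    return PREDICATE_TO_GROUP.get(predicate, "RELATED")


if __name__ == "__main__":
    print(group_of("ACQUIRED"), group_of("BORN_IN"))
```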

Impact on RelationStar Summaries

Without semantic grouping, Apple’s summary is a flat list: led by Tim Cook, acquired Beats and Shazam, located in Cupertino, produces iPhone and Mac.

With semantic grouping, those same facts are organised under category labels: CONTROL (acquired Beats and Shazam), EMPLOYMENT (led by Tim Cook), LOCATION (located in Cupertino), and CREATION (produces iPhone and Mac).

The category labels appear in the stored star_summary string, which is vectorized in Qdrant. A query like "who controls Apple?" can now retrieve the RelationStar because the CONTROL label is part of the text that gets embedded.
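To make the idea concrete, here is a small sketch that builds a grouped summary string from canonical triples. The output format, the `build_star_summary` name, and the trimmed `GROUP_OF` map are assumptions for illustration; the actual star_summary layout is not specified here.

```python
from collections import defaultdict

# Trimmed predicate -> group map, just enough for the example.
GROUP_OF = {
    "ACQUIRED": "CONTROL",
    "LED_BY": "EMPLOYMENT",
    "LOCATED_IN": "LOCATION",
    "PRODUCES": "CREATION",
}


def build_star_summary(entity: str, relations: list[tuple[str, str]]) -> str:
    """Group (predicate, target) pairs under category labels in one string."""
    grouped: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for predicate, target in relations:
        grouped[GROUP_OF.get(predicate, "RELATED")][predicate].append(target)

    parts = []
    for group in sorted(grouped):  # stable, alphabetical group order
        facts = []
        for predicate, targets in grouped[group].items():
            verb = predicate.lower().replace("_", " ")
            facts.append(verb + " " + ", ".join(targets))
        parts.append(group + ": " + "; ".join(facts))
    return entity + " | " + " | ".join(parts)


if __name__ == "__main__":
    apple = [
        ("LED_BY", "Tim Cook"),
        ("ACQUIRED", "Beats"),
        ("ACQUIRED", "Shazam"),
        ("LOCATED_IN", "Cupertino"),
        ("PRODUCES", "iPhone"),
        ("PRODUCES", "Mac"),
    ]
    print(build_star_summary("Apple", apple))
```

Because the label is written into the string itself, whatever embedding model vectorizes the summary sees "CONTROL" next to the acquisition facts.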

Cross-Document Entity Deduplication

Vadalog-style grouping also drives cross-document entity deduplication in memify_graph_task:

Load all Entity nodes for the user's space
Group by GLiNER label (e.g. ORGANIZATION)
Within each label group, compute pairwise rapidfuzz.WRatio
Entities with score ≥ threshold (default: 88) are merged: HAS_ENTITY and RELATES_TO edges are redirected to the canonical entity (longest name wins), and the alias is deleted

This prevents "Apple" and "Apple Inc." from appearing as separate nodes after two documents are ingested.
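The merge planning step can be sketched as follows. This is a simplified illustration under several assumptions: `plan_merges` and `similarity` are hypothetical names, the scorer is a crude standard-library stand-in (containment counts as a strong match) rather than rapidfuzz.WRatio, and the greedy comparison against already-chosen canonical names replaces a full pairwise pass. The graph side (redirecting HAS_ENTITY and RELATES_TO edges, deleting the alias) is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Crude 0-100 stand-in scorer; not equivalent to rapidfuzz.WRatio."""
    a, b = a.lower(), b.lower()
    score = 100.0 * SequenceMatcher(None, a, b).ratio()
    shorter, longer = sorted((a, b), key=len)
    if shorter and shorter in longer:
        score = max(score, 90.0)  # treat containment as a strong match
    return score


def plan_merges(
    entities: list[tuple[str, str]],
    threshold: float = 88.0,
    scorer: Callable[[str, str], float] = similarity,
) -> dict[tuple[str, str], str]:
    """Map (label, alias) -> canonical name for entities that should merge."""
    by_label: dict[str, set[str]] = defaultdict(set)
    for name, label in entities:
        by_label[label].add(name)

    merges: dict[tuple[str, str], str] = {}
    for label, names in by_label.items():
        canonicals: list[str] = []
        # Longest names first, so the longest name becomes the canonical one.
        for name in sorted(names, key=lambda n: (-len(n), n)):
            match = next((c for c in canonicals if scorer(name, c) >= threshold), None)
            if match is None:
                canonicals.append(name)
            else:
                merges[(label, name)] = match
    return merges


if __name__ == "__main__":
    nodes = [("Apple", "ORGANIZATION"), ("Apple Inc.", "ORGANIZATION"),
             ("Microsoft", "ORGANIZATION")]
    print(plan_merges(nodes))
```

Since `scorer` is a parameter, a real scorer such as `rapidfuzz.fuzz.WRatio` can be passed in without changing the rest of the function.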

Configuration

A single fuzzy-threshold setting controls how aggressive the matching is — a lower value merges more phrasings together, while a higher value only collapses near-identical ones.

At 85, "took a stake in" (WRatio ≈ 87 vs "holds stake in") maps to SHAREHOLDER_OF. At 95, only near-identical phrasings match.
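As a rough sketch of how such a setting might be loaded and applied, the snippet below reads the threshold from an environment mapping and compares a score against it. The variable name `FUZZY_THRESHOLD` and both function names are hypothetical; the real setting name is not given here.

```python
import os
from typing import Mapping, Optional


def load_fuzzy_threshold(
    env: Optional[Mapping[str, str]] = None, default: float = 85.0
) -> float:
    """Read the threshold from env (hypothetical FUZZY_THRESHOLD), else default."""
    env = os.environ if env is None else env
    raw = env.get("FUZZY_THRESHOLD")
    if raw is None:
        return default
    value = float(raw)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"fuzzy threshold must be in [0, 100], got {value}")
    return value


def is_match(score: float, threshold: float) -> bool:
    """A candidate is accepted when its score reaches the threshold."""
    return score >= threshold


if __name__ == "__main__":
    score = 87.0  # the approximate score quoted in the text above
    for setting in ({}, {"FUZZY_THRESHOLD": "95"}):
        threshold = load_fuzzy_threshold(setting)
        print(threshold, is_match(score, threshold))
```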
