
Entity and Relation Extraction & Compression

A deep dive into Anatypical's two-phase pipeline: GLiNER for zero-shot NER and a single-pass LLM for relation triplets, followed by cross-document deduplication and pre-materialized RelationStar summaries.

Dawson Bauer

June 4, 2026

Overview

The Anatypical pipeline extracts structured knowledge from unstructured text in two extraction phases followed by a compression step:

Entity Extraction — GLiNER (zero-shot NER) identifies named entities from document chunks with no task-specific training
Relation Extraction — A single-pass LLM call per chunk extracts (subject, predicate, object) triplets
Compression — Redundant entities and relations are deduplicated, normalized, and merged before Neo4j writes

The result is a knowledge graph where each document contributes structured facts that can be queried semantically, traversed relationally, and combined with retrieval-augmented generation.

Phase 1: Entity Extraction with GLiNER

Why GLiNER

Traditional NER pipelines, such as the pretrained models shipped with spaCy or NLTK, are trained on a fixed set of entity types (PERSON, ORG, LOC). Domain-specific documents such as financial reports, legal contracts, and medical literature contain entity types these models miss.

GLiNER (Generalist and Lightweight Named Entity Recognition) is a zero-shot NER model that accepts arbitrary entity type labels at inference time. No fine-tuning required.

Configured entity types:

Anatypical configures GLiNER with a broad set of entity types tuned for enterprise documents — people, organizations, locations, products, events, technologies, concepts, dates, monetary values, regulations, documents, roles, metrics, and industries.

Chunking Strategy

GLiNER operates on individual chunks (typically 500-1000 tokens). Within the ingestion pipeline:

The full pipeline runs as LangGraph nodes: load → split → extract_nlp → cognify
The extract_nlp node calls GLiNER on each chunk, collecting {text, label, score} dicts
Chunks and entity candidates are passed to extract_graph_task via Celery

Cap: nlp_chunk_cap: 100 — GLiNER is run on up to 100 chunks per document to bound memory and time.
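The per-chunk extraction and the chunk cap can be sketched as follows. This is a minimal illustration, not Anatypical's actual extract_nlp node: the function name, the label strings, and the 0.5 threshold are assumptions. The only library call it relies on is GLiNER's predict_entities(text, labels, threshold=...), which returns dicts containing text, label, and score keys.

```python
# Illustrative subset of the entity types described above; the exact label
# strings Anatypical passes to GLiNER are not given in this article.
LABELS = ["person", "organization", "location", "product", "date", "monetary value"]

NLP_CHUNK_CAP = 100  # mirrors nlp_chunk_cap


def extract_entities(model, chunks, labels=LABELS, threshold=0.5, chunk_cap=NLP_CHUNK_CAP):
    """Run a GLiNER-style model over at most `chunk_cap` chunks.

    `model` is any object exposing predict_entities(text, labels, threshold=...),
    as GLiNER does. Returns one list of {text, label, score} dicts per
    processed chunk, in chunk order.
    """
    per_chunk = []
    for chunk in chunks[:chunk_cap]:  # chunks beyond the cap are skipped
        entities = model.predict_entities(chunk, labels, threshold=threshold)
        per_chunk.append(
            [{"text": e["text"], "label": e["label"], "score": e["score"]} for e in entities]
        )
    return per_chunk
```

With the real library, `model` would come from `GLiNER.from_pretrained(...)` in the `gliner` package; which checkpoint Anatypical loads is not stated here.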

Phase 2: Relation Extraction with LLM

1-Pass vs. Multi-Pass

Cognee's default cognify pipeline uses three or more LLM passes per chunk (extract entities, then relations, then canonicalize). Anatypical replaces this with a single LLM call per chunk that extracts entity-relation-entity triplets directly.

The model is given the text and asked to extract every subject–predicate–object relationship as JSON. For a sentence like “Tim Cook has served as CEO of Apple since 2011,” it returns triplets such as Tim Cook → leads → Apple, and Tim Cook → has served since → 2011.

Cap: relation_chunk_cap: 50. Only the first 50 chunks are processed for relations, with up to 10 concurrent LLM calls (relation_concurrency: 10) bounded by an asyncio semaphore.
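A rough sketch of the capped, semaphore-bounded extraction loop is shown below. The prompt wording, the JSON shape, and the `call_llm` coroutine are all hypothetical stand-ins, since the actual prompt and LLM client are not shown in this article. Only the cap of 50 chunks and the concurrency of 10 come from the text.

```python
import asyncio
import json

RELATION_CHUNK_CAP = 50    # mirrors relation_chunk_cap
RELATION_CONCURRENCY = 10  # mirrors relation_concurrency

# Hypothetical prompt; the real wording is not given in the article.
PROMPT = (
    "Extract every subject-predicate-object relationship from the text below. "
    'Respond with a JSON list of objects with keys "subject", "predicate", "object".\n\n'
    "Text:\n{chunk}"
)


def parse_triplets(raw):
    """Parse the model's JSON reply into (subject, predicate, object) tuples.

    Malformed replies and incomplete items are dropped rather than raised.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    triplets = []
    for item in items:
        if not isinstance(item, dict):
            continue
        values = [item.get(key) for key in ("subject", "predicate", "object")]
        if all(isinstance(v, str) and v.strip() for v in values):
            triplets.append(tuple(v.strip() for v in values))
    return triplets


async def extract_relations(chunks, call_llm, chunk_cap=RELATION_CHUNK_CAP,
                            concurrency=RELATION_CONCURRENCY):
    """One LLM call per chunk, at most `concurrency` calls in flight at once.

    `call_llm` is an assumed async callable: prompt string in, reply string out.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def one_chunk(chunk):
        async with semaphore:  # blocks while `concurrency` calls are running
            raw = await call_llm(PROMPT.format(chunk=chunk))
        return parse_triplets(raw)

    per_chunk = await asyncio.gather(*(one_chunk(c) for c in chunks[:chunk_cap]))
    return [triplet for triplets in per_chunk for triplet in triplets]
```

Tolerating malformed JSON by returning an empty list is a design choice of this sketch: one bad reply costs the triplets of one chunk instead of failing the whole document.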

Predicate Canonicalization

Raw predicates are immediately canonicalized to prevent graph fragmentation. "leads", "is CEO of", "serves as CEO" all resolve to LEADS.
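One simple way to picture the canonicalization step is a lookup table from normalized raw phrases to canonical predicates. The sketch below assumes a hand-written alias table and an upper-snake-case fallback for unknown phrases; neither is confirmed as Anatypical's mechanism, and only the three LEADS aliases come from the text.

```python
import re

# Hypothetical alias table. Only the LEADS examples come from the article.
PREDICATE_ALIASES = {
    "LEADS": {"leads", "is ceo of", "serves as ceo"},
}

# Invert the table once: normalized raw phrase -> canonical predicate.
_LOOKUP = {
    alias: canonical
    for canonical, aliases in PREDICATE_ALIASES.items()
    for alias in aliases
}


def canonicalize_predicate(raw):
    """Map a raw predicate phrase to its canonical form.

    Known aliases resolve through the table. Unknown phrases fall back to
    an upper-snake-case version of the phrase (an assumption of this sketch).
    """
    key = re.sub(r"\s+", " ", raw.strip().lower())
    if key in _LOOKUP:
        return _LOOKUP[key]
    return re.sub(r"[^a-z0-9]+", "_", key).strip("_").upper()
```

For example, `canonicalize_predicate("is CEO of")` and `canonicalize_predicate("leads")` both return `"LEADS"`, so the two extractions land on the same edge type.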

Phase 3: Compression and Deduplication

Entity Deduplication

Within a single document (ingest-time):

Entity IDs are deterministic: MD5(name.lower() + ":" + label). This means "Apple" appearing in 20 chunks produces exactly one Entity node.
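The ID formula above translates almost directly into Python. The UTF-8 encoding and the use of the hex digest as the ID string are assumptions of this sketch, as the article does not say how the hash is serialized.

```python
import hashlib


def entity_id(name, label):
    """Deterministic entity ID: MD5 of the lowercased name, a colon, and the label.

    MD5 serves here as a stable fingerprint, not as a security measure.
    """
    key = name.lower() + ":" + label
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the name is lowercased before hashing, "Apple" and "APPLE" with the same label map to the same ID, while the same name under a different label yields a different ID.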

Cross-chunk surface-form clustering via rapidfuzz:

Surface forms are clustered with fuzzy string matching — for example, “Apple Inc” and “Apple” within the same document score highly enough to be merged into a single canonical entity.

Cross-document (memify-time):

cross_document_entity_dedup() runs as the first step of memify_graph_task:

Load all Entity nodes for the user's space from Neo4j
Group by GLiNER label (only compare same-type entities)
Pairwise fuzzy matching within groups
Union-Find clustering: canonical = longest name
Redirect HAS_ENTITY, RELATES_TO, and CanonicalRelation edges to canonical
DETACH DELETE alias entities
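The grouping, pairwise matching, and Union-Find steps above can be sketched in plain Python. To stay dependency-free, this sketch swaps rapidfuzz for a stand-in similarity function built on the standard library's difflib plus a token-subset check. The scorer, the 0.9 threshold, and the entity dict shape are all assumptions. The Neo4j steps (edge redirection and DETACH DELETE) are represented only by the returned alias-to-canonical mapping.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in scorer (not rapidfuzz): 1.0 when one name's tokens are a
    subset of the other's, otherwise difflib's character-level ratio."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    if a_tokens and b_tokens and (a_tokens <= b_tokens or b_tokens <= a_tokens):
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find(parent, x):
    """Union-Find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedup_entities(entities, threshold=0.9):
    """entities: list of {"id", "name", "label"} dicts.

    Returns {alias_id: canonical_id}, where each cluster's canonical entity
    is the member with the longest name.
    """
    by_label = defaultdict(list)
    for ent in entities:
        by_label[ent["label"]].append(ent)  # only same-label entities are compared

    parent = {ent["id"]: ent["id"] for ent in entities}
    for group in by_label.values():
        for a, b in combinations(group, 2):
            if similarity(a["name"], b["name"]) >= threshold:
                parent[find(parent, a["id"])] = find(parent, b["id"])  # union

    clusters = defaultdict(list)
    for ent in entities:
        clusters[find(parent, ent["id"])].append(ent)

    redirects = {}
    for members in clusters.values():
        canonical = max(members, key=lambda e: len(e["name"]))
        for ent in members:
            if ent["id"] != canonical["id"]:
                redirects[ent["id"]] = canonical["id"]
    return redirects
```

The token-subset rule is deliberately permissive: it merges "Apple" into "Apple Inc", but it would also merge "Apple" into "Apple Store". A production scorer and threshold would need tuning against such cases. Pairwise comparison is quadratic in group size, which is one reason to compare only within a label group.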

Relation Deduplication

CanonicalRelation nodes are deduplicated on (canonical_predicate, from_entity_id, to_entity_id, user_id). If the same relation is extracted from two different chunks:

The first write creates the CanonicalRelation node
The second write adds the raw predicate to raw_predicates: List[str] on the existing node
No duplicate nodes are created
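The write behavior described above amounts to an upsert keyed on the four-part tuple. The in-memory sketch below illustrates that logic only. It is not the Neo4j write path, the class and method names are hypothetical, and skipping a raw predicate that is already recorded is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalRelation:
    canonical_predicate: str
    from_entity_id: str
    to_entity_id: str
    user_id: str
    raw_predicates: list = field(default_factory=list)


class RelationStore:
    """In-memory stand-in for the graph store, keyed on the dedup tuple."""

    def __init__(self):
        self._nodes = {}

    def upsert(self, canonical_predicate, from_entity_id, to_entity_id, user_id, raw_predicate):
        key = (canonical_predicate, from_entity_id, to_entity_id, user_id)
        node = self._nodes.get(key)
        if node is None:
            # First write: create the CanonicalRelation node.
            node = CanonicalRelation(*key)
            self._nodes[key] = node
        if raw_predicate not in node.raw_predicates:
            # Later writes: record the new raw phrase on the existing node.
            node.raw_predicates.append(raw_predicate)
        return node

    def __len__(self):
        return len(self._nodes)
```

Writing "leads" and then "is CEO of" for the same entity pair and user leaves one node whose raw_predicates holds both phrases.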

Phase 4: RelationStar Pre-Materialization

After entity and relation nodes are stored, extract_graph_task generates one RelationStar per entity — a pre-aggregated summary of all relations in which that entity participates:

For instance, Apple’s RelationStar gathers everything it participates in — acquisitions like Beats and Shazam, its leadership by Tim Cook, its Cupertino headquarters, and the products it makes such as iPhone, Mac, and iPad — into a single summary.
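As an illustration of the aggregation, the sketch below groups relations by participating entity and renders each group as a single line of text suitable for embedding. The function name, the input shape, and the "is object of" wording for incoming relations are assumptions. Only the PREDICATE (Target) style follows the "ACQUIRED (Beats)" form used in this article.

```python
from collections import defaultdict


def build_relation_stars(relations, names):
    """Build one text summary per entity.

    relations: iterable of (from_id, canonical_predicate, to_id) tuples.
    names: dict mapping entity id to display name.
    Returns {entity_id: summary string}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for from_id, predicate, to_id in relations:
        # Outgoing side: the subject lists the object under the predicate.
        grouped[from_id][predicate].append(names[to_id])
        # Incoming side: the object also records that it participates.
        grouped[to_id]["is object of " + predicate].append(names[from_id])

    stars = {}
    for entity_id, by_predicate in grouped.items():
        parts = [
            f"{predicate} ({', '.join(targets)})"
            for predicate, targets in sorted(by_predicate.items())
        ]
        stars[entity_id] = f"{names[entity_id]}: " + "; ".join(parts)
    return stars
```

Each summary string would then be embedded and stored alongside its entity ID, so a vector search can return the whole neighborhood of an entity in one hit.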

RelationStars are vectorized in Qdrant. A semantic query "companies that own music labels" retrieves Apple's RelationStar because "ACQUIRED (Beats)" is semantically close.

This avoids expensive multi-hop graph traversal at query time: instead of MATCH (a)-[r]-(b) across thousands of edges, the retrieval system finds pre-materialized summaries in milliseconds.

Storage Schema

On the storage side, Neo4j holds the core node types — entities, triplets, canonical relations (each carrying its canonical predicate plus the raw phrases that mapped to it), and RelationStars — every node tagged by user and space. Qdrant holds parallel vector collections for entity names, triplet text, canonical predicates, and the RelationStar summaries and names, enabling semantic search at each level.

Performance Characteristics

Operation	Concurrency	Cap	Typical Time
GLiNER entity extraction	Sequential per chunk	100 chunks	2-5 min/doc
LLM relation extraction	10 concurrent	50 chunks	1-3 min/doc
Neo4j + Qdrant write	Batch	All entities	30-60s
RelationStar generation	Sequential	Per entity	5-20s total

All operations run in Celery workers on the heavy_llm_queue. Documents are fully available for retrieval after extract_graph_task completes, before memify_graph_task runs.
