Entity and Relation Extraction & Compression
A deep dive into Barnyard's two-phase pipeline: GLiNER for zero-shot NER and a single-pass LLM for relation triplets, followed by cross-document deduplication and pre-materialized RelationStar summaries.
Overview
The Barnyard pipeline extracts structured knowledge from unstructured text with two extraction phases followed by a compression step:
- Entity Extraction — GLiNER (zero-shot NER) identifies named entities from document chunks with no task-specific training
- Relation Extraction — A single-pass LLM call per chunk extracts (subject, predicate, object) triplets
- Compression — Redundant entities and relations are deduplicated, normalized, and merged before Neo4j writes
The result is a knowledge graph where each document contributes structured facts that can be queried semantically, traversed relationally, and combined with retrieval-augmented generation.
Phase 1: Entity Extraction with GLiNER
Why GLiNER
Traditional NER toolkits (spaCy, NLTK) ship models trained on a fixed set of entity types (PERSON, ORG, LOC). Domain-specific documents such as financial reports, legal contracts, and medical literature contain entity types these models miss.
GLiNER (Generalist and Lightweight Named Entity Recognition) is a zero-shot NER model that accepts arbitrary entity type labels at inference time. No fine-tuning required.
Configured entity types:
Barnyard configures GLiNER with a broad set of entity types tuned for enterprise documents — people, organizations, locations, products, events, technologies, concepts, dates, monetary values, regulations, documents, roles, metrics, and industries.
Chunking Strategy
GLiNER operates on individual chunks (typically 500-1000 tokens). Barnyard:
- Runs the full ingestion pipeline via LangGraph nodes: load → split → extract_nlp → cognify
- The extract_nlp node calls GLiNER on each chunk, collecting {text, label, score} dicts
- Chunks + entity candidates are passed to extract_graph_task via Celery
Cap: nlp_chunk_cap: 100 — GLiNER is run on up to 100 chunks per document to bound memory and time.
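The capped per-chunk loop can be sketched roughly as follows. This is a minimal illustration, not Barnyard's actual extract_nlp node: the function name, the label subset, and the 0.5 threshold are assumptions. The predictor is passed in as a callable so the sketch runs without a model download; with the gliner package installed, a loaded model's predict_entities method has the expected shape.

```python
from typing import Callable, Dict, List

# Any callable shaped like GLiNER's predict_entities(text, labels, threshold=...).
Predictor = Callable[..., List[dict]]

# Illustrative subset only; the full configured label list is not shown in this article.
LABELS = ["person", "organization", "location", "product", "date"]


def extract_entities(
    chunks: List[str],
    predict: Predictor,
    labels: List[str] = LABELS,
    chunk_cap: int = 100,    # mirrors nlp_chunk_cap
    threshold: float = 0.5,  # assumed confidence cutoff
) -> Dict[int, List[dict]]:
    """Run the predictor on at most chunk_cap chunks, keeping text/label/score."""
    results: Dict[int, List[dict]] = {}
    for index, chunk in enumerate(chunks[:chunk_cap]):
        spans = predict(chunk, labels, threshold=threshold)
        # Keep only the fields the article says are collected.
        results[index] = [
            {"text": s["text"], "label": s["label"], "score": s["score"]}
            for s in spans
        ]
    return results


# With the real library (not executed here):
#   from gliner import GLiNER
#   model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")  # example checkpoint
#   entities = extract_entities(chunks, model.predict_entities)
```

Chunks beyond the cap are simply skipped in this sketch, which is what bounds memory and time for very long documents.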
Phase 2: Relation Extraction with LLM
1-Pass vs. Multi-Pass
Cognee's default cognify pipeline uses 3+ LLM passes per chunk (extract entities, then relations, then canonicalize). Barnyard replaces this with a single LLM call per chunk that extracts entity-relation-entity triplets directly:
The model is given the text and asked to extract every subject–predicate–object relationship as JSON. For a sentence like “Tim Cook has served as CEO of Apple since 2011,” it returns triplets such as Tim Cook → leads → Apple, and Tim Cook → has served since → 2011.
Cap: relation_chunk_cap: 50 — only the first 50 chunks are processed for relations, with relation_concurrency: 10 concurrent LLM calls via asyncio semaphore.
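A semaphore-bounded version of this step might look like the sketch below. The prompt wording, the call_llm client, and the JSON key names are hypothetical; only the cap of 50, the concurrency of 10, and the use of an asyncio semaphore come from the description above.

```python
import asyncio
import json
from typing import Awaitable, Callable, List, Tuple

Triplet = Tuple[str, str, str]

# Hypothetical prompt; the real prompt is not shown in this article.
PROMPT = (
    "Extract every subject-predicate-object relationship from the text below. "
    'Reply with a JSON list of objects with keys "subject", "predicate", "object".\n\n'
)


async def extract_relations(
    chunks: List[str],
    call_llm: Callable[[str], Awaitable[str]],  # hypothetical async LLM client
    chunk_cap: int = 50,    # mirrors relation_chunk_cap
    concurrency: int = 10,  # mirrors relation_concurrency
) -> List[Triplet]:
    semaphore = asyncio.Semaphore(concurrency)

    async def one_chunk(chunk: str) -> List[Triplet]:
        async with semaphore:  # at most `concurrency` LLM calls in flight
            raw = await call_llm(PROMPT + chunk)
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            return []  # skip chunks where the model returned invalid JSON
        if not isinstance(items, list):
            return []
        return [
            (i["subject"], i["predicate"], i["object"])
            for i in items
            if isinstance(i, dict) and {"subject", "predicate", "object"} <= i.keys()
        ]

    # One LLM call per chunk, first chunk_cap chunks only.
    per_chunk = await asyncio.gather(*(one_chunk(c) for c in chunks[:chunk_cap]))
    return [t for triplets in per_chunk for t in triplets]
```

Malformed model output is dropped rather than retried here; a production pipeline would likely want retries or logging at that point.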
Predicate Canonicalization
Raw predicates are immediately canonicalized to prevent graph fragmentation. "leads", "is CEO of", "serves as CEO" all resolve to LEADS.
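One simple way to picture canonicalization is a normalized lookup table. The sketch below is illustrative only: the table contains just the three examples from the text, and the UPPER_SNAKE_CASE fallback for unknown predicates is an assumption rather than documented Barnyard behavior.

```python
import re

# Hypothetical lookup table; only these LEADS examples come from the text above.
CANONICAL_PREDICATES = {
    "leads": "LEADS",
    "is ceo of": "LEADS",
    "serves as ceo": "LEADS",
}


def canonicalize(raw_predicate: str) -> str:
    """Map a raw predicate to its canonical form.

    Unknown predicates fall back to UPPER_SNAKE_CASE (an assumed behavior).
    """
    # Normalize case and whitespace so trivial variants hit the same key.
    key = re.sub(r"\s+", " ", raw_predicate.strip().lower())
    if key in CANONICAL_PREDICATES:
        return CANONICAL_PREDICATES[key]
    return re.sub(r"[^a-z0-9]+", "_", key).strip("_").upper()
```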
Phase 3: Compression and Deduplication
Entity Deduplication
Within a single document (ingest-time):
Entity IDs are deterministic: MD5(name.lower() + ":" + label). This means "Apple" appearing in 20 chunks produces exactly one Entity node.
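In Python, the ID formula could be written as below. The UTF-8 encoding and the hex digest output are assumptions, since the text only gives the MD5 input string.

```python
import hashlib


def entity_id(name: str, label: str) -> str:
    """Deterministic ID: MD5 of the lowercased name, a colon, and the label."""
    key = name.lower() + ":" + label
    # UTF-8 encoding and hex output are assumed, not stated in the article.
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the name is lowercased before hashing, "Apple" and "APPLE" with the same label map to the same ID, while the same name under a different label gets a different one.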
Cross-chunk surface-form clustering via rapidfuzz:
Surface forms are clustered with fuzzy string matching — for example, “Apple Inc” and “Apple” within the same document score highly enough to be merged into a single canonical entity.
Cross-document (memify-time):
cross_document_entity_dedup() runs as the first step of memify_graph_task:
- Load all Entity nodes for the user's space from Neo4j
- Group by GLiNER label (only compare same-type entities)
- Pairwise fuzzy matching within groups
- Union-Find clustering: canonical = longest name
- Redirect HAS_ENTITY, RELATES_TO, and CanonicalRelation edges to canonical
- DETACH DELETE alias entities
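The grouping, pairwise matching, and Union-Find steps above can be sketched in memory as follows. To stay dependency-free, the sketch swaps rapidfuzz for the standard library's difflib.SequenceMatcher. The 0.7 threshold and the function names are illustrative assumptions. Loading from Neo4j, redirecting edges, and deleting aliases are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (name, label)


def similarity(a: str, b: str) -> float:
    """Stdlib stand-in for a rapidfuzz scorer; returns a value in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find(parent: Dict[int, int], i: int) -> int:
    """Union-Find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def dedup_entities(entities: List[Entity], threshold: float = 0.7) -> Dict[Entity, str]:
    """Map each (name, label) to the canonical (longest) name of its cluster."""
    parent = {i: i for i in range(len(entities))}

    # Group indices by label so only same-type entities are compared.
    by_label: Dict[str, List[int]] = {}
    for i, (_, label) in enumerate(entities):
        by_label.setdefault(label, []).append(i)

    # Pairwise fuzzy matching within each group; union the matches.
    for indices in by_label.values():
        for i, j in combinations(indices, 2):
            if similarity(entities[i][0], entities[j][0]) >= threshold:
                parent[find(parent, i)] = find(parent, j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(entities)):
        clusters.setdefault(find(parent, i), []).append(i)

    canonical: Dict[Entity, str] = {}
    for members in clusters.values():
        longest = max((entities[m][0] for m in members), key=len)
        for m in members:
            canonical[entities[m]] = longest
    return canonical
```

Pairwise comparison is quadratic in the size of each label group, so grouping by label first also keeps the number of comparisons down.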
Relation Deduplication
CanonicalRelation nodes are deduplicated on (canonical_predicate, from_entity_id, to_entity_id, user_id). If the same relation is extracted from two different chunks:
- The first write creates the CanonicalRelation node
- The second write adds the raw predicate to raw_predicates: List[str] on the existing node
- No duplicate nodes are created
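The write behavior described above amounts to an upsert keyed on the four-field tuple. The sketch below models it with an in-memory dict instead of Neo4j. The upsert_relation helper is hypothetical, and two details are assumptions: that the first write also records its raw predicate, and that repeated raw predicates are not appended twice.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

RelationKey = Tuple[str, str, str, str]


@dataclass
class CanonicalRelation:
    canonical_predicate: str
    from_entity_id: str
    to_entity_id: str
    user_id: str
    raw_predicates: List[str] = field(default_factory=list)


def upsert_relation(
    store: Dict[RelationKey, CanonicalRelation],
    canonical_predicate: str,
    from_entity_id: str,
    to_entity_id: str,
    user_id: str,
    raw_predicate: str,
) -> CanonicalRelation:
    """Create the node on first write; later writes only extend raw_predicates."""
    key = (canonical_predicate, from_entity_id, to_entity_id, user_id)
    node = store.get(key)
    if node is None:
        node = CanonicalRelation(*key)
        store[key] = node
    # Assumed: identical raw phrases are stored once.
    if raw_predicate not in node.raw_predicates:
        node.raw_predicates.append(raw_predicate)
    return node
```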
Phase 4: RelationStar Pre-Materialization
After entity and relation nodes are stored, extract_graph_task generates one RelationStar per entity — a pre-aggregated summary of all relations in which that entity participates:
For instance, Apple’s RelationStar gathers everything it participates in — acquisitions like Beats and Shazam, its leadership by Tim Cook, its Cupertino headquarters, and the products it makes such as iPhone, Mac, and iPad — into a single summary.
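As a rough sketch of the aggregation, the function below groups canonical triplets by entity and renders one summary string each. The output format and the "<-" marker for incoming relations are assumptions made for illustration; the actual RelationStar layout is not shown in this article.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (subject, canonical_predicate, object)


def build_relation_stars(relations: List[Relation]) -> Dict[str, str]:
    """Return one summary string per entity covering every relation it appears in."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in relations:
        grouped[subject][predicate].append(obj)         # outgoing relation
        grouped[obj]["<-" + predicate].append(subject)  # incoming relation (assumed marker)

    stars: Dict[str, str] = {}
    for entity, by_predicate in grouped.items():
        parts = [f"{pred} ({', '.join(others)})" for pred, others in by_predicate.items()]
        stars[entity] = f"{entity}: " + "; ".join(parts)
    return stars
```

Each entity gets exactly one string regardless of how many edges it has, which is what makes the summary cheap to embed and retrieve as a single unit.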
RelationStars are vectorized in Qdrant. A semantic query "companies that own music labels" retrieves Apple's RelationStar because "ACQUIRED (Beats)" is semantically close.
This avoids expensive multi-hop graph traversal at query time: instead of MATCH (a)-[r]-(b) across thousands of edges, the retrieval system finds pre-materialized summaries in milliseconds.
Storage Schema
On the storage side, Neo4j holds the core node types — entities, triplets, canonical relations (each carrying its canonical predicate plus the raw phrases that mapped to it), and RelationStars — every node tagged by user and space. Qdrant holds parallel vector collections for entity names, triplet text, canonical predicates, and the RelationStar summaries and names, enabling semantic search at each level.
Performance Characteristics
| Operation | Concurrency | Cap | Typical Time |
|---|---|---|---|
| GLiNER entity extraction | Sequential per chunk | 100 chunks | 2-5 min/doc |
| LLM relation extraction | 10 concurrent | 50 chunks | 1-3 min/doc |
| Neo4j + Qdrant write | Batch | All entities | 30-60s |
| RelationStar generation | Sequential | Per entity | 5-20s total |
All operations run in Celery workers on the heavy_llm_queue. Documents are fully available for retrieval after extract_graph_task completes, before memify_graph_task runs.