Knowledge Graphs · Enterprise AI

Entity and Relation Extraction & Compression

A deep dive into Barnyard's two-phase pipeline: GLiNER for zero-shot NER and a single-pass LLM for relation triplets, followed by cross-document deduplication and pre-materialized RelationStar summaries.

Dawson Bauer

Overview

The Barnyard pipeline extracts structured knowledge from unstructured text in four stages:

  1. Entity Extraction: GLiNER (zero-shot NER) identifies named entities in document chunks with no task-specific training
  2. Relation Extraction: a single LLM call per chunk extracts (subject, predicate, object) triplets
  3. Compression: redundant entities and relations are deduplicated, normalized, and merged before Neo4j writes
  4. Pre-Materialization: one RelationStar summary per entity is generated and vectorized for retrieval

The result is a knowledge graph where each document contributes structured facts that can be queried semantically, traversed relationally, and combined with retrieval-augmented generation.


Phase 1: Entity Extraction with GLiNER

Why GLiNER

Traditional NER pipelines, such as the stock models shipped with spaCy or NLTK, are trained on a fixed set of entity types (PERSON, ORG, LOC, and similar). Domain-specific documents such as financial reports, legal contracts, and medical literature contain entity types that these models miss.

GLiNER (Generalist and Lightweight Named Entity Recognition) is a zero-shot NER model that accepts arbitrary entity type labels at inference time. No fine-tuning required.

Configured entity types:

Barnyard configures GLiNER with a broad set of entity types tuned for enterprise documents — people, organizations, locations, products, events, technologies, concepts, dates, monetary values, regulations, documents, roles, metrics, and industries.
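A minimal sketch of how such a label set might be handed to GLiNER. The exact label strings, the 0.5 threshold, the checkpoint name, and the helper names to_candidates and extract_entities are assumptions for illustration, not Barnyard's actual configuration.

```python
from typing import Dict, List

# Hypothetical label strings mirroring the entity types listed above.
ENTITY_LABELS: List[str] = [
    "person", "organization", "location", "product", "event",
    "technology", "concept", "date", "monetary value", "regulation",
    "document", "role", "metric", "industry",
]


def to_candidates(raw_entities: List[dict], min_score: float = 0.5) -> List[Dict]:
    """Keep only the {text, label, score} fields and drop low-confidence spans."""
    return [
        {"text": e["text"], "label": e["label"], "score": float(e["score"])}
        for e in raw_entities
        if e["score"] >= min_score
    ]


def extract_entities(chunk: str, model_name: str = "urchade/gliner_medium-v2.1") -> List[Dict]:
    """Run GLiNER on one chunk (requires `pip install gliner`)."""
    from gliner import GLiNER  # imported lazily because it is a heavy dependency

    # A long-running worker would load the model once and reuse it;
    # it is loaded here only to keep the sketch short.
    model = GLiNER.from_pretrained(model_name)
    raw = model.predict_entities(chunk, ENTITY_LABELS, threshold=0.5)
    return to_candidates(raw)
```

Because the labels are plain strings supplied at inference time, adding a new entity type in this sketch means editing the list, not retraining a model.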

Chunking Strategy

GLiNER operates on individual chunks (typically 500-1000 tokens). The ingestion flow is:

  1. The full ingestion pipeline runs as LangGraph nodes: load → split → extract_nlp → cognify
  2. The extract_nlp node calls GLiNER on each chunk and collects {text, label, score} dicts
  3. The chunks and entity candidates are passed to extract_graph_task via Celery

Cap: nlp_chunk_cap: 100. GLiNER runs on at most 100 chunks per document to bound memory and time.
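The capped, sequential extraction step could look roughly like the sketch below. It is written as a plain function, not a LangGraph node, and it assumes the cap keeps the first 100 chunks; the function signature and the chunk_index field are hypothetical.

```python
from typing import Callable, Dict, List

NLP_CHUNK_CAP = 100  # mirrors nlp_chunk_cap from the text


def extract_nlp(
    chunks: List[str],
    extractor: Callable[[str], List[Dict]],
    cap: int = NLP_CHUNK_CAP,
) -> List[Dict]:
    """Run the entity extractor sequentially over at most `cap` chunks."""
    results: List[Dict] = []
    # Assumption: the cap keeps the leading chunks and ignores the rest.
    for index, chunk in enumerate(chunks[:cap]):
        for entity in extractor(chunk):
            # Record which chunk each candidate came from.
            results.append({"chunk_index": index, **entity})
    return results
```

Passing the extractor in as a callable keeps the step testable without loading a model.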


Phase 2: Relation Extraction with LLM

1-Pass vs. Multi-Pass

Cognee's default cognify pipeline uses three or more LLM passes per chunk (extract entities, then relations, then canonicalize). Barnyard replaces this with a single LLM call per chunk that extracts entity-relation-entity triplets directly.

The model is given the text and asked to extract every subject–predicate–object relationship as JSON. For a sentence like “Tim Cook has served as CEO of Apple since 2011,” it returns triplets such as Tim Cook → leads → Apple, and Tim Cook → has served since → 2011.
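As an illustration of the single-pass step, the sketch below builds a prompt and parses the JSON reply into triplets. The prompt wording, the JSON key names, and both helper functions are assumptions; the source does not show Barnyard's actual prompt.

```python
import json
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical prompt; the real wording is not given in the article.
PROMPT_TEMPLATE = (
    "Extract every subject-predicate-object relationship from the text below.\n"
    'Respond with a JSON list of objects with keys "subject", "predicate", "object".\n\n'
    "Text:\n{chunk}"
)


def build_prompt(chunk: str) -> str:
    """Fill the chunk text into the extraction prompt."""
    return PROMPT_TEMPLATE.format(chunk=chunk)


def parse_triplets(llm_response: str) -> List[Triplet]:
    """Parse the model's JSON reply, skipping malformed items instead of failing."""
    try:
        items = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    triplets: List[Triplet] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        parts = [str(item.get(key) or "").strip() for key in ("subject", "predicate", "object")]
        if all(parts):  # keep only complete triplets
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets
```

Tolerant parsing matters here because one malformed reply should cost a single chunk's relations, not the whole document.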

Cap: relation_chunk_cap: 50. Only the first 50 chunks are processed for relations, with at most 10 LLM calls in flight at once (relation_concurrency: 10), bounded by an asyncio semaphore.
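The cap and the concurrency limit can be sketched with asyncio.Semaphore as follows. The function name and the call_llm parameter are hypothetical stand-ins for whatever client Barnyard uses.

```python
import asyncio
from typing import Awaitable, Callable, List

RELATION_CHUNK_CAP = 50    # mirrors relation_chunk_cap
RELATION_CONCURRENCY = 10  # mirrors relation_concurrency


async def extract_relations(
    chunks: List[str],
    call_llm: Callable[[str], Awaitable[str]],
    cap: int = RELATION_CHUNK_CAP,
    concurrency: int = RELATION_CONCURRENCY,
) -> List[str]:
    """Call the LLM once per chunk for the first `cap` chunks, `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_call(chunk: str) -> str:
        # At most `concurrency` coroutines are inside this block at once.
        async with semaphore:
            return await call_llm(chunk)

    # gather preserves input order, so replies line up with chunks[:cap].
    return list(await asyncio.gather(*(bounded_call(c) for c in chunks[:cap])))
```

All tasks are created up front, but the semaphore ensures only ten of them are awaiting the LLM at any moment.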

Predicate Canonicalization

Raw predicates are immediately canonicalized to prevent graph fragmentation. "leads", "is CEO of", "serves as CEO" all resolve to LEADS.
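One simple way to implement this is a synonym lookup, sketched below. Only the LEADS row comes from the text; the table structure and the UPPER_SNAKE_CASE fallback for unknown predicates are assumptions.

```python
import re

# Hypothetical synonym table; only the LEADS row is taken from the article.
PREDICATE_SYNONYMS = {
    "LEADS": {"leads", "is ceo of", "serves as ceo"},
}

# Invert the table so each raw phrase points at its canonical predicate.
_LOOKUP = {
    phrase: canonical
    for canonical, phrases in PREDICATE_SYNONYMS.items()
    for phrase in phrases
}


def canonicalize_predicate(raw: str) -> str:
    """Map a raw predicate to its canonical form, falling back to UPPER_SNAKE_CASE."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    if key in _LOOKUP:
        return _LOOKUP[key]
    # Assumed fallback: normalize unknown predicates instead of dropping them.
    return re.sub(r"[^a-z0-9]+", "_", key).strip("_").upper()
```

With this table, "leads", "is CEO of", and "serves as CEO" all map to LEADS, while an unlisted phrase such as "has served since" becomes HAS_SERVED_SINCE.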


Phase 3: Compression and Deduplication

Entity Deduplication

Within a single document (ingest-time):

Entity IDs are deterministic: MD5(name.lower() + ":" + label). As a result, "Apple" appearing in 20 chunks with the same label produces exactly one Entity node.
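The ID formula translates directly into a few lines of Python. The UTF-8 encoding and the hex-digest output format are assumptions, since the text gives only the formula.

```python
import hashlib


def entity_id(name: str, label: str) -> str:
    """Deterministic ID: MD5 of the lowercased name, a colon, and the label."""
    key = name.lower() + ":" + label
    # Assumption: UTF-8 bytes in, 32-character hex digest out.
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the label is part of the key, the same surface form under two different labels yields two distinct IDs, while case differences in the name collapse to one.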

Cross-chunk surface-form clustering via rapidfuzz:

Surface forms are clustered with fuzzy string matching — for example, “Apple Inc” and “Apple” within the same document score highly enough to be merged into a single canonical entity.

Cross-document (memify-time):

cross_document_entity_dedup() runs as the first step of memify_graph_task:

  1. Load all Entity nodes for the user's space from Neo4j
  2. Group by GLiNER label (only compare same-type entities)
  3. Pairwise fuzzy matching within groups
  4. Union-Find clustering: canonical = longest name
  5. Redirect HAS_ENTITY and RELATES_TO edges, and the entity references held by CanonicalRelation nodes, to the canonical entity
  6. DETACH DELETE alias entities
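Steps 2 through 4 can be sketched in memory as below. To stay dependency-free, the sketch uses difflib.SequenceMatcher from the standard library as a stand-in for rapidfuzz, and the 0.7 threshold is a hypothetical value. The Neo4j load, edge redirection, and alias deletion (steps 1, 5, and 6) are not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (name, label)


def similarity(a: str, b: str) -> float:
    """Stdlib stand-in for a rapidfuzz scorer; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find(parent: Dict[str, str], x: str) -> str:
    """Find the cluster root, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_aliases(entities: List[Entity], threshold: float = 0.7) -> Dict[Entity, str]:
    """Map each (name, label) to its canonical name: the longest name in its cluster."""
    # Group by label so only same-type entities are compared.
    by_label: Dict[str, List[str]] = defaultdict(list)
    for name, label in entities:
        by_label[label].append(name)

    canonical: Dict[Entity, str] = {}
    for label, names in by_label.items():
        parent = {name: name for name in names}
        # Pairwise fuzzy matching; union the two clusters on a match.
        for a, b in combinations(names, 2):
            if similarity(a, b) >= threshold:
                parent[find(parent, a)] = find(parent, b)
        clusters: Dict[str, List[str]] = defaultdict(list)
        for name in names:
            clusters[find(parent, name)].append(name)
        for members in clusters.values():
            longest = max(members, key=len)
            for name in members:
                canonical[(name, label)] = longest
    return canonical
```

Pairwise comparison is quadratic in the size of each label group, which is one reason to restrict comparisons to entities of the same type.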

Relation Deduplication

CanonicalRelation nodes are deduplicated on (canonical_predicate, from_entity_id, to_entity_id, user_id). If the same relation is extracted from two different chunks:

  • The first write creates the CanonicalRelation node
  • The second write adds the raw predicate to raw_predicates: List[str] on the existing node
  • No duplicate nodes are created
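The write behavior described above amounts to an upsert keyed on the four-field tuple. The sketch below models it with an in-memory dict in place of Neo4j; the class layout and function name are hypothetical, and skipping an already-recorded raw predicate is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (canonical_predicate, from_entity_id, to_entity_id, user_id)
RelationKey = Tuple[str, str, str, str]


@dataclass
class CanonicalRelation:
    canonical_predicate: str
    from_entity_id: str
    to_entity_id: str
    user_id: str
    raw_predicates: List[str] = field(default_factory=list)


def upsert_relation(
    store: Dict[RelationKey, CanonicalRelation],
    canonical_predicate: str,
    from_entity_id: str,
    to_entity_id: str,
    user_id: str,
    raw_predicate: str,
) -> CanonicalRelation:
    """Create the node on first write; afterwards only extend raw_predicates."""
    key = (canonical_predicate, from_entity_id, to_entity_id, user_id)
    node = store.get(key)
    if node is None:
        node = CanonicalRelation(canonical_predicate, from_entity_id, to_entity_id, user_id)
        store[key] = node
    # Assumption: a raw predicate already recorded is not appended again.
    if raw_predicate not in node.raw_predicates:
        node.raw_predicates.append(raw_predicate)
    return node
```

Keeping the raw phrases on the node preserves the original wording for inspection even though the graph stores a single canonical relation.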

Phase 4: RelationStar Pre-Materialization

After entity and relation nodes are stored, extract_graph_task generates one RelationStar per entity: a pre-aggregated summary of all relations in which that entity participates.

For instance, Apple’s RelationStar gathers everything it participates in — acquisitions like Beats and Shazam, its leadership by Tim Cook, its Cupertino headquarters, and the products it makes such as iPhone, Mac, and iPad — into a single summary.
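A rough sketch of how such summaries might be assembled from canonical triplets. The text format of the summary, the HEADQUARTERED_IN predicate name, and the function name are assumptions; the source does not specify how a RelationStar is serialized.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (subject, canonical_predicate, object)


def build_relation_stars(relations: List[Relation]) -> Dict[str, str]:
    """Return one text summary per entity covering every relation it takes part in."""
    outgoing: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    incoming: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in relations:
        outgoing[subject][predicate].append(obj)
        incoming[obj][predicate].append(subject)

    stars: Dict[str, str] = {}
    for entity in sorted(set(outgoing) | set(incoming)):
        # Relations where the entity is the subject, grouped by predicate.
        parts = [
            f"{pred} ({', '.join(targets)})"
            for pred, targets in sorted(outgoing[entity].items())
        ]
        # Relations where the entity is the object.
        parts += [
            f"{pred} by ({', '.join(sources)})"
            for pred, sources in sorted(incoming[entity].items())
        ]
        stars[entity] = f"{entity}: " + "; ".join(parts)
    return stars
```

Each summary string is what would then be embedded, so the grouping by predicate directly shapes what a semantic query can match.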

RelationStars are vectorized in Qdrant. A semantic query "companies that own music labels" retrieves Apple's RelationStar because "ACQUIRED (Beats)" is semantically close.

This avoids expensive multi-hop graph traversal at query time: instead of MATCH (a)-[r]-(b) across thousands of edges, the retrieval system finds pre-materialized summaries in milliseconds.


Storage Schema

On the storage side, Neo4j holds the core node types — entities, triplets, canonical relations (each carrying its canonical predicate plus the raw phrases that mapped to it), and RelationStars — every node tagged by user and space. Qdrant holds parallel vector collections for entity names, triplet text, canonical predicates, and the RelationStar summaries and names, enabling semantic search at each level.


Performance Characteristics

Operation                  Concurrency           Cap           Typical Time
GLiNER entity extraction   Sequential per chunk  100 chunks    2-5 min/doc
LLM relation extraction    10 concurrent         50 chunks     1-3 min/doc
Neo4j + Qdrant write       Batch                 All entities  30-60s
RelationStar generation    Sequential            Per entity    5-20s total

All operations run in Celery workers on the heavy_llm_queue. Documents are fully available for retrieval after extract_graph_task completes, before memify_graph_task runs.