Topic Clusters: Semantic Enrichment and Retrieval Anchoring

Barnyard's Memify phase groups related TextNodes into LLM-generated TopicClusters — semantic themes that improve retrieval recall and power the knowledge graph canvas view.

Dawson Bauer

Overview

After entities and relations are extracted from documents, Barnyard runs a second enrichment phase called Memify that groups related TextNodes into TopicClusters — semantic themes derived from the knowledge graph structure.

TopicClusters serve three purposes:

  1. Retrieval anchoring — Qdrant vectors of cluster summaries are semantically richer than raw document text, improving recall for thematic queries
  2. Canvas visualization — Clusters form the visual groupings in the knowledge graph canvas
  3. User-scoped retrieval — Clusters carry user_id/space_ids so user-scoped chunk retrieval works without exposing other users' data

Memify Pipeline

memify_graph_task runs after extract_graph_task completes (triggered by a Celery callback) and processes up to memify_chunk_cap TextNodes per call (100 by default).
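The hand-off and the cap can be sketched in plain Python. This is a minimal illustration, not Barnyard's task code: select_batch and run_extract_then_memify are hypothetical names, and the two callables stand in for the real tasks. In Celery itself, this kind of chaining is usually expressed with a link callback or a chain.

```python
from typing import Callable, List

# Mirrors memify_chunk_cap; the default of 100 comes from the text above.
MEMIFY_CHUNK_CAP = 100


def select_batch(pending_ids: List[str], cap: int = MEMIFY_CHUNK_CAP) -> List[str]:
    """Return at most `cap` TextNode ids for a single Memify call."""
    return pending_ids[:cap]


def run_extract_then_memify(
    space_id: str,
    extract: Callable[[str], List[str]],
    memify: Callable[[str, List[str]], int],
) -> int:
    """Run extraction, then pass a capped batch to Memify, as a success callback would."""
    text_node_ids = extract(space_id)  # hypothetical: returns ids of new TextNodes
    return memify(space_id, select_batch(text_node_ids))
```

Any TextNodes beyond the cap are simply left for a later call in this sketch; how Barnyard schedules the remainder is not covered here.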

Step 1: Cross-Document Entity Deduplication

Before generating clusters, entity surface forms are normalized across all documents in the space. This ensures "Apple" and "Apple Inc." from two different documents are merged into one Entity node before topic analysis.
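One simple way to picture surface-form normalization is a canonical key that ignores case, spacing, and a trailing corporate suffix. The sketch below assumes that approach; the suffix list and the function names are illustrative and are not taken from Barnyard.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical suffix list; a real deployment would tune or replace it.
_CORPORATE_SUFFIX = re.compile(
    r"[\s,]+(inc|incorporated|corp|corporation|ltd|llc|co)\.?$", re.IGNORECASE
)


def canonical_key(surface_form: str) -> str:
    """Collapse whitespace, drop one trailing corporate suffix, and lowercase."""
    text = " ".join(surface_form.split())
    text = _CORPORATE_SUFFIX.sub("", text)
    return text.lower()


def group_surface_forms(surface_forms: List[str]) -> Dict[str, List[str]]:
    """Group surface forms that share a canonical key (candidates for merging)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form in surface_forms:
        groups[canonical_key(form)].append(form)
    return dict(groups)
```

With this key, "Apple" and "Apple Inc." fall into the same group, so they can be merged into a single Entity node before topic analysis.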

Step 2: TopicCluster Generation (LLM)

For each TextNode (up to the cap), the LLM is prompted as a knowledge graph analyst: given the entities and relationships extracted from the document, it produces a concise topic label, a two-to-three sentence summary of what the document is about, and a list of its most important entities. Together these form the TopicCluster.
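The prompt and its output might be modeled as below. This is a sketch under assumptions: the field names (topic, summary, key_entities), the JSON output format, and the prompt wording are illustrative, and the LLM call itself is left out.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TopicCluster:
    topic: str               # concise topic label
    summary: str             # two-to-three sentence summary
    key_entities: List[str]  # most important entities


def build_prompt(entities: List[str], relations: List[Tuple[str, str, str]]) -> str:
    """Assemble an analyst-style prompt from extracted entities and relations."""
    relation_lines = "\n".join(f"- {s} {p} {o}" for s, p, o in relations)
    return (
        "You are a knowledge graph analyst.\n"
        f"Entities: {', '.join(entities)}\n"
        f"Relationships:\n{relation_lines}\n"
        "Return JSON with keys: topic (concise label), summary (2-3 sentences), "
        "key_entities (list of the most important entities)."
    )


def parse_cluster(raw: str) -> TopicCluster:
    """Parse the model's JSON reply; raises if it is not valid JSON or lacks keys."""
    data = json.loads(raw)
    return TopicCluster(
        topic=str(data["topic"]).strip(),
        summary=str(data["summary"]).strip(),
        key_entities=[str(e) for e in data.get("key_entities", [])],
    )
```

Parsing into a typed object keeps malformed replies from reaching storage: a missing topic or summary fails loudly instead of producing an empty cluster.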

Step 3: Neo4j and Qdrant Storage

Each TopicCluster is stored as a Neo4j node carrying its topic label, summary, and key entities, plus a reference to the TextNode that generated it and the owning user and spaces (user_id/space_ids). A HAS_TOPIC_CLUSTER edge links the TextNode to its cluster. The cluster summary is also embedded and indexed in Qdrant, which is what the retrieval path searches.
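As an illustration of the Neo4j side, the write could be a single idempotent Cypher statement. The node labels, the HAS_TOPIC_CLUSTER edge, and user_id/space_ids come from the text; the remaining property names (id, text_node_id, topic, summary, key_entities) are assumptions, and the driver call is omitted.

```python
from typing import Any, Dict, List

# MERGE keeps the write idempotent: re-running Memify for the same TextNode
# updates the existing cluster instead of creating a duplicate.
UPSERT_CLUSTER = """
MATCH (t:TextNode {id: $text_node_id})
MERGE (c:TopicCluster {text_node_id: $text_node_id})
SET c.topic = $topic,
    c.summary = $summary,
    c.key_entities = $key_entities,
    c.user_id = $user_id,
    c.space_ids = $space_ids
MERGE (t)-[:HAS_TOPIC_CLUSTER]->(c)
"""


def cluster_params(
    text_node_id: str,
    topic: str,
    summary: str,
    key_entities: List[str],
    user_id: str,
    space_ids: List[str],
) -> Dict[str, Any]:
    """Build the parameter map for UPSERT_CLUSTER (hypothetical property names)."""
    return {
        "text_node_id": text_node_id,
        "topic": topic,
        "summary": summary,
        "key_entities": list(key_entities),
        "user_id": user_id,
        "space_ids": list(space_ids),
    }
```

Passing values as parameters rather than formatting them into the query string avoids injection issues and lets Neo4j reuse the query plan.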


Retrieval: Chunk Retrieval via TopicCluster

The chunk retrieval path (retrieve_chunks_node) uses TopicClusters as semantic anchors. A query like “What were Apple’s Q3 revenues?” first runs a vector search against the cluster summaries, matching, say, an “Apple Financial Performance Q3 2024” cluster. Results are filtered to the requesting user and their spaces, then expanded in Neo4j by following the HAS_TOPIC_CLUSTER edge back to the full TextNode. A final pass applies MMR deduplication and parent-document expansion.
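For readers unfamiliar with MMR (maximal marginal relevance), the deduplication step can be illustrated with a small greedy implementation over plain Python lists. This is the textbook formulation, not necessarily the variant Barnyard uses; the lambda_mult parameter name is an assumption.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick up to k candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult=1.0 the function reduces to plain relevance ranking; lower values push near-duplicate results down in favor of results that add new information.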

This two-stage approach (vector search on dense summaries → Neo4j expansion to full text) delivers:

  • Semantic matching without the noise of raw chunk text
  • User isolation — Qdrant post-filter by user_id prevents cross-user data leakage
  • Full document context — Neo4j expansion returns the complete TextNode.text, not just the matched fragment
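The two stages can be sketched with in-memory stand-ins for Qdrant and Neo4j. Everything here is illustrative: ClusterHit, retrieve_texts, and the dictionary playing the role of the graph are hypothetical, and a real deployment would issue a Qdrant search and a Cypher traversal instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass
class ClusterHit:
    cluster_id: str
    text_node_id: str  # stands in for the HAS_TOPIC_CLUSTER edge
    user_id: str
    space_ids: Set[str]
    summary_vector: List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_texts(
    query_vector: Sequence[float],
    clusters: List[ClusterHit],
    text_by_node_id: Dict[str, str],
    user_id: str,
    space_ids: Set[str],
    top_k: int = 3,
) -> List[str]:
    """Stage 1: rank cluster summaries. Post-filter by owner. Stage 2: expand to full text."""
    ranked = sorted(
        clusters, key=lambda c: cosine(query_vector, c.summary_vector), reverse=True
    )
    # Post-filter: keep only clusters owned by the user in one of their spaces.
    allowed = [c for c in ranked if c.user_id == user_id and c.space_ids & space_ids]
    # Expansion: follow the cluster back to the complete TextNode text.
    return [text_by_node_id[c.text_node_id] for c in allowed[:top_k]]
```

Because the filter runs before expansion, text belonging to another user is never loaded, even when that user's cluster is the closest match to the query.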

Canvas Visualization

TopicClusters drive the main canvas view in the frontend. The canvas endpoint returns the space's TextNodes and clusters: each TextNode with its topic and entity names, and each cluster with its topic and entities. The frontend renders the TextNodes as cards grouped by their TopicCluster. Clicking "Open in Nested Canvas" drills into the selected nodes to show their entities, relations, and clusters in detail.
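The grouping the canvas performs can be pictured as follows. The payload shape is an assumption based on the description above (the field names id, topic, and entity_names are hypothetical), and the real frontend presumably does this in its own language.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class CanvasTextNode(TypedDict):
    id: str
    topic: str               # topic of the cluster the node belongs to
    entity_names: List[str]


def group_cards_by_topic(text_nodes: List[CanvasTextNode]) -> Dict[str, List[str]]:
    """Map each topic to the ids of the TextNode cards shown under it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node in text_nodes:
        groups[node["topic"]].append(node["id"])
    return dict(groups)
```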


Reingest: Rebuilding Clusters

The reingest_all_task rebuilds all derived graph structures from existing TextNodes and entities. It is intended for use after schema migrations, environment imports, or LLM model changes.

  • fast_rebuild=True (default): only the aggregation structures are rebuilt
  • fast_rebuild=False: GLiNER + LLM re-extraction runs on every TextNode's text (slower, but fully refreshed)
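The effect of the flag can be summarized as a small planning function. This is only a sketch: plan_reingest is a hypothetical name, and it assumes that the full mode re-extracts first and then rebuilds the same aggregation structures as the fast mode.

```python
from typing import List


def plan_reingest(fast_rebuild: bool = True) -> List[str]:
    """List the phases a reingest would run for the given mode (illustrative only)."""
    steps: List[str] = []
    if not fast_rebuild:
        # Full mode: re-run extraction over every TextNode's text first.
        steps.append("re-extract entities and relations (GLiNER + LLM)")
    # Assumed to run in both modes.
    steps.append("rebuild aggregation structures")
    return steps
```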

Configuration

A single configuration cap, memify_chunk_cap, limits how many TextNodes each Memify run processes (100 by default).

Memify is triggered automatically after each extract_graph_task. For large document batches, it can also be triggered manually with a request that specifies the space to rebuild.