
GlassBox: Auditable AI Evaluation Middleware

GlassBox measures contextual precision, semantic faithfulness, and structural fidelity for any RAG system — then commits a tamper-proof trust scorecard to a Hyperledger Fabric ledger.

Dawson Bauer

May 20, 2026

Overview

GlassBox is Anatypical’s standalone evaluation middleware for measuring the quality of any LLM answer, whether it comes from a RAG system, an agent, or a direct LLM query. It is not tied to any specific retrieval architecture: it accepts a well-defined input contract and produces a tamper-proof audit record committed to Hyperledger Fabric (HLF).

It answers three questions:

Question	Judge
Did the retrieval surface the right content?	Contextual Precision
Is the answer semantically grounded and on-topic?	Semantic Faithfulness
Does the answer preserve the terminology, entities, and relationships present in the source documents?	Structural Fidelity

Provider-Agnostic Input Contract

Any system submits an evaluation request to GlassBox via webhook or message queue. Only query, answer, and retrieved_passages are required.

Each request carries a session ID and optional model metadata (the provider and model name) alongside the query and the answer the system produced. It can also include the retrieval strategy used, any entities and chunk-level relevance scores the retriever surfaced, the passages that were retrieved, and the full source documents they came from. Everything other than the query, the answer, and the retrieved passages is optional.
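The contract described above might be modeled as follows in Python. This is a minimal sketch, not GlassBox's actual schema: only `query`, `answer`, `retrieved_passages`, and `retrieved_entities` are names that appear in this post. Every other field name, and the `validate` helper, is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvaluationRequest:
    # The three mandatory fields named in the contract.
    query: str
    answer: str
    retrieved_passages: List[str]
    # Optional fields. Apart from retrieved_entities, these names are
    # illustrative assumptions, not GlassBox's documented schema.
    session_id: Optional[str] = None
    provider: Optional[str] = None
    model: Optional[str] = None
    retrieval_strategy: Optional[str] = None
    retrieved_entities: List[str] = field(default_factory=list)
    chunk_scores: List[float] = field(default_factory=list)
    source_documents: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if any mandatory field is empty."""
        missing = [
            name
            for name in ("query", "answer", "retrieved_passages")
            if not getattr(self, name)
        ]
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
```

Validating at the boundary like this would let a caller reject an incomplete payload before it ever reaches a judge.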

System Architecture

Any system (a RAG pipeline, an agent, or a direct LLM call) generates an answer, then hands GlassBox the payload over a webhook or message queue. Because evaluation runs asynchronously, it adds no latency to the user-facing response. The GlassBox worker then runs three judges: Contextual Precision (via DeepEval), Semantic Faithfulness (via Ragas), and Structural Fidelity, which combines NLP heuristics, GLiNER entity coverage, and GLiREL relation fidelity. Their scores are aggregated into a single trust score between 0 and 1, which is committed to the Hyperledger Fabric ledger, where chaincode enforces the schema and rejects any incomplete transaction.
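To illustrate why an asynchronous hand-off keeps evaluation off the response path, here is a minimal sketch that uses an in-process queue and a background thread as a stand-in for the webhook or message queue. Everything here is an assumption for illustration: `evaluate` is a hypothetical placeholder for the three judges, and `answer_user` and `worker` are not GlassBox functions.

```python
import queue
import threading

eval_queue = queue.Queue()
results = []


def evaluate(payload: dict) -> dict:
    # Hypothetical placeholder for the three judges.
    return {"query": payload["query"], "status": "evaluated"}


def worker() -> None:
    # Process payloads until a None sentinel arrives.
    while True:
        payload = eval_queue.get()
        if payload is None:
            break
        results.append(evaluate(payload))


def answer_user(query: str) -> str:
    answer = f"answer to: {query}"  # stands in for the RAG system's output
    # Enqueue and return immediately; the caller never waits on the judges.
    eval_queue.put(
        {"query": query, "answer": answer, "retrieved_passages": ["passage text"]}
    )
    return answer
```

In this sketch `answer_user` returns as soon as the payload is enqueued, and the worker thread scores it afterwards. A production deployment would replace the in-process queue with a real broker or webhook endpoint.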

Judge 1 — Contextual Precision

Question: Did the retriever surface the right content?

1a — LLM-Based Relevance (DeepEval)

DeepEval's contextual precision metric asks an LLM to judge whether each retrieved passage is relevant to producing a correct answer, and rewards retrievers that rank relevant passages above irrelevant ones. Target: > 0.85.

1b — Semantic Similarity Distribution

This sub-judge independently checks retriever calibration by computing embedding-based query-passage similarity and comparing it to the scores the retriever reported. If retrieved_entities is present, entity hit rate is also measured.

Judge 1 Score: 0.70 × deepeval_precision + 0.30 × score_1b
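The post does not say how score_1b is derived from the similarity comparison, so the sketch below makes an explicit assumption: score_1b is one minus the mean absolute gap between the retriever's reported scores and the recomputed cosine similarities. The function names and the toy vectors in the usage note are hypothetical; only the 0.70 / 0.30 weights come from the formula above.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def calibration_score(query_vec, passage_vecs, reported_scores) -> float:
    """Assumed definition of score_1b: 1 - mean |reported - recomputed|."""
    gaps = [
        abs(reported - cosine(query_vec, vec))
        for vec, reported in zip(passage_vecs, reported_scores)
    ]
    if not gaps:
        return 0.0
    return max(0.0, 1.0 - sum(gaps) / len(gaps))


def judge1_score(deepeval_precision: float, score_1b: float) -> float:
    # Weights taken from the Judge 1 formula.
    return 0.70 * deepeval_precision + 0.30 * score_1b
```

For example, with a query vector `[1.0, 0.0]`, passage vectors `[[1.0, 0.0], [0.0, 1.0]]`, and reported scores `[1.0, 0.0]`, the reported scores match the recomputed similarities exactly, so the calibration score is at its maximum.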

Judge 2 — Semantic Faithfulness

Question: Is the answer grounded in the retrieved context and on-topic for the query?

Judge 2 runs on the Ragas framework, which scores the answer on two metrics: faithfulness and answer relevancy.

Faithfulness (target > 0.95): Ragas decomposes the answer into atomic claims and checks each against the retrieved context. Claims not supported by any passage count as hallucinations.

Answer Relevancy (target > 0.80): Ragas generates synthetic questions from the answer and measures their alignment with the original query. Low scores indicate "agent drift," where the answer wanders away from what was asked.

Judge 2 Score: 0.60 × faithfulness + 0.40 × answer_relevancy
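As a rough illustration of the arithmetic, the sketch below computes faithfulness as the fraction of supported claims and then applies the 0.60 / 0.40 blend. It assumes the per-claim verdicts have already been produced (in GlassBox that work is done by Ragas with an LLM); the function names are hypothetical and this does not call the Ragas API.

```python
def faithfulness(claim_supported) -> float:
    """Fraction of atomic claims supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(1 for ok in claim_supported if ok) / len(claim_supported)


def judge2_score(faithfulness_score: float, answer_relevancy: float) -> float:
    # Weights taken from the Judge 2 formula.
    return 0.60 * faithfulness_score + 0.40 * answer_relevancy
```

An answer with four claims, one of them unsupported, scores 0.75 on faithfulness, well under the 0.95 target, which shows how strict that threshold is: a single hallucinated claim in a short answer is enough to miss it.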

Judge 3 — Structural Fidelity

Question: Does the answer preserve the terminology, entities, and relationships present in the source documents?

3a — Terminology Fidelity (NLP Heuristics)

GlassBox computes BLEU-1/2/3/4 and ROUGE-1/2/L, each with precision, recall, and F1, between the answer and the source documents.

BLEU-1 measures word-level vocabulary overlap. BLEU-4 measures preservation of 4-gram phrases.
High recall with low precision indicates selective copying with unsupported additions.

NLP Score: H-mean(BLEU-4, ROUGE-L F1)
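The sketch below shows one way the harmonic-mean step could be computed. It implements ROUGE-L F1 from a longest common subsequence over whitespace tokens, which is a simplification (real tokenization and multi-document handling are not specified in this post), and it takes BLEU-4 as an input rather than computing it. All function names are hypothetical.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(answer: str, source: str) -> float:
    # Assumption: lowercase whitespace tokenization.
    ans, src = answer.lower().split(), source.lower().split()
    lcs = lcs_length(ans, src)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ans), lcs / len(src)
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(x: float, y: float) -> float:
    return 2 * x * y / (x + y) if (x + y) > 0 else 0.0


def nlp_score(bleu_4: float, rouge_l: float) -> float:
    return harmonic_mean(bleu_4, rouge_l)
```

The harmonic mean punishes imbalance: if either BLEU-4 or ROUGE-L F1 is zero, the combined score is zero regardless of the other.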

3b — Entity Fidelity (GLiNER)

GLiNER extracts entities from both source documents and the answer. Any entity in the answer not present in the source is flagged as a hallucination candidate.

Entity Precision (target > 0.90): Fraction of answer entities that appear in source docs.
Novel Entities: Specific strings flagged individually for human review.
Source Coverage: Fraction of source entities the answer mentions.
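The three entity metrics above reduce to set operations once the entities have been extracted. The sketch below assumes the extraction has already happened (it does not call GLiNER) and that entities are matched by case-insensitive exact string comparison, which is an assumption; the function name and the handling of empty inputs are also illustrative choices.

```python
def entity_fidelity(answer_entities, source_entities) -> dict:
    # Assumption: entities match on case-insensitive exact strings.
    ans = {e.strip().lower() for e in answer_entities}
    src = {e.strip().lower() for e in source_entities}
    shared = ans & src
    return {
        # Fraction of answer entities that appear in the source docs.
        "entity_precision": len(shared) / len(ans) if ans else 1.0,
        # Answer entities absent from the source, listed for human review.
        "novel_entities": sorted(ans - src),
        # Fraction of source entities the answer mentions.
        "source_coverage": len(shared) / len(src) if src else 0.0,
    }
```

With answer entities `["Hyperledger Fabric", "Neo4j", "Kafka"]` and source entities `["Hyperledger Fabric", "Neo4j", "GLiNER", "Ragas"]`, the entity `kafka` is flagged as novel, entity precision is 2/3, and source coverage is 0.5.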

3c — Relation Fidelity (GLiREL)

GLiREL extracts (subject, predicate, object) triplets from both the source documents and the answer. Any relation asserted in the answer with no matching entity pair in the source is flagged.

Relation Faithfulness (target > 0.90): Fraction of answer (subject, object) pairs that appear in source documents.
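Relation faithfulness as defined above compares (subject, object) pairs and ignores the predicate. A minimal sketch, assuming the triplets have already been extracted (no GLiREL call is made) and that pairs are ordered and matched case-insensitively; both are assumptions, as is the function name.

```python
def relation_faithfulness(answer_triplets, source_triplets):
    """Return (score, flagged_pairs) for (subject, predicate, object) triplets."""
    # Only the (subject, object) pair is compared; the predicate is dropped.
    src_pairs = {(s.lower(), o.lower()) for s, _pred, o in source_triplets}
    ans_pairs = {(s.lower(), o.lower()) for s, _pred, o in answer_triplets}
    flagged = sorted(ans_pairs - src_pairs)
    score = len(ans_pairs & src_pairs) / len(ans_pairs) if ans_pairs else 1.0
    return score, flagged
```

Because only the entity pair is checked, an answer that connects the right two entities with the wrong predicate would still pass this check.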

Judge 3 Score: Weighted combination of 3a, 3b, 3c (weight-normalized automatically when sub-components are disabled).
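Weight normalization can be sketched as dividing by the sum of the weights that remain active. The code below is an illustration under assumptions: the component keys and the convention of passing `None` for a disabled sub-component are hypothetical, while the 0.30 / 0.35 / 0.35 weights are the ones this post gives for Structural Fidelity.

```python
# Sub-component weights as stated in this post; the key names are hypothetical.
JUDGE3_WEIGHTS = {"nlp": 0.30, "entity": 0.35, "relation": 0.35}


def normalized_weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean over components whose score is not None (i.e. enabled)."""
    active = {name: value for name, value in scores.items() if value is not None}
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0:
        raise ValueError("no enabled components to score")
    return sum(weights[name] * value for name, value in active.items()) / total_weight
```

With all three components enabled the divisor is 1.0 and the result is the plain weighted sum. If relation fidelity is disabled, the remaining weights 0.30 and 0.35 are rescaled to sum to one, so the score stays on the same 0 to 1 scale.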

Composite Trust Score

The final trust score is a weighted blend of the three judges: Contextual Precision contributes 30%, Semantic Faithfulness 40%, and Structural Fidelity the remaining 30%. Structural Fidelity is itself a weighted mix — 30% from the NLP heuristics (the harmonic mean of BLEU-4 and ROUGE-L), 35% from GLiNER entity precision, and 35% from GLiREL relation faithfulness.
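Written out, the blend described above is as follows. The symbols are notation introduced here for illustration: T is the trust score, J_1 to J_3 are the judge scores, HM is the harmonic mean, P_ent is GLiNER entity precision, and F_rel is GLiREL relation faithfulness. The numbers in the last line are made-up inputs used only to show the arithmetic.

```latex
\begin{align*}
T   &= 0.30\,J_1 + 0.40\,J_2 + 0.30\,J_3 \\
J_3 &= 0.30\,\mathrm{HM}(\text{BLEU-4},\ \text{ROUGE-L}_{F1})
       + 0.35\,P_{\text{ent}} + 0.35\,F_{\text{rel}} \\[4pt]
\text{e.g. } J_1 = 0.90,\ J_2 = 0.95,\ J_3 = 0.80:\quad
T   &= 0.27 + 0.38 + 0.24 = 0.89
\end{align*}
```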

Trust Scorecard: HLF Ledger Schema

Every evaluation is committed to Hyperledger Fabric as an immutable record. Each record's status is determined by the following conditions:

Status	Condition
`VERIFIED_SUCCESS`	`trust_score ≥ trust_score_threshold` AND all per-judge thresholds met
`TRUST_WARNING`	`trust_score < trust_score_threshold`
`FAITHFULNESS_WARNING`	Ragas faithfulness or GLiREL relation_faithfulness below threshold
`PRECISION_WARNING`	DeepEval context_precision or GLiNER entity_precision below threshold
`RELEVANCY_WARNING`	Ragas answer_relevancy below threshold
`UNVERIFIED`	One or more enabled judges failed to produce a score
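The table above can be read as a small decision procedure. The post does not say which status wins when several warning conditions hold at once, so this sketch assumes all applicable warnings are returned together and that `UNVERIFIED` short-circuits everything else. The function name, the dictionary layout, and the use of `None` for a missing score are hypothetical.

```python
def scorecard_statuses(scores: dict, thresholds: dict) -> list:
    # Any enabled judge that produced no score makes the record UNVERIFIED.
    if any(value is None for value in scores.values()):
        return ["UNVERIFIED"]

    def below(name: str) -> bool:
        return scores[name] < thresholds[name]

    statuses = []
    if scores["trust_score"] < thresholds["trust_score_threshold"]:
        statuses.append("TRUST_WARNING")
    if below("faithfulness") or below("relation_faithfulness"):
        statuses.append("FAITHFULNESS_WARNING")
    if below("context_precision") or below("entity_precision"):
        statuses.append("PRECISION_WARNING")
    if below("answer_relevancy"):
        statuses.append("RELEVANCY_WARNING")
    # No warnings means every threshold was met.
    return statuses or ["VERIFIED_SUCCESS"]
```

Returning a list keeps every failed check visible on the record; a deployment that needs a single status would have to pick a precedence order, which this post does not define.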

Configuration

GlassBox is fully configurable. Each judge can be enabled or disabled independently and given its own weight and pass/fail thresholds — DeepEval's precision threshold, Ragas's faithfulness and relevancy thresholds, and GLiNER and GLiREL's entity-precision and relation-faithfulness thresholds. A global trust-score threshold sets the overall pass bar, and the Hyperledger Fabric settings specify the channel and chaincode used to commit each scorecard.
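A configuration along these lines might look like the sketch below. The key names, the nesting, the channel and chaincode placeholders, and the trust-score threshold value are all hypothetical; the judge weights and per-metric thresholds reuse the weights and targets quoted earlier in this post. The `check_config` helper is likewise an illustrative sanity check, not part of GlassBox.

```python
GLASSBOX_CONFIG = {
    "trust_score_threshold": 0.85,  # hypothetical value
    "judges": {
        "contextual_precision": {
            "enabled": True,
            "weight": 0.30,
            "thresholds": {"context_precision": 0.85},
        },
        "semantic_faithfulness": {
            "enabled": True,
            "weight": 0.40,
            "thresholds": {"faithfulness": 0.95, "answer_relevancy": 0.80},
        },
        "structural_fidelity": {
            "enabled": True,
            "weight": 0.30,
            "thresholds": {"entity_precision": 0.90, "relation_faithfulness": 0.90},
        },
    },
    # Placeholder names; real channel and chaincode IDs are deployment-specific.
    "hlf": {"channel": "example-channel", "chaincode": "example-chaincode"},
}


def check_config(config: dict) -> None:
    """Raise ValueError on out-of-range thresholds, bad weights, or no judges."""
    if not 0.0 <= config["trust_score_threshold"] <= 1.0:
        raise ValueError("trust_score_threshold must be within [0, 1]")
    enabled = [j for j in config["judges"].values() if j["enabled"]]
    if not enabled:
        raise ValueError("at least one judge must be enabled")
    for judge in enabled:
        if judge["weight"] < 0:
            raise ValueError("judge weights must be non-negative")
        for name, value in judge["thresholds"].items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"threshold {name} must be within [0, 1]")
```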

GlassBox is dispatched asynchronously after answer generation, so evaluation adds no latency to the user-facing response.
