
GlassBox: Auditable AI Evaluation Middleware

GlassBox measures contextual precision, semantic faithfulness, and structural fidelity for any RAG system — then commits a tamper-proof trust scorecard to a Hyperledger Fabric ledger.

Dawson Bauer

Overview

GlassBox is Anatypical’s standalone evaluation middleware. It measures the quality of any LLM answer, whether that answer comes from a RAG system, an agent, or a direct LLM query. GlassBox is not tied to any specific retrieval architecture: it accepts a well-defined input contract and produces a tamper-proof audit record committed to Hyperledger Fabric (HLF).

It answers three questions:

Question | Judge
Did the retrieval surface the right content? | Contextual Precision
Is the answer semantically grounded and on-topic? | Semantic Faithfulness
Does the answer preserve the terminology, entities, and relationships present in the source documents? | Structural Fidelity

Provider-Agnostic Input Contract

Any system submits an evaluation request to GlassBox via webhook or message queue. Only three fields are required: query, answer, and retrieved_passages (the passages used to produce the answer).

Everything else is optional context. A request carries a session ID and can include model metadata (the provider and model name), the retrieval strategy used, any entities and chunk-level relevance scores the retriever surfaced, and the full source documents the passages came from.
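As a rough illustration of this contract, the sketch below models a request as a Python dataclass. Only query, answer, retrieved_passages, and retrieved_entities are names taken from this article; every other field name, and the validation behaviour, is an assumption made for the example rather than GlassBox's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationRequest:
    # The three mandatory fields named in the contract.
    query: str
    answer: str
    retrieved_passages: list[str]
    # Optional context. Field names below are hypothetical,
    # except retrieved_entities, which the article mentions.
    session_id: Optional[str] = None
    provider: Optional[str] = None
    model: Optional[str] = None
    retrieval_strategy: Optional[str] = None
    retrieved_entities: list[str] = field(default_factory=list)
    chunk_scores: list[float] = field(default_factory=list)
    source_documents: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject requests that omit any of the three required fields.
        if not self.query.strip():
            raise ValueError("query is required")
        if not self.answer.strip():
            raise ValueError("answer is required")
        if not self.retrieved_passages:
            raise ValueError("retrieved_passages is required")


# A minimal valid request: only the mandatory fields are supplied.
request = EvaluationRequest(
    query="What is the notice period?",
    answer="The notice period is 30 days.",
    retrieved_passages=["Either party may terminate with 30 days' notice."],
)
```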


System Architecture

Any system (a RAG pipeline, an agent, or a direct LLM call) generates an answer, then hands GlassBox the payload over a webhook or message queue. Because evaluation runs asynchronously, off the request path, it does not delay the user-facing response. The GlassBox worker then runs three judges: Contextual Precision (via DeepEval), Semantic Faithfulness (via Ragas), and Structural Fidelity, which is itself a combination of NLP heuristics, GLiNER entity coverage, and GLiREL relation fidelity. Their scores are aggregated into a single trust score between 0 and 1. That score is committed to the Hyperledger Fabric ledger, where chaincode enforces the schema and rejects any incomplete transaction.
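The asynchronous hand-off can be pictured with a small in-process sketch. It uses Python's standard queue and a background thread as a stand-in for a real webhook or message broker. The function names (answer_user, run_judges) and the stub scorecard are hypothetical and exist only to show that the answer is returned before evaluation runs.

```python
import queue
import threading

evaluation_queue = queue.Queue()
scorecards: list[dict] = []


def run_judges(payload: dict) -> dict:
    # Placeholder for the three judges; returns a stub scorecard.
    return {"session_id": payload.get("session_id"), "trust_score": None}


def worker() -> None:
    # Background consumer: evaluates payloads as they arrive.
    while True:
        payload = evaluation_queue.get()
        if payload is None:  # sentinel value: stop the worker
            evaluation_queue.task_done()
            break
        scorecards.append(run_judges(payload))
        evaluation_queue.task_done()


def answer_user(query: str) -> str:
    answer = f"(answer to: {query})"  # stand-in for the real RAG or LLM call
    # Enqueue and return immediately; evaluation happens off the request path.
    evaluation_queue.put({
        "session_id": "s-1",
        "query": query,
        "answer": answer,
        "retrieved_passages": ["..."],
    })
    return answer


thread = threading.Thread(target=worker, daemon=True)
thread.start()
reply = answer_user("What is the notice period?")
evaluation_queue.join()     # demo only: wait until the payload is evaluated
evaluation_queue.put(None)  # ask the worker to stop
thread.join()
```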


Judge 1 — Contextual Precision

Question: Did the retriever surface the right content?

1a — LLM-Based Relevance (DeepEval)

DeepEval's ContextualPrecision asks an LLM to evaluate whether each retrieved passage contributed to a correct answer. Target: > 0.85.

1b — Semantic Similarity Distribution

Independently checks retriever calibration by computing embedding-based query-passage similarity and comparing it to the scores the retriever reported. If retrieved_entities is present, entity hit rate is also measured.

Judge 1 Score: 0.70 × deepeval_precision + 0.30 × score_1b
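The weighting above is simple arithmetic, sketched below together with one possible reading of the 1b calibration check. The article does not define how score_1b is computed, so calibration_score is an assumption: it takes one minus the mean absolute gap between the recomputed cosine similarity and the score the retriever reported. The toy two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def calibration_score(query_vec, passage_vecs, reported_scores) -> float:
    # ASSUMED definition of score_1b: 1 minus the mean absolute gap between
    # recomputed query-passage similarity and the retriever's reported score.
    gaps = [
        abs(cosine(query_vec, p) - r)
        for p, r in zip(passage_vecs, reported_scores)
    ]
    if not gaps:
        return 0.0
    return max(0.0, 1.0 - sum(gaps) / len(gaps))


def judge1_score(deepeval_precision: float, score_1b: float) -> float:
    # Weights as stated in the article: 0.70 and 0.30.
    return 0.70 * deepeval_precision + 0.30 * score_1b


query_vec = [1.0, 0.0]
passage_vecs = [[1.0, 0.0], [0.0, 1.0]]
reported = [0.9, 0.2]  # scores the retriever claimed for each passage
score_1b = calibration_score(query_vec, passage_vecs, reported)
judge1 = judge1_score(deepeval_precision=0.9, score_1b=score_1b)
```

With these toy inputs the recomputed similarities are 1.0 and 0.0, so the gaps are 0.1 and 0.2, score_1b is about 0.85, and the Judge 1 score is about 0.885.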


Judge 2 — Semantic Faithfulness

Question: Is the answer grounded in the retrieved context?

Judge 2 runs on the Ragas framework, which scores the answer on two metrics: faithfulness and answer relevancy.

Faithfulness (target > 0.95): Ragas decomposes the answer into atomic claims and checks each against the retrieved context. Claims not supported by any passage count as hallucinations.

Answer Relevancy (target > 0.80): Ragas generates synthetic questions from the answer and measures how closely they align with the original query. A low score indicates "agent drift": the answer has wandered away from what was asked.

Judge 2 Score: 0.60 × faithfulness + 0.40 × answer_relevancy
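A minimal sketch of the arithmetic follows. It assumes the claim-level verdicts have already been produced (in GlassBox that work is done by Ragas) and shows only how the fraction of supported claims and the weighted score are derived. The function names are illustrative.

```python
def faithfulness_from_verdicts(claim_supported: list[bool]) -> float:
    # Fraction of atomic claims supported by at least one retrieved passage.
    # Returning 0.0 for an answer with no claims is an assumption.
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def judge2_score(faithfulness: float, answer_relevancy: float) -> float:
    # Weights as stated in the article: 0.60 and 0.40.
    return 0.60 * faithfulness + 0.40 * answer_relevancy


# Four atomic claims, one of them unsupported (a hallucination).
faith = faithfulness_from_verdicts([True, True, True, False])
judge2 = judge2_score(faith, answer_relevancy=0.9)
```

Here three of four claims are supported, so faithfulness is 0.75, well under the 0.95 target, and the Judge 2 score comes to about 0.81.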


Judge 3 — Structural Fidelity

Question: Does the answer preserve the terminology, entities, and relationships present in the source documents?

3a — Terminology Fidelity (NLP Heuristics)

BLEU-1/2/3/4 and ROUGE-1/2/L with Precision/Recall/F1 between the answer and source documents.

  • BLEU-1 measures single-word vocabulary overlap. BLEU-4 measures how well 4-gram phrases are preserved.
  • High recall with low precision suggests the answer covers the source's content but adds material the source does not support.

NLP Score: harmonic mean of BLEU-4 and ROUGE-L F1
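To make the precision and recall distinction concrete, here is a simplified sketch of clipped n-gram overlap and the harmonic mean. It is not a full BLEU or ROUGE implementation (there is no brevity penalty, no longest-common-subsequence step, and tokenization is plain whitespace splitting), so treat it as an illustration of the idea only.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_precision_recall(answer: str, source: str, n: int = 1):
    a = ngrams(answer.lower().split(), n)
    s = ngrams(source.lower().split(), n)
    overlap = sum((a & s).values())  # clipped n-gram matches
    precision = overlap / sum(a.values()) if a else 0.0  # share of answer found in source
    recall = overlap / sum(s.values()) if s else 0.0     # share of source found in answer
    return precision, recall


def harmonic_mean(x: float, y: float) -> float:
    # Harmonic mean of two scores; 0.0 when both are zero.
    return 2 * x * y / (x + y) if x + y else 0.0


answer = "The notice period is 30 days"
source = "The notice period is 30 days unless waived in writing"
precision, recall = overlap_precision_recall(answer, source, n=1)

# Combining two already-computed metric values, e.g. BLEU-4 and ROUGE-L F1.
nlp_score = harmonic_mean(0.5, 0.8)
```

In this example every answer word appears in the source (precision 1.0) while the answer covers 6 of the source's 10 words (recall 0.6).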

3b — Entity Fidelity (GLiNER)

GLiNER extracts entities from both source documents and the answer. Any entity in the answer not present in the source is flagged as a hallucination candidate.

  • Entity Precision (target > 0.90): Fraction of answer entities that appear in source docs.
  • Novel Entities: Specific strings flagged individually for human review.
  • Source Coverage: Fraction of source entities the answer mentions.
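The three measures above reduce to set operations once entities have been extracted. The sketch below assumes the entity strings are already available (extraction by GLiNER is out of scope here) and that matching is done on case-folded text. Both the matching rule and the choice to return 1.0 precision for an answer with no entities are assumptions.

```python
def entity_fidelity(answer_entities: list[str], source_entities: list[str]) -> dict:
    # Normalize by trimming and lower-casing (an assumed matching rule).
    answer_set = {e.strip().lower() for e in answer_entities}
    source_set = {e.strip().lower() for e in source_entities}
    shared = answer_set & source_set
    return {
        # Fraction of answer entities that also appear in the source.
        "entity_precision": len(shared) / len(answer_set) if answer_set else 1.0,
        # Answer entities absent from the source: hallucination candidates.
        "novel_entities": sorted(answer_set - source_set),
        # Fraction of source entities the answer mentions.
        "source_coverage": len(shared) / len(source_set) if source_set else 0.0,
    }


result = entity_fidelity(
    answer_entities=["Acme Corp", "Berlin", "2021"],
    source_entities=["Acme Corp", "Munich", "2021", "Jane Doe"],
)
```

Here "berlin" is flagged as a novel entity, entity precision is 2/3 (below the 0.90 target), and source coverage is 0.5.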

3c — Relation Fidelity (GLiREL)

GLiREL extracts (subject, predicate, object) triplets from both the source documents and the answer. Any relation asserted in the answer whose (subject, object) pair has no match in the source is flagged.

Relation Faithfulness (target > 0.90): Fraction of answer (subject, object) pairs that appear in source documents.
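A sketch of the pair-matching step, under the assumption that triplets have already been extracted and that pairs are compared as ordered, case-folded (subject, object) tuples. Whether GlassBox treats pairs as directional is not stated in the article, so that detail is a guess.

```python
Triplet = tuple[str, str, str]


def relation_faithfulness(answer_triplets: list[Triplet],
                          source_triplets: list[Triplet]) -> dict:
    def pair(triplet: Triplet) -> tuple[str, str]:
        # Keep only (subject, object); the predicate is ignored for matching.
        subject, _predicate, obj = triplet
        return (subject.strip().lower(), obj.strip().lower())

    source_pairs = {pair(t) for t in source_triplets}
    answer_pairs = {pair(t) for t in answer_triplets}
    flagged = sorted(answer_pairs - source_pairs)
    matched = len(answer_pairs & source_pairs)
    return {
        "relation_faithfulness": matched / len(answer_pairs) if answer_pairs else 1.0,
        "flagged_pairs": flagged,
    }


relations = relation_faithfulness(
    answer_triplets=[
        ("Acme Corp", "bought", "Widget Ltd"),
        ("Jane Doe", "founded", "Widget Ltd"),
    ],
    source_triplets=[
        ("Acme Corp", "acquired", "Widget Ltd"),
        ("Jane Doe", "leads", "Acme Corp"),
    ],
)
```

The first answer relation matches a source pair even though the predicate wording differs. The second links two entities the source never connects, so it is flagged and the score is 0.5.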

Judge 3 Score: Weighted combination of 3a, 3b, 3c (weight-normalized automatically when sub-components are disabled).
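One plausible way to implement the automatic weight normalization is to divide by the sum of the weights that are still active. The sketch below represents a disabled sub-component as None, which is an assumption about representation, and uses the 30/35/35 split given in the Composite Trust Score section.

```python
from typing import Optional


def normalized_weighted_score(scores: dict[str, Optional[float]],
                              weights: dict[str, float]) -> Optional[float]:
    # Keep only sub-components that are enabled (score is not None).
    active = {name: value for name, value in scores.items() if value is not None}
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0:
        return None  # nothing enabled, so no score can be produced
    # Dividing by the active weight total renormalizes the weights to sum to 1.
    return sum(weights[name] * value for name, value in active.items()) / total_weight


weights = {"nlp": 0.30, "entity": 0.35, "relation": 0.35}

all_enabled = normalized_weighted_score(
    {"nlp": 0.6, "entity": 0.9, "relation": 0.8}, weights)
relation_disabled = normalized_weighted_score(
    {"nlp": 0.6, "entity": 0.9, "relation": None}, weights)
```

With all three enabled the result is about 0.775. With the relation check disabled, the remaining weights (0.30 and 0.35) are rescaled and the result is 0.495 / 0.65, roughly 0.762.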


Composite Trust Score

The final trust score is a weighted blend of the three judges: Contextual Precision contributes 30%, Semantic Faithfulness 40%, and Structural Fidelity the remaining 30%. Structural Fidelity is itself a weighted mix — 30% from the NLP heuristics (the harmonic mean of BLEU-4 and ROUGE-L), 35% from GLiNER entity precision, and 35% from GLiREL relation faithfulness.
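The two-level blend can be written out directly. The weights below are the ones stated in the paragraph above. The function and key names are illustrative, and the input scores are made-up values used only to show the arithmetic.

```python
JUDGE_WEIGHTS = {
    "contextual_precision": 0.30,
    "semantic_faithfulness": 0.40,
    "structural_fidelity": 0.30,
}
STRUCTURAL_WEIGHTS = {
    "nlp": 0.30,                    # harmonic mean of BLEU-4 and ROUGE-L
    "entity_precision": 0.35,       # GLiNER
    "relation_faithfulness": 0.35,  # GLiREL
}


def structural_fidelity(nlp: float, entity_precision: float,
                        relation_faithfulness: float) -> float:
    # Inner blend: the three Structural Fidelity sub-components.
    return (STRUCTURAL_WEIGHTS["nlp"] * nlp
            + STRUCTURAL_WEIGHTS["entity_precision"] * entity_precision
            + STRUCTURAL_WEIGHTS["relation_faithfulness"] * relation_faithfulness)


def trust_score(contextual_precision: float, semantic_faithfulness: float,
                structural: float) -> float:
    # Outer blend: the three judges.
    return (JUDGE_WEIGHTS["contextual_precision"] * contextual_precision
            + JUDGE_WEIGHTS["semantic_faithfulness"] * semantic_faithfulness
            + JUDGE_WEIGHTS["structural_fidelity"] * structural)


judge3 = structural_fidelity(nlp=0.6, entity_precision=0.9, relation_faithfulness=0.8)
score = trust_score(contextual_precision=0.885, semantic_faithfulness=0.81,
                    structural=judge3)
```

For these illustrative inputs, Structural Fidelity is 0.18 + 0.315 + 0.28 = 0.775, and the trust score is 0.2655 + 0.324 + 0.2325, about 0.822.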


Trust Scorecard: HLF Ledger Schema

Every evaluation is committed to Hyperledger Fabric as an immutable record. Each record carries one of the following statuses:

Status | Condition
VERIFIED_SUCCESS | trust_score ≥ trust_score_threshold AND all per-judge thresholds met
TRUST_WARNING | trust_score < trust_score_threshold
FAITHFULNESS_WARNING | Ragas faithfulness or GLiREL relation_faithfulness below threshold
PRECISION_WARNING | DeepEval context_precision or GLiNER entity_precision below threshold
RELEVANCY_WARNING | Ragas answer_relevancy below threshold
UNVERIFIED | One or more enabled judges failed to produce a score
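The conditions above can be expressed as a small decision function. The article does not say which status wins when several conditions hold at once, so the precedence below (UNVERIFIED first, then the per-judge warnings, then TRUST_WARNING) is an assumption. The per-judge thresholds reuse the targets quoted earlier; the 0.80 trust-score threshold is a placeholder because the article gives no value.

```python
DEFAULT_THRESHOLDS = {
    "trust_score": 0.80,            # hypothetical; no value given in the article
    "context_precision": 0.85,
    "faithfulness": 0.95,
    "answer_relevancy": 0.80,
    "entity_precision": 0.90,
    "relation_faithfulness": 0.90,
}


def scorecard_status(scores: dict, thresholds: dict = DEFAULT_THRESHOLDS) -> str:
    # `scores` holds only metrics from enabled judges; None means the judge
    # ran but failed to produce a score.
    if any(value is None for value in scores.values()):
        return "UNVERIFIED"

    def below(name: str) -> bool:
        return name in scores and scores[name] < thresholds[name]

    # ASSUMED precedence among warnings.
    if below("faithfulness") or below("relation_faithfulness"):
        return "FAITHFULNESS_WARNING"
    if below("context_precision") or below("entity_precision"):
        return "PRECISION_WARNING"
    if below("answer_relevancy"):
        return "RELEVANCY_WARNING"
    if below("trust_score"):
        return "TRUST_WARNING"
    return "VERIFIED_SUCCESS"


passing = {
    "trust_score": 0.90,
    "context_precision": 0.90,
    "faithfulness": 0.97,
    "answer_relevancy": 0.85,
    "entity_precision": 0.95,
    "relation_faithfulness": 0.92,
}
status = scorecard_status(passing)
```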

Configuration

GlassBox is fully configurable. Each judge can be enabled or disabled independently and given its own weight and pass/fail thresholds — DeepEval's precision threshold, Ragas's faithfulness and relevancy thresholds, and GLiNER and GLiREL's entity-precision and relation-faithfulness thresholds. A global trust-score threshold sets the overall pass bar, and the Hyperledger Fabric settings specify the channel and chaincode used to commit each scorecard.
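A configuration along these lines might look like the sketch below. The structure, the field names, the channel and chaincode names, and the 0.80 trust-score threshold are all hypothetical. Only the judge weights and the per-metric targets come from this article.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeConfig:
    enabled: bool = True
    weight: float = 0.0
    thresholds: dict[str, float] = field(default_factory=dict)


@dataclass
class GlassBoxConfig:
    contextual_precision: JudgeConfig
    semantic_faithfulness: JudgeConfig
    structural_fidelity: JudgeConfig
    trust_score_threshold: float
    hlf_channel: str
    hlf_chaincode: str

    def enabled_weight_total(self) -> float:
        # Sum of weights across judges that are switched on.
        judges = (self.contextual_precision, self.semantic_faithfulness,
                  self.structural_fidelity)
        return sum(j.weight for j in judges if j.enabled)


config = GlassBoxConfig(
    contextual_precision=JudgeConfig(weight=0.30,
                                     thresholds={"context_precision": 0.85}),
    semantic_faithfulness=JudgeConfig(weight=0.40,
                                      thresholds={"faithfulness": 0.95,
                                                  "answer_relevancy": 0.80}),
    structural_fidelity=JudgeConfig(weight=0.30,
                                    thresholds={"entity_precision": 0.90,
                                                "relation_faithfulness": 0.90}),
    trust_score_threshold=0.80,       # placeholder value
    hlf_channel="glassbox-channel",   # placeholder name
    hlf_chaincode="trust-scorecard",  # placeholder name
)
```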

GlassBox is dispatched asynchronously after answer generation, so evaluation does not delay the user-facing response.