GlassBox: Auditable AI Evaluation Middleware
GlassBox measures contextual precision, semantic faithfulness, and structural fidelity for any RAG system — then commits a tamper-proof trust scorecard to a Hyperledger Fabric ledger.
Overview
GlassBox is Anatypical’s standalone evaluation middleware. It measures the quality of any LLM answer, whether produced by a RAG system, an agent, or a direct LLM query. GlassBox is not tied to any specific retrieval architecture: it accepts a well-defined input contract and produces a tamper-proof audit record committed to Hyperledger Fabric (HLF).
It answers three questions:
| Question | Judge |
|---|---|
| Did the retrieval surface the right content? | Contextual Precision |
| Is the answer semantically grounded and on-topic? | Semantic Faithfulness |
| Does the answer preserve the terminology, entities, and relationships present in the source documents? | Structural Fidelity |
Provider-Agnostic Input Contract
Any system submits an evaluation request to GlassBox via webhook or message queue. Only query, answer, and retrieved_passages are required.
Each request carries a session ID and optional model metadata (the provider and model name) alongside the query being asked and the answer the system produced. It can also include the retrieval strategy used, any entities and chunk-level relevance scores the retriever surfaced, the passages that were retrieved, and the full source documents they came from.
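As a rough illustration of this contract, the sketch below models a request and rejects payloads that lack a mandatory field. Only `query`, `answer`, `retrieved_passages`, and `retrieved_entities` are names taken from the text; every other field name, the `EvaluationRequest` class, and `parse_request` are assumptions made for the example, not GlassBox's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvaluationRequest:
    # Mandatory fields named in the contract.
    query: str
    answer: str
    retrieved_passages: list[str]
    # Optional fields. Apart from retrieved_entities, these names are
    # hypothetical stand-ins for the metadata the text describes.
    session_id: Optional[str] = None
    model_provider: Optional[str] = None
    model_name: Optional[str] = None
    retrieval_strategy: Optional[str] = None
    retrieved_entities: list[str] = field(default_factory=list)
    chunk_scores: list[float] = field(default_factory=list)
    source_documents: list[str] = field(default_factory=list)


REQUIRED = ("query", "answer", "retrieved_passages")


def parse_request(payload: dict[str, Any]) -> EvaluationRequest:
    """Reject payloads with a missing or empty mandatory field."""
    missing = [name for name in REQUIRED if not payload.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return EvaluationRequest(**payload)
```

A minimal payload therefore needs only the three mandatory keys; everything else falls back to an empty default.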
System Architecture
The flow is straightforward. Any system — a RAG pipeline, an agent, or a direct LLM call — generates an answer, then hands GlassBox the payload over a webhook or message queue. Because this happens asynchronously, it adds zero latency to the user-facing response. The GlassBox worker then runs three judges: Contextual Precision (via DeepEval), Semantic Faithfulness (via Ragas), and Structural Fidelity — itself a combination of NLP heuristics, GLiNER entity coverage, and GLiREL relation fidelity. Their scores are aggregated into a single trust score between 0 and 1, which is committed to the Hyperledger Fabric ledger, where chaincode enforces the schema and rejects any incomplete transaction.
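The asynchronous hand-off can be pictured with an in-process queue standing in for the webhook or message queue. This is only a sketch: `answer_user`, `run_judges`, and the placeholder payload are hypothetical, and a real deployment would use an external broker and the actual judges.

```python
import queue
import threading

eval_queue = queue.Queue()  # stand-in for the webhook or message queue
results = []                # stand-in for scorecards awaiting commit


def run_judges(payload: dict) -> dict:
    # Placeholder: the real worker would run the three judges here.
    return {"session_id": payload.get("session_id"), "evaluated": True}


def worker() -> None:
    while True:
        payload = eval_queue.get()
        try:
            results.append(run_judges(payload))
        finally:
            eval_queue.task_done()


threading.Thread(target=worker, daemon=True).start()


def answer_user(query: str) -> str:
    answer = f"answer to: {query}"  # stand-in for the RAG pipeline
    # Enqueue and return immediately; evaluation runs off the request path.
    eval_queue.put({
        "session_id": "s-1",
        "query": query,
        "answer": answer,
        "retrieved_passages": ["placeholder passage"],
    })
    return answer
```

The user-facing function returns as soon as the payload is queued, so the judges' runtime never sits between the user and the answer.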
Judge 1 — Contextual Precision
Question: Did the retriever surface the right content?
1a — LLM-Based Relevance (DeepEval)
DeepEval's ContextualPrecision asks an LLM to evaluate whether each retrieved passage contributed to a correct answer. Target: > 0.85.
1b — Semantic Similarity Distribution
Independently checks retriever calibration by computing embedding-based query-passage similarity and comparing it to the scores the retriever reported. If retrieved_entities is present, entity hit rate is also measured.
Judge 1 Score: 0.70 × deepeval_precision + 0.30 × score_1b
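The sketch below shows one way the two sub-scores could be combined. The 0.70/0.30 blend comes from the formula above. The calibration measure (1 minus the mean absolute gap between cosine similarity and the retriever's reported score) is an assumption chosen for illustration, since the text does not specify how 1b is computed.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def calibration_score(query_vec, passage_vecs, reported_scores) -> float:
    # Assumed measure, not GlassBox's documented one:
    # 1 minus the mean |cosine similarity - reported score|, floored at 0.
    gaps = [
        abs(cosine(query_vec, vec) - reported)
        for vec, reported in zip(passage_vecs, reported_scores)
    ]
    if not gaps:
        return 0.0
    return max(0.0, 1.0 - sum(gaps) / len(gaps))


def judge1_score(deepeval_precision: float, score_1b: float) -> float:
    """Blend from the text: 0.70 x deepeval_precision + 0.30 x score_1b."""
    return 0.70 * deepeval_precision + 0.30 * score_1b
```

With a DeepEval precision of 0.90 and a 1b score of 0.50, the blend is 0.70 × 0.90 + 0.30 × 0.50 = 0.78.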
Judge 2 — Semantic Faithfulness
Question: Is the answer grounded in the retrieved context?
Judge 2 runs on the Ragas framework, which scores the answer on two metrics: faithfulness and answer relevancy.
Faithfulness (target > 0.95): Ragas decomposes the answer into atomic claims and checks each against the retrieved context. Claims not supported by any passage count as hallucinations.
Answer Relevancy (target > 0.80): Ragas generates synthetic questions from the answer and measures their alignment with the original query. A low score signals "agent drift."
Judge 2 Score: 0.60 × faithfulness + 0.40 × answer_relevancy
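A minimal sketch of this blend, assuming the two Ragas metrics have already been computed elsewhere. The function name and the returned list of missed targets are illustrative; the weights and targets are the ones stated above.

```python
FAITHFULNESS_TARGET = 0.95
RELEVANCY_TARGET = 0.80


def judge2_score(faithfulness: float, answer_relevancy: float):
    """Return the 0.60/0.40 blend and the names of any metrics at or below target."""
    score = 0.60 * faithfulness + 0.40 * answer_relevancy
    missed = []
    if faithfulness <= FAITHFULNESS_TARGET:
        missed.append("faithfulness")
    if answer_relevancy <= RELEVANCY_TARGET:
        missed.append("answer_relevancy")
    return score, missed
```

For example, a faithfulness of 0.97 with an answer relevancy of 0.75 blends to 0.60 × 0.97 + 0.40 × 0.75 = 0.882, while still missing the relevancy target.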
Judge 3 — Structural Fidelity
Question: Does the answer preserve the terminology, entities, and relationships present in the source documents?
3a — Terminology Fidelity (NLP Heuristics)
BLEU-1/2/3/4 and ROUGE-1/2/L with Precision/Recall/F1 between the answer and source documents.
- BLEU-1 = word vocabulary overlap. BLEU-4 = 4-gram phrase preservation.
- High Recall + low Precision = the answer reproduces much of the source wording but adds material the source does not contain.
NLP Score: H-mean(BLEU-4, ROUGE-L F1)
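To make the precision/recall reading and the harmonic mean concrete, here is a simplified sketch. It uses clipped n-gram overlap on whitespace tokens, which is not full BLEU (no brevity penalty, no geometric mean across orders) and not ROUGE-L (no longest common subsequence). The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def h_mean(x: float, y: float) -> float:
    """Harmonic mean of two non-negative scores (0.0 if both are zero)."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0


def overlap_prf(answer: str, source: str, n: int = 1):
    """Clipped n-gram precision, recall, and F1 of the answer against the source."""
    a = ngrams(answer.lower().split(), n)
    s = ngrams(source.lower().split(), n)
    matched = sum((a & s).values())  # Counter intersection clips the counts
    precision = matched / sum(a.values()) if a else 0.0
    recall = matched / sum(s.values()) if s else 0.0
    return precision, recall, h_mean(precision, recall)
```

Given BLEU-4 and ROUGE-L F1 values from a real scoring library, the NLP score is then `h_mean(bleu_4, rouge_l_f1)`. Because the harmonic mean is pulled toward the lower value, a weak result on either metric drags the combined score down.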
3b — Entity Fidelity (GLiNER)
GLiNER extracts entities from both source documents and the answer. Any entity in the answer not present in the source is flagged as a hallucination candidate.
- Entity Precision (target > 0.90): Fraction of answer entities that appear in source docs.
- Novel Entities: Specific strings flagged individually for human review.
- Source Coverage: Fraction of source entities the answer mentions.
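The three entity measures reduce to set arithmetic once extraction is done. The sketch below assumes the GLiNER output has already been flattened into lists of entity strings and matches them case-insensitively. Both choices, and the convention of scoring an empty set as 1.0, are assumptions for illustration.

```python
def entity_fidelity(answer_entities: list[str], source_entities: list[str]) -> dict:
    """Entity precision, novel entities, and source coverage from two entity lists."""
    ans = {e.strip().lower() for e in answer_entities}
    src = {e.strip().lower() for e in source_entities}
    shared = ans & src
    return {
        # Assumption: an answer with no entities has nothing unsupported.
        "entity_precision": len(shared) / len(ans) if ans else 1.0,
        "novel_entities": sorted(ans - src),  # candidates for human review
        "source_coverage": len(shared) / len(src) if src else 1.0,
    }
```

An answer mentioning Acme, Berlin, and Zurich against a source containing Acme, Berlin, Paris, and London would score an entity precision of 2/3, flag "zurich" as novel, and cover half of the source entities.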
3c — Relation Fidelity (GLiREL)
GLiREL extracts (subject, predicate, object) triplets from both source and answer. Any relation asserted in the answer with no matching entity pair in source is flagged.
Relation Faithfulness (target > 0.90): Fraction of answer (subject, object) pairs that appear in source documents.
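A sketch of the pair-matching step, assuming triplets have already been extracted by GLiREL and are available as plain tuples. Matching is on the ordered, lowercased (subject, object) pair and ignores the predicate, as the definition above describes. The empty-answer convention of 1.0 is an assumption.

```python
Triplet = tuple[str, str, str]  # (subject, predicate, object)


def relation_faithfulness(answer_triplets: list[Triplet],
                          source_triplets: list[Triplet]):
    """Return (fraction of answer pairs found in source, flagged answer triplets)."""
    def pair(t: Triplet) -> tuple[str, str]:
        return (t[0].strip().lower(), t[2].strip().lower())

    source_pairs = {pair(t) for t in source_triplets}
    answer_pairs = {pair(t) for t in answer_triplets}
    flagged = [t for t in answer_triplets if pair(t) not in source_pairs]
    if not answer_pairs:
        return 1.0, flagged  # assumption: no asserted relations, nothing to flag
    return len(answer_pairs & source_pairs) / len(answer_pairs), flagged
```

Because only the entity pair is compared, a paraphrased predicate ("bought" versus "acquired") still counts as supported, while a relation linking two entities that the source never connects is flagged.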
Judge 3 Score: Weighted combination of 3a, 3b, 3c (weight-normalized automatically when sub-components are disabled).
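Weight normalization can be sketched as a weighted average over only the enabled sub-components. The default weights below are the 30/35/35 split given in the Composite Trust Score section. Representing a disabled sub-component as `None` and the key names are assumptions made for the example.

```python
from typing import Optional

JUDGE3_WEIGHTS = {"nlp": 0.30, "entity": 0.35, "relation": 0.35}


def judge3_score(scores: dict[str, Optional[float]],
                 weights: dict[str, float] = JUDGE3_WEIGHTS) -> Optional[float]:
    """Weighted average over enabled sub-components, renormalizing the weights."""
    enabled = {k: v for k, v in scores.items() if v is not None}
    total_weight = sum(weights[k] for k in enabled)
    if total_weight == 0:
        return None  # nothing enabled, so no score can be produced
    return sum(weights[k] * v for k, v in enabled.items()) / total_weight
```

With all three enabled at 0.48, 0.90, and 0.80 the score is 0.739. Disabling the relation check rescales the remaining weights to sum to 1, giving (0.30 × 0.48 + 0.35 × 0.90) / 0.65.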
Composite Trust Score
The final trust score is a weighted blend of the three judges: Contextual Precision contributes 30%, Semantic Faithfulness 40%, and Structural Fidelity the remaining 30%. Structural Fidelity is itself a weighted mix — 30% from the NLP heuristics (the harmonic mean of BLEU-4 and ROUGE-L), 35% from GLiNER entity precision, and 35% from GLiREL relation faithfulness.
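The top-level blend can be written out directly from these percentages. The dictionary keys and function name are illustrative; the weights are the ones stated above.

```python
JUDGE_WEIGHTS = {
    "contextual_precision": 0.30,
    "semantic_faithfulness": 0.40,
    "structural_fidelity": 0.30,
}


def trust_score(judge_scores: dict[str, float]) -> float:
    """Weighted blend of the three judge scores, each expected in [0, 1]."""
    if set(judge_scores) != set(JUDGE_WEIGHTS):
        raise ValueError("expected exactly one score per judge")
    for name, value in judge_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return sum(JUDGE_WEIGHTS[name] * judge_scores[name] for name in JUDGE_WEIGHTS)
```

As a worked example, judge scores of 0.78, 0.882, and 0.739 give 0.30 × 0.78 + 0.40 × 0.882 + 0.30 × 0.739 = 0.234 + 0.3528 + 0.2217 = 0.8085.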
Trust Scorecard: HLF Ledger Schema
Every evaluation is committed to Hyperledger Fabric as an immutable record:
| Status | Condition |
|---|---|
| VERIFIED_SUCCESS | trust_score ≥ threshold AND all per-judge thresholds met |
| TRUST_WARNING | trust_score < trust_score_threshold |
| FAITHFULNESS_WARNING | Ragas faithfulness or GLiREL relation_faithfulness below threshold |
| PRECISION_WARNING | DeepEval context_precision or GLiNER entity_precision below threshold |
| RELEVANCY_WARNING | Ragas answer_relevancy below threshold |
| UNVERIFIED | One or more enabled judges failed to produce a score |
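The table does not say which status wins when several conditions hold at once, so the sketch below picks one precedence order purely as an assumption. The per-metric thresholds reuse the targets quoted in the judge sections, the 0.80 trust threshold is a made-up placeholder, and all metrics are treated as enabled.

```python
from typing import Optional

# Per-metric values reuse the stated targets; trust_score is a hypothetical default.
THRESHOLDS = {
    "context_precision": 0.85,
    "faithfulness": 0.95,
    "answer_relevancy": 0.80,
    "entity_precision": 0.90,
    "relation_faithfulness": 0.90,
    "trust_score": 0.80,
}


def scorecard_status(scores: dict[str, Optional[float]],
                     thresholds: dict[str, float] = THRESHOLDS) -> str:
    """Map scores to a status, using an assumed precedence order."""
    # A missing score from any metric makes the record unverifiable.
    if any(scores.get(name) is None for name in thresholds):
        return "UNVERIFIED"

    def below(name: str) -> bool:
        return scores[name] < thresholds[name]

    if below("faithfulness") or below("relation_faithfulness"):
        return "FAITHFULNESS_WARNING"
    if below("context_precision") or below("entity_precision"):
        return "PRECISION_WARNING"
    if below("answer_relevancy"):
        return "RELEVANCY_WARNING"
    if below("trust_score"):
        return "TRUST_WARNING"
    return "VERIFIED_SUCCESS"
```

Checking for missing scores first means a judge that failed to run can never be mistaken for a pass.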
Configuration
GlassBox is fully configurable. Each judge can be enabled or disabled independently and given its own weight and pass/fail thresholds — DeepEval's precision threshold, Ragas's faithfulness and relevancy thresholds, and GLiNER and GLiREL's entity-precision and relation-faithfulness thresholds. A global trust-score threshold sets the overall pass bar, and the Hyperledger Fabric settings specify the channel and chaincode used to commit each scorecard.
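A configuration along these lines might look like the sketch below. The key layout, the channel and chaincode names, and the 0.80 trust threshold are all hypothetical; only the judge weights and per-metric targets come from the sections above. The validator shows the kind of sanity checks such a config would likely need.

```python
CONFIG = {
    "judges": {
        "contextual_precision": {
            "enabled": True, "weight": 0.30,
            "thresholds": {"context_precision": 0.85},
        },
        "semantic_faithfulness": {
            "enabled": True, "weight": 0.40,
            "thresholds": {"faithfulness": 0.95, "answer_relevancy": 0.80},
        },
        "structural_fidelity": {
            "enabled": True, "weight": 0.30,
            "thresholds": {"entity_precision": 0.90, "relation_faithfulness": 0.90},
        },
    },
    "trust_score_threshold": 0.80,  # hypothetical value
    "hlf": {"channel": "glassbox-channel", "chaincode": "trust-scorecard"},  # hypothetical names
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config is usable."""
    problems = []
    enabled = {n: j for n, j in cfg["judges"].items() if j["enabled"]}
    if not enabled:
        problems.append("at least one judge must be enabled")
    for name, judge in enabled.items():
        if judge["weight"] <= 0:
            problems.append(f"{name}: weight must be positive")
        for key, value in judge["thresholds"].items():
            if not 0.0 <= value <= 1.0:
                problems.append(f"{name}: threshold {key} must be in [0, 1]")
    if not 0.0 <= cfg["trust_score_threshold"] <= 1.0:
        problems.append("trust_score_threshold must be in [0, 1]")
    for key in ("channel", "chaincode"):
        if not cfg["hlf"].get(key):
            problems.append(f"hlf.{key} must be set")
    return problems
```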
GlassBox is dispatched asynchronously after answer generation — zero latency impact on the user-facing response.