Early Access · Agent-Agnostic Eval Platform

Evaluate Every Decision
Your Agent Fleet Makes.

CAS Framework is an evaluation platform for enterprise agentic AI. Score your agents across three dimensions — Compliance, Policy Adherence, and Agentic Patterns — and visualise the execution DAG with per-node CAS scores in real time.

3
Evaluation categories
0
Raw data leaves your VPC
90%
LLM eval cost reduction
<5ms
Policy propagation, live
CAS Framework · Execution DAG — Live
EVALUATING
SupervisorAgent · WORKFLOW ADHERENCE · 0.96 ✓
MCPTool.github · TOOL CALLBACK SAFETY · 0.91 ✓
PII_Scanner · COMPLIANCE · DEFAULT · 0.99 ✓
DataAnalysisAgent · CODE EXECUTION SANDBOX · 0.74 ⚠
Slack_Notification · POLICY ADHERENCE · 0.51 ✗ BLOCKED
Agentic Patterns
Tool Safety
Compliance
Code Sandbox
OTel Trace → DAG Reconstruction · Per-Node CAS Scoring · Zero-Egress by Architecture · Google ADK · LangGraph · CrewAI · Any OTel Agent · Auto-Generated Compliance ADRs
Evaluation Platform

Three Dimensions.
One Score Per Agent.

Every agent in your fleet is evaluated across three independent dimensions. Some signatures are defaults — applied to every agent, regardless of type. Others are assigned per agent-class to match its specific responsibilities.

DIMENSION 01 / 03
🔵

Compliance

Ensures every agent output adheres to data governance, privacy regulation, and organisational policy boundaries. Catches PII leakage, data residency violations, and sensitive context egress before it leaves the execution boundary.

PII Density DEFAULT — ALL AGENTS
Data Residency Check DEFAULT — ALL AGENTS
HIPAA / PCI Field Exposure
Regulatory Boundary Adherence
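As a loose illustration of what the default PII Density signature measures (the real check is a DSPy signature evaluated by an LLM, not regex; the patterns and threshold below are invented for this sketch):

```python
import re

# Hypothetical patterns — illustrative only, not the production detectors.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_density(text: str) -> float:
    """PII pattern matches per 100 whitespace-delimited tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(len(p.findall(text)) for p in PII_PATTERNS.values())
    return 100.0 * hits / len(tokens)

def pii_score(text: str, max_density: float = 1.0) -> float:
    """Map density onto a 0..1 score; 1.0 means no PII detected."""
    return max(0.0, 1.0 - pii_density(text) / max(max_density, 1e-9))
```

A score near 1.0 keeps the agent green; anything below the policy threshold would surface as a compliance violation on the DAG.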
DIMENSION 02 / 03
🟡

Policy Adherence

Verifies agents follow your organisation's defined behavioural rules — which MCP tools are permitted, what state mutations are allowed, which output topics are restricted. Scored against CISO-defined DSPy guardrails.

Forbidden State Mutations DEFAULT — ALL AGENTS
MCP Tool Allowlist Enforcement
Output Topic Restrictions
A2A Delegation Permissions
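A toy version of allowlist enforcement, to make the idea concrete. The agent classes and MCP tool names are hypothetical, and real verdicts come from CISO-defined DSPy guardrails rather than a lookup table:

```python
# Hypothetical per-agent-class allowlists (illustrative names).
MCP_ALLOWLIST = {
    "SupervisorAgent": {"mcp.github", "mcp.jira"},
    "DataAnalysisAgent": {"mcp.bigquery"},
}

def check_tool_call(agent: str, tool: str) -> dict:
    """Return a pass/block verdict for one MCP tool invocation."""
    allowed = MCP_ALLOWLIST.get(agent, set())
    if tool in allowed:
        return {"verdict": "PASS", "score": 1.0}
    return {
        "verdict": "BLOCK",
        "score": 0.0,
        "violation_reason": f"{tool} not in allowlist for {agent}",
    }
```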
DIMENSION 03 / 03
🟣

Agentic Patterns

Evaluates whether the agent followed its declared execution pattern — supervisor → specialist routing, loop bounds, tool call ordering, delegation protocols. Different agent types have different expected patterns.

Workflow Adherence (Orchestrators)
Code Execution Sandbox (Executors)
A2A Protocol Delegation (Routers)
Tool Callback Sequence (Tool Agents)
Default — runs on every agent regardless of type
Per-agent-class — assigned based on agent role
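As an illustration, a workflow-adherence check for an orchestrator might compare the observed call sequence against the declared pattern; the scoring weights and loop bound below are invented for the sketch:

```python
def workflow_adherence(calls: list[str], declared: list[str], max_loops: int = 3) -> float:
    """Score how closely an observed call sequence follows the declared
    pattern: order violations and loop-bound breaches each subtract
    from a 1.0 baseline. Purely illustrative scoring."""
    score = 1.0
    # Order: every declared step must appear, in order, in the observed calls.
    idx = 0
    for step in declared:
        try:
            idx = calls.index(step, idx) + 1
        except ValueError:
            score -= 0.3  # missing or out-of-order step
    # Loop bound: no single step may repeat more than max_loops times.
    for step in set(calls):
        if calls.count(step) > max_loops:
            score -= 0.2
    return max(score, 0.0)
```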
DAG Visualiser

See How Your Agent Fleet
Actually Thought.

Your OTel trace spans already contain the full execution graph. CAS Framework reconstructs that as a visual DAG — and overlays per-node CAS scores, signature verdicts, and violation details on every agent and tool in the run.
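A minimal sketch of the trace-to-DAG step: OTel spans carry a span id and a parent span id, which is enough to rebuild the parent → child topology. The simplified span dicts here stand in for real OTel span objects:

```python
from collections import defaultdict

def build_dag(spans: list[dict]) -> dict[str, list[str]]:
    """Reconstruct parent -> children edges from OTel-style spans.
    Each span needs 'span_id', 'parent_span_id' (None for the root),
    and 'name'."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    for s in spans:
        parent = s.get("parent_span_id")
        if parent in by_id:
            children[by_id[parent]["name"]].append(s["name"])
    return dict(children)
```

Per-node scores and verdicts would then be attached to each node of this graph before rendering.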

🗺 Trace · a3f8b2c1 · FinanceOps Pipeline · 847ms total
4 ✓ PASS 1 ✗ BLOCK
SupervisorAgent · AGENTIC PATTERNS · WORKFLOW · 0.96 ✓
MCPTool.github · POLICY · TOOL CALLBACK SAFETY · 0.91 ✓
PII_Scanner · COMPLIANCE · DEFAULT SIG · 0.99 ✓
DataAnalysisAgent · AGENTIC PATTERNS · CODE SANDBOX · 0.74 ⚠
BillingAgent · COMPLIANCE · A2A PROTOCOL · 0.88 ✓
Slack_Notification · POLICY ADHERENCE · 0.51 ✗ BLOCKED
Agentic Patterns eval
Policy eval
Compliance eval
Code Sandbox eval
Blocked
SELECTED NODE — DataAnalysisAgent
Eval Dimension
Agentic Patterns · Code Execution Sandbox
DSPy Signature Applied
EvaluateADKCodeExecutionSandbox
0.74
CAS Score
340ms
Span Latency
Violation Reason
Import of requests library detected. Not in allowed_libraries. Network egress attempt flagged.
AVG CAS BY AGENT — THIS TRACE
SupervisorAgent
0.96
PII_Scanner
0.99
MCPTool.github
0.91
BillingAgent
0.88
DataAnalysisAgent
0.74
Slack_Notification
0.51
📋 1 ADR auto-generated from this trace. Slack_Notification blocked — forbidden state mutation on user_permissions. Policy TOOL_SAFETY_004.
Enterprise Architecture

Four Zeros That Make
Enterprise Adoption Inevitable.

Each "Zero" eliminates a category of blockers — from security reviews to developer overhead. Together they mean your platform, security, and AI teams can all say yes at the same time.

Security Architecture

Evaluate Locally.
Egress Only Math.

Raw agent conversations, prompts, and PII are evaluated inside your VPC. Only CAS scores cross the network boundary.

01
Agent emits OTel spans
Your agent executes, generating trace spans. Everything stays inside your network from this moment forward.
INSIDE YOUR VPC
02
OTel Collector strips PII inline
Built-in transform and redaction processors remove cleartext PII and tokens before spans reach the evaluation engine.
INSIDE YOUR VPC
03
DSPy evaluates using local LLMs
FastAPI + DSPy runs Compliance, Policy, and Agentic Pattern signatures using your local model keys. No external API touches your conversations.
INSIDE YOUR VPC
04
Presidio runs offline NLP redaction
Hatchet worker runs heavy Presidio NLP models in memory, replacing any remaining entities with typed placeholders.
INSIDE YOUR VPC
05
Only mathematical scores egress
Trace ID, CAS score, violation flags, sanitised reasons. Zero raw data. Proven by architecture, not policy.
ZERO-EGRESS CONFIRMED
// THE COMPLETE EGRESS PAYLOAD
// ✅ Everything that leaves your VPC
{
  "trace_id": "a3f8b2c1-e4d2-...",
  "agent_name": "SupervisorAgent",
  "cas_score": 0.92,
  "compliance_score": 0.96,
  "policy_score": 0.88,
  "patterns_score": 0.91,
  "violation_flags": [],
  "raw_prompt_present": false,
  "raw_pii_present": false
  // ← No prompts. No PII. Pure math.
}
✓ ENTERPRISE ARCHITECTURE APPROVED
Zero-Egress by architecture means your legal, security, and compliance reviews become straightforward — there is nothing sensitive to review in the data flow. Full architecture diagrams →
Developer Experience

Your Developers
Never Touch Their Repos.

Platform Engineering owns the deployment. AI Engineers add one label. That's the entire integration surface.

01
Platform team installs one Helm chart
A single helm install deploys the Mutating Webhook, Eval Engine, and Control Plane to your cluster.
5-MINUTE DEPLOY
02
AI Engineers add one label to their Deployment
cas-framework.ai/inject: "enabled" — the entire integration commitment from an AI engineering team.
1 LABEL ONLY
03
Mutating Webhook auto-injects sidecars
Intercepts Pod creation, injects the OTel Collector + CAS Eval Sidecar automatically. No containers to manage, no volumes to configure.
ZERO-TOUCH
04
Evaluation starts immediately. Non-blocking.
If the sidecar crashes, your agent keeps running. Compliance evaluation is observational — never in the agent's critical execution path.
ZERO CODE CHANGE
// THE ENTIRE DEVELOPER INTEGRATION
# Platform team: one-time cluster setup
helm upgrade --install seeti-core \
  ./k8s/chart/seeti-core/ \
  --namespace default

# AI Engineering team: one label per agent
metadata:
  labels:
    cas-framework.ai/inject: "enabled"

# That's literally it. No SDK. No PR. No release.
✓ ZERO DEVELOPER FRICTION
No proprietary dependencies in agent codebases. Fully reversible by removing the label. No blast radius on your agents if evaluation infra has issues.
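For readers unfamiliar with mutating webhooks, this sketch shows the kind of AdmissionReview response such a webhook returns to append a sidecar container via JSONPatch. The container name and image are placeholders, not the actual CAS Framework artefacts:

```python
import base64
import json

def admission_response(uid: str) -> dict:
    """Build a Kubernetes AdmissionReview response that appends a
    sidecar container to the Pod spec via JSONPatch. Illustrative
    image and container names."""
    patch = [{
        "op": "add",
        "path": "/spec/containers/-",
        "value": {"name": "cas-eval-sidecar", "image": "example/cas-sidecar:latest"},
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            # Kubernetes requires the patch to be base64-encoded JSON.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the patch targets Pod creation, removing the injection label really is the whole rollback: the next rollout simply receives no patch.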
Dynamic Policy Engine

New MCP Tool Adopted.
Policy Live in 4 Seconds.

When your agent fleet adopts a new tool dynamically, evaluation coverage follows in milliseconds — not sprints.

01
Policy Author writes rule in plain language
The CAS Policy Commander accepts natural language input: "Block all agents from writing to financial_db unless user_intent confirms authorisation."
NATURAL LANGUAGE
02
LLM Meta-Compiler produces a DSPy Signature
An internal LLM translates the natural language rule into a validated JSON DSL that maps to a strict dspy.Signature class, ready for evaluation.
AUTO-COMPILED
03
Shadow Mode: backtest on 30 days history
Before enforcement, the new signature runs in shadow mode against historical traces. Review what it would have caught — without impacting live agents.
SAFE TESTING
04
One-click propagation. Zero pod restarts.
gRPC/SSE Sync Engine pushes the compiled signature to every connected Sidecar globally in milliseconds. No CI/CD. No deployment window. No engineers paged.
<5ms · ZERO DOWNTIME
// POLICY LIFECYCLE
# 1. Deploy in shadow mode — safe observation
POST /v1/policies/deploy
{
  "rule": "Block financial_db writes without auth",
  "target_agents": ["*"],
  "mode": "shadow"
}

# 2. Review 30-day backtest in dashboard

# 3. Flip to enforce — no redeploy
PATCH /v1/policies/{id}
{ "mode": "enforce" }

# → propagation_ms: 3.1
# → sidecars_updated: 847
# → pods_restarted: 0
✓ COMPLIANCE AT ENGINEERING SPEED
Full policy version history. Rollback in one click. Every enforcement action logged as an immutable audit record.
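The shadow-mode backtest can be pictured as replaying a candidate rule over stored traces and tallying what it would have blocked. This sketch assumes a trace is a plain dict and a rule is any predicate over it; the real engine evaluates compiled DSPy signatures:

```python
def shadow_backtest(traces: list[dict], rule) -> dict:
    """Replay a candidate rule over historical traces without enforcing.
    `rule` is any callable mapping a trace to True (would block).
    Returns the summary a reviewer inspects before flipping the
    policy to enforce mode."""
    would_block = [t["trace_id"] for t in traces if rule(t)]
    return {
        "traces_evaluated": len(traces),
        "would_block": len(would_block),
        "block_rate": len(would_block) / len(traces) if traces else 0.0,
        "sample_trace_ids": would_block[:5],
    }
```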
OTel Native

One Integration.
Every Agent Framework.

OpenTelemetry is the universal observability standard. CAS Framework speaks it natively — so any OTel-emitting agent is automatically supported.

01
Any framework emitting OTel spans is supported
Google ADK, LangGraph, CrewAI, AutoGen, or any custom agent. If it speaks OTel, CAS Framework can evaluate it.
CNCF STANDARD
02
Two-line instrumentation for Python agents
import openlit; openlit.init(endpoint="cas-sidecar:4318") — two lines cover all three evaluation dimensions on any Python agent.
2 LINES
03
Default signatures run on every agent
PII Density, Data Residency, and Forbidden State Mutation signatures apply to all agents automatically — guaranteed baseline coverage regardless of framework.
BASELINE COVERAGE
04
One DAG, one dashboard, across all frameworks
ADK, LangGraph, and CrewAI agents appear on the same DAG, evaluated against the same CAS standards. One compliance posture for your entire fleet.
ZERO LOCK-IN
// SAME INTEGRATION. EVERY FRAMEWORK.
# Google ADK agent
import openlit
openlit.init(endpoint="cas-sidecar:4318")

# LangGraph agent
import openlit
openlit.init(endpoint="cas-sidecar:4318")

# CrewAI agent
import openlit
openlit.init(endpoint="cas-sidecar:4318")

# Custom agent built in-house
import openlit
openlit.init(endpoint="cas-sidecar:4318")

# The Sidecar evaluates them all identically.
✓ FRAMEWORK-AGNOSTIC BY ARCHITECTURE
Switch agent frameworks without re-instrumenting compliance. Your evaluation coverage migrates automatically.
The Aha Moment

Your Fleet Adopted
a New MCP Tool.
Evaluation Live in 4 Seconds.

Every compliance cycle traditionally requires engineering tickets, DSL rewrites, PRs reviewed, staging tested, production windows coordinated. By the time a new threat is addressed, your fleet has already been running unprotected for two weeks.

CAS Framework's Dynamic Signature Sync Engine closes that window to seconds. A policy author writes in plain language, an internal LLM compiles it to a validated dspy.Signature, and a gRPC/SSE persistent connection propagates it globally — without restarting a single Kubernetes pod.

Old workflow
🔴 2–4 weeks: ticket → DSL → PR → staging → prod
CAS Framework
🟢 4 seconds: write → compile → propagate globally
Downtime
🔴 Full deployment window, pod restarts
🟢 Zero restarts. Zero dropped evaluations.
New MCP tool
🔴 Manual signature per tool, each sprint
🟢 DSPy auto-compiled, live in milliseconds
Who acts
🔴 On-call engineer required
🟢 Policy author acts directly, no eng involved
CAS Policy Commander
STEP 1 — POLICY AUTHOR INPUT (NATURAL LANGUAGE)
↓  LLM Meta-Compiler  ↓
STEP 2 — COMPILED DSPy SIGNATURE
class EvaluateMCPToolSafety(dspy.Signature):
    context: str  # tool_call + user_intent
    user_authorized: bool
    violation_reason: Optional[str]

# fires on: MCP spans where
# tool.target contains "financial_db"
↓  gRPC/SSE Sync Engine  ↓
3.2ms
Propagated to 847 Sidecars globally. Zero pods restarted. Policy is live now.
📋 ADR AUTO-GENERATED ON FIRST ENFORCEMENT
agent: DataAnalysisAgent
policy: EvaluateMCPToolSafety v1.2
verdict: BLOCKED
reason: financial_db.write called without
        user_intent authorization key
cas_score: 0.31
Generated automatically · 0.4s after enforcement
Governance Automation

Every Agent Decision.
Documented Automatically.

Architecture Decision Records were always manual — engineers documenting why a system behaved the way it did. For an agent fleet making thousands of routing decisions per second, that's impossible without automation.

CAS Framework generates structured ADRs from every evaluation. When an agent chooses one MCP over another, routes to a specialist, or gets blocked — the reasoning, score, applied policy, and recommendation are captured as an immutable record.
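A rough sketch of how an evaluation result might become an immutable record: hashing the result into the ADR id makes tampering detectable. Field names mirror the example ADR shown below, but the code is illustrative rather than the production generator:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ADR:
    """Minimal shape of an auto-generated decision record."""
    adr_id: str
    agent: str
    dimension: str
    signature: str
    cas_score: float
    verdict: str
    reason: str

def generate_adr(eval_result: dict) -> ADR:
    """Derive an immutable ADR from one evaluation result; the id is a
    content hash of the result, so edits break the id."""
    body = json.dumps(eval_result, sort_keys=True).encode()
    threshold = eval_result.get("threshold", 0.75)  # red-zone default
    return ADR(
        adr_id=hashlib.sha256(body).hexdigest()[:12],
        agent=eval_result["agent"],
        dimension=eval_result["dimension"],
        signature=eval_result["signature"],
        cas_score=eval_result["cas_score"],
        verdict="BLOCKED" if eval_result["cas_score"] < threshold else "PASS",
        reason=eval_result.get("reason", ""),
    )
```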

🌐
Global Fleet ADRs
Organisation-wide compliance posture. Aggregate CAS trends across all agents and all projects. Feeds directly into leadership dashboards.
LEADERSHIP · GOVERNANCE · BOARD REPORTING
📁
Project / Agent ADRs
Per-agent decision history. Why did SupervisorAgent choose this MCP 847 times this week? What triggered the pattern violation on DataAnalysisAgent?
ENGINEERING · AI LEADS · AUDITORS
Real-Time Violation ADRs
Immediate structured record of every blocked action with DSPy reasoning, affected policy, severity score, and remediation suggestion.
SECURITY OPERATIONS · COMPLIANCE RESPONSE
📋
ADR-2026-0341 · AUTO-GENERATED · 0.4s ago
BLOCKED
Agent
DataAnalysisAgent / FinanceOps-Prod
Decision Under Review
Agent attempted to execute Python importing requests for an external network call during code sandbox execution.
Evaluation Dimension
Agentic Patterns · Code Execution Sandbox
DSPy Signature Applied
EvaluateADKCodeExecutionSandbox
CAS Score
0.29 — Red Zone (threshold 0.75)
DSPy Reasoning
requests is in forbidden_libraries for this agent's policy
Network egress attempt detected: allow_network_egress: false
Execution halted before any external call completed
Recommendation
Update agent task prompt to use internal DB adapter. Escalate to AI Engineering for sandbox policy review.
Over-The-Horizon Intelligence

Fleet Health, Risk Posture,
and Cost Projections — in One View.

Aggregate CAS scores, per-agent drift detection, 90-day risk forecasts, and cost shield metrics — from real-time operational data to strategic leadership dashboards.
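The fleet roll-up itself is simple arithmetic. A sketch of the aggregation behind the dashboard's headline numbers, using the 0.75 red-zone threshold from the ADR example elsewhere on this page:

```python
from statistics import fmean

def fleet_summary(scores: dict[str, float], red_zone: float = 0.75) -> dict:
    """Roll per-agent CAS scores up to the fleet-level figures the
    dashboard displays: global average plus agents below threshold."""
    return {
        "global_cas": round(fmean(scores.values()), 2),
        "flagged": sorted(a for a, s in scores.items() if s < red_zone),
    }
```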

📊 CAS Framework — Executive Risk Dashboard · FinanceOps Org
LIVE
GLOBAL CAS SCORE
0.89
Fleet-wide average
↑ +0.04 vs 30d
COST SHIELD
$41k
Saved via LLM Cascade
↑ 89% on local models
BLOCKED TODAY
14
Policy violations blocked
↓ −38% vs yesterday
CAS SCORE BY AGENT
SupervisorAgent
0.96
PII_Scanner
0.99
MCPTool.github
0.91
DataAnalysisAgent
0.74
Slack_Notification
0.51
30-DAY TREND
30d ago → Today · ↗ improving
AI RISK POSTURE — FLAGGED AGENTS
| Agent | CAS | Risk | Top Violation | Recommended Action |
| --- | --- | --- | --- | --- |
| DataAnalysisAgent | 0.74 | MEDIUM | Code Sandbox: forbidden library | Review agent prompt + sandbox policy |
| Slack_Notification | 0.51 | HIGH | Policy: state mutation violation | Immediate review. Enable shadow mode. |
| BillingAgent | 0.77 | MEDIUM | A2A delegation to unknown target | Update allowed specialist targets |

Operational (live) · Tactical (weekly trends) · Strategic (90-day AI risk posture projections)

Integrations

Every Framework.
Two Lines of Code.

OpenTelemetry is the universal language. Any agent that speaks it gets full evaluation coverage — compliance, policy adherence, and agentic pattern scoring — automatically.

🤖
Google ADK
First-class support. Pre-built DSPy signatures for all four ADK agent patterns — Workflow, Tool Safety, A2A Delegation, Code Sandbox.
4 SIGNATURES INCLUDED
🔗
LangGraph
Graph-based workflows emit OTel spans natively. Full DAG reconstruction, compliance scoring, and ADR generation out of the box.
OTEL NATIVE
🚀
CrewAI
Role-based multi-agent crews instrumented in two lines. Evaluate crew coordination, task delegation, and tool usage compliance per-agent.
OTEL NATIVE
⚙️
Any OTel Agent
AutoGen, Semantic Kernel, custom in-house agents — if it emits OTel spans, the Sidecar evaluates it. Bring your own DSPy signatures for custom patterns.
BRING YOUR OWN
OpenTelemetry CNCF + DSPy Semantic Eval + Kubernetes Helm + ClickHouse Analytics + Presidio NLP Redaction · Enterprise Evaluation Platform
Built For

The Three Teams Building
Enterprise AI.

Compliance infra your engineering org will actually adopt.

The Problem: Every observability vendor asks you to fork agent codebases, install proprietary SDKs, manage vendor lifecycle, and coordinate releases across 12 teams. Compliance infra becomes a toil machine that nobody wants to maintain.
CAS Framework: One Helm chart. One Kubernetes label. Mutating Webhook handles the rest. Zero SDK in any agent repo. Fully reversible. Non-blocking if the eval infra has issues.
1 label
The entire integration surface for an AI engineering team to get full evaluation coverage.
Non-blocking
Sidecar failure drops evaluations — your agents never pause. Compliance is never in the critical path.
HPA native
Evaluation workers scale automatically with Kubernetes HPA. No custom autoscaling logic.

Evaluate your fleet the same day you ship it.

The Problem: Your agent patterns evolve weekly. You're adopting new MCP tools constantly. Every compliance update is a sprint. And you have no way to visualise whether your agents are actually following the routing logic you designed.
CAS Framework: Pre-built DSPy signatures for ADK cover your four core agent patterns from day one. The DAG visualiser shows you exactly how your fleet executed — node by node, score by score. Dynamic Signatures mean new tools get eval coverage in 4 seconds.
DAG + scores
Visualise your agent's execution graph with per-node CAS scores on every run.
4 seconds
New MCP tool adopted by your agents? Evaluation signature compiled and live in 4 seconds.
90% cost ↓
LLM Cascade routes 90% of eval calls to local cheap models. Eval at fleet scale stays affordable.

Strategic visibility into your AI fleet's risk posture.

The Problem: You're deploying agents across every business unit. Your board wants to understand AI risk. You have no aggregate view of compliance posture, no cost projection, and no leading indicator of which agents are drifting before they cause incidents.
CAS Framework: Leadership dashboards that roll up CAS scores from every agent, every project, every org unit — into global fleet health, cost shield metrics, 90-day risk projections, and per-agent drift alerts that belong in board decks, not engineering standups.
Fleet HUD
Master CAS dial, cost shield metrics, fleet topology map — one executive view across every agent in your org.
90-day
AI risk posture projections from trend analysis and agent drift pattern detection.
BYO-Vault
Deep-link from any ADR directly into your team's existing Datadog or Splunk instance using the trace ID.
Market Position

Designed for Agent Fleets.
Not Retrofitted from APM.

| Capability | CAS Framework | Traditional Observability / APM |
| --- | --- | --- |
| Evaluation Model | 3-dimension CAS: Compliance, Policy, Agentic Patterns | Metrics and traces only — no semantic evaluation |
| Agent DAG Visualisation | OTel trace → DAG reconstruction with per-node scores | Flat trace waterfalls, no agent topology |
| Data Privacy | Zero-Egress — local eval, offline NLP redaction | Raw prompts and responses egressed to vendor cloud |
| Default Signatures | Pre-built DSPy signatures for ADK patterns, OOTB | No semantic evaluation capability |
| Policy Updates | Dynamic gRPC/SSE push — <5ms, zero pod restarts | Full CI/CD redeploy, 2–4 week cycle |
| Evaluation Cost | LLM Cascade — 90% on fast/cheap local models | Frontier API per evaluation, unbounded cost |
| Compliance ADRs | Auto-generated per violation — structured, immutable | Manual documentation if it exists at all |
| Agent Framework Support | OTel-native — ADK, LangGraph, CrewAI, any agent | Framework-specific, proprietary SDK per vendor |
| Developer Overhead | One K8s label — no SDK, no PR, no release | SDK install and maintenance across every repo |
Get Started

Evaluate Every Agent.
Score Every Decision.

CAS Framework deploys in 5 minutes. Your first DAG with per-node CAS scores appears immediately after.

# Deploy in 5 minutes
helm repo add aidatarefinery https://charts.aidatarefinery.io
helm upgrade --install seeti-core \
  aidatarefinery/seeti-core \
  --namespace default \
  --set sidecar.llmApiKey="your_api_key"

# Your agent fleet DAG appears automatically.