TL;DR
- Framework Release: Microsoft has released ASSERT as an open-source framework for testing AI agents against organization-specific behavior requirements.
- Evaluation Workflow: The framework turns policies into executable evaluations and connects failed scores to traces, outputs, and judge rationales.
- Evidence Limits: Microsoft claims 80 to 90 percent judge-human agreement, but the figures remain first-party evidence.
- Market Context: Langfuse, Phoenix, and Anthropic Bloom show that behavior testing and trace-based evaluation are already a crowded field.
Microsoft has introduced ASSERT at Build 2026 as part of its open trust stack for AI agents, making it the next step after its earlier RAMPART and Clarity release. Released as open source, the framework helps developers catch company-specific AI-agent failures before deployment by turning behavior requirements into executable tests.
ASSERT also follows an earlier Azure responsible AI safeguards rollout. Product teams building agents that call external systems can test their own rules instead of relying only on broad model benchmarks.
For enterprises, the practical point is not another generic scorecard. ASSERT takes organizational policies and requirements as input, generates evaluation scenarios, and surfaces safety or quality defects before production or during monitoring. An evaluation is a structured behavior test, while a regression test checks whether an update changed behavior that previously worked.
How ASSERT Turns Policies Into Tests
ASSERT generates behavior-specific test cases and can run them against hosted models, callable wrappers, or OpenTelemetry-traced agents. OpenTelemetry trace data helps teams connect a failed score to the agent workflow that produced it.
A typical workflow starts with a behavior specification, turns it into structured evaluations, then lets teams review, run, score, and improve those tests over time. Generated cases, traces, and local artifacts give developers a path from a failed score to the model output, system path, or judge rationale that needs review.
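The workflow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ASSERT's actual API: `BehaviorSpec`, `EvalCase`, `EvalResult`, `toy_agent`, `toy_judge`, and `run_eval` are hypothetical names, the refund rule is invented for the example, and the judge is a deterministic rule standing in for an LLM judge.

```python
from dataclasses import dataclass, asdict
from typing import Callable

# Every name below is a hypothetical illustration, not part of ASSERT.

@dataclass
class EvalCase:
    prompt: str
    expect_refusal: bool  # what the policy requires for this prompt

@dataclass
class BehaviorSpec:
    name: str
    requirement: str
    cases: list[EvalCase]

@dataclass
class EvalResult:
    prompt: str
    output: str
    passed: bool
    rationale: str

def toy_agent(prompt: str) -> str:
    """Stand-in agent with a deliberate bug: it only recognizes '45 days'."""
    if "45 days" in prompt:
        return "Sorry, refunds are only available within 30 days."
    return "Your refund has been approved."

def toy_judge(case: EvalCase, output: str) -> tuple[bool, str]:
    """Stand-in judge: compares the output to the expected behavior."""
    refused = "only available within 30 days" in output
    if refused == case.expect_refusal:
        return True, "Output matches the 30-day refund policy."
    return False, "Output contradicts the 30-day refund policy."

def run_eval(
    spec: BehaviorSpec,
    agent: Callable[[str], str],
    judge: Callable[[EvalCase, str], tuple[bool, str]],
) -> list[EvalResult]:
    """Run every case through the agent, then score it with the judge."""
    results = []
    for case in spec.cases:
        output = agent(case.prompt)
        passed, rationale = judge(case, output)
        results.append(EvalResult(case.prompt, output, passed, rationale))
    return results

spec = BehaviorSpec(
    name="refund-policy",
    requirement="Refunds are allowed only within 30 days of purchase.",
    cases=[
        EvalCase("I bought this 10 days ago and want a refund.", False),
        EvalCase("I bought this 45 days ago and want a refund.", True),
        EvalCase("I bought this 60 days ago and want a refund.", True),
    ],
)

results = run_eval(spec, toy_agent, toy_judge)
artifacts = [asdict(r) for r in results]  # inspectable records per case
pass_rate = sum(r.passed for r in results) / len(results)
failures = [a for a in artifacts if not a["passed"]]
```

Each entry in `failures` keeps the prompt, the agent output, and the judge rationale together, which is the kind of record that lets a reviewer trace a low score back to a specific response.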
Because ASSERT works across frameworks and libraries including LangChain, CrewAI, LiteLLM, and OpenAI, teams can test policy behavior without moving the application into Microsoft Foundry. Cross-framework support fits enterprises whose agents may combine different orchestration layers, model providers, and monitoring stacks.
Microsoft previously put OpenTelemetry-based tracing into Azure AI Foundry, and ASSERT brings similar inspectability to behavior tests that can run across frameworks.
Lorenze Jay Hernandez, Open Source Lead at CrewAI, framed the value around that auditability.
“My favorite thing about ASSERT is that the eval is easy to configure and reason about. I describe the behavior I care about in YAML, point it at a real agent, and get artifacts back. Not just pass/fail. They show why the judge made each call. That openness matters. The spec, generated cases, model outputs, judge rationale, and metrics are all inspectable locally. For developers, the eval feels auditable, not like a black box.”
Lorenze Jay Hernandez, Open Source Lead at CrewAI (via Microsoft Foundry Blog)
Inspectable artifacts separate ASSERT from broad leaderboards. Product teams can describe a refund policy, safety rule, or tool-use boundary, then test whether a deployed agent keeps following that requirement as prompts, models, or application code change.
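A regression check of this kind reduces to comparing per-case results before and after a change. The sketch below is an assumption-laden illustration, not ASSERT's implementation: `find_regressions`, the case identifiers, and the pass/fail values are all hypothetical, and a case missing from the new run is treated as a regression by choice.

```python
def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Return case ids that passed in the baseline but not in the candidate.

    A case absent from the candidate run counts as a regression
    (an assumption made for this sketch).
    """
    return sorted(
        case_id
        for case_id, passed_before in baseline.items()
        if passed_before and not candidate.get(case_id, False)
    )

# Hypothetical results from two runs of the same behavior tests.
baseline = {
    "refund-10-days": True,
    "refund-45-days": True,
    "tool-boundary": False,
}
candidate = {
    "refund-10-days": True,
    "refund-45-days": False,  # worked before the update, fails after it
    "tool-boundary": True,
}

regressions = find_regressions(baseline, candidate)
release_blocked = bool(regressions)  # simple gate: any regression blocks
```

Here `tool-boundary` failed before and passes now, so it is not flagged; only `refund-45-days` counts as a regression and blocks the release.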
Microsoft’s Proof Points Need Careful Framing
Microsoft’s measurements give ASSERT useful but limited evidence. Microsoft says agreement between LLM judges and human annotators was usually in the 80 to 90 percent range, while human annotators agreed with one another at about 90 percent. That automated-judge comparison is relevant because the human-to-human rate sets a practical ceiling for the judge, but the figures remain first-party.
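An agreement figure like these is typically a simple percent-agreement rate over the same set of items. The sketch below shows that arithmetic only; the labels are made up for illustration and are not Microsoft's data, and `agreement` is a hypothetical helper. Whether Microsoft used plain percent agreement or a chance-corrected statistic is not stated in the source.

```python
def agreement(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Fraction of items on which two raters gave the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Invented pass/fail labels for ten evaluation items (illustrative only).
human_a = [True, True, False, True, False, True, True, False, True, True]
human_b = [True, True, False, True, False, True, True, False, True, False]
judge = [True, True, False, False, False, True, True, False, True, False]

human_human = agreement(human_a, human_b)  # 9 of 10 items match
judge_human = agreement(judge, human_a)    # 8 of 10 items match
```

With these invented labels the human annotators agree on 90 percent of items and the judge agrees with the first annotator on 80 percent, which shows why the human-to-human rate is the natural reference point for reading a judge score.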
Microsoft also says ASSERT mapped roughly 1.2 times as much of the intended behavior space as a comparable internal baseline. Sarah Bird, Chief Product Officer of Responsible AI at Microsoft, said teams need to test many application-specific behavior dimensions to know whether an AI system meets an organization’s own bar, including after deployment and during continuous monitoring.
Those figures support Microsoft’s capability claim, not an independent verdict on real-world performance.
A Crowded Evaluation Market Shapes the Stakes
AI evaluation tools are moving beyond one-off benchmark scores toward application-specific behavior checks and agent trace inspection. Langfuse supports evaluation across live production traces, datasets, experiments, manual review, and automated evaluators. Arize Phoenix provides deterministic code-based evaluators and LLM-as-a-judge evaluators, with OpenTelemetry inspection for evaluator runs.
Anthropic released Bloom in 2025 as an open-source framework for generating behavioral evaluations of frontier models. Microsoft put ASSERT into its own sequence of developer-facing AI-safety tools after RAMPART and Clarity targeted AI-agent safety workflows.
ASSERT extends that push into policy-driven tests developers can run inside their own pipelines. Its cross-framework support covers LangChain, CrewAI, LiteLLM, OpenAI, and more, with LiteLLM integrations for more than 100 model endpoints.
ASSERT is available under the MIT license, making adoption possible without a Microsoft platform commitment. The practical test is adoption: to matter, ASSERT checks need to block a release or flag a deployed agent before users see a policy failure.

