Microsoft ASSERT Framework Turns AI-Agent Policies Into Executable Tests


TL;DR

  • Framework Release: Microsoft has released ASSERT as an open-source framework for testing AI agents against organization-specific behavior requirements.
  • Evaluation Workflow: The framework turns policies into executable evaluations and connects failed scores to traces, outputs, and judge rationales.
  • Evidence Limits: Microsoft claims 80 to 90 percent judge-human agreement, but the figures remain first-party evidence.
  • Market Context: Langfuse, Phoenix, and Anthropic Bloom show behavior testing and trace-based evaluation are already crowded.

Microsoft has introduced ASSERT at Build 2026 as part of its open trust stack for AI agents, making it the next step after its earlier RAMPART and Clarity release. Released as open source, the framework helps developers catch company-specific AI-agent failures before deployment by turning behavior requirements into executable tests.

Microsoft also positioned ASSERT after an earlier Azure responsible AI safeguards rollout. Product teams building agents that call external systems can test their own rules instead of relying only on broad model benchmarks.

For enterprises, the practical point is not another generic scorecard. ASSERT takes organizational policies and requirements as input, generates evaluation scenarios, and surfaces safety or quality defects before production or during monitoring. An evaluation is a structured behavior test, while a regression test checks whether an update changed behavior that previously worked.

How ASSERT Turns Policies Into Tests

ASSERT generates behavior-specific test cases and can run them against hosted models, callable wrappers, or OpenTelemetry-traced agents. OpenTelemetry trace data helps teams connect a failed score to the agent workflow that produced it.

A typical workflow starts with a behavior specification, turns it into structured evaluations, then lets teams review, run, score, and improve those tests over time. Generated cases, traces, and local artifacts give developers a path from a failed score to the model output, system path, or judge rationale that needs review.

Because ASSERT works across frameworks from LangChain, CrewAI, LiteLLM, OpenAI, and others, teams can test policy behavior without moving the application into Microsoft Foundry. Cross-framework support fits enterprises whose agents may combine different orchestration layers, model providers, and monitoring stacks.

Microsoft previously put OpenTelemetry-based tracing into Azure AI Foundry, and ASSERT brings similar inspection pressure to behavior tests that can run across frameworks.