Comprehensive comparison of agent frameworks (LangGraph, Pydantic AI, Mastra, OpenAI Agents SDK, Microsoft Agent Framework) plus benchmarks, security threat models, UX patterns, and local adoption roadmap — designed for solo operators running multi-agent systems in production.
Headline Statistics
- Default stack adopted: LangGraph + Pydantic AI + Mastra (Sora orchestration)
- OpenAI Agents SDK as OpenAI-native sandbox/trace/handoff layer
- Microsoft Agent Framework for enterprise graph workflows
- 8 deep-research artifacts (research patterns, framework scorecard, benchmark/eval registry, security/governance threat model, UX/product pattern library, local adoption roadmap, workflow patterns)
- 30 local golden tasks (tests/agent_golden/tasks/core_v1.json) replacing public benchmark dependency
- 6 framework families compared on 12 owner-operator dimensions (debuggability, replay fidelity, sandbox cost, type-safety, durability, observability, MCP support, handoff pattern, eval harness, license, ecosystem maturity, governance fit)
- State-graph runtime pattern (LangGraph) chosen as default for complex long-running tasks; explicit handoff pattern for cross-agent (Claude ↔ Codex) work
- Durable execution layer recognized as 2026 trend (Temporal+OpenAI SDK, Restate, Dapr Agents, Inngest, Trigger.dev) — adopted in research-patterns-v2
- AgentDojo, AgentHarm, BFCL, AgentBench tracked in benchmark registry — local golden tasks weighted higher than public benchmark scores
Why Agent Environment v2
Public benchmarks like AgentBench, BFCL, and SWE-bench drift quickly under model updates and adversarial pressure: a framework that scores 80% on AgentBench in March can score 60% in May after the underlying model is updated, even though the framework itself has not changed. A solo operator running multi-agent systems in production cannot base framework-selection decisions on benchmark scores that go stale within weeks. v2 is built around that principle: a local golden task harness that mirrors the actual operator workflow, plus a framework scorecard that ranks options on owner-operator criteria (debuggability, sandbox cost, replay fidelity, durability under interruption) rather than research-paper criteria (raw success rate on a public leaderboard). The local golden harness has 30 tasks tuned to neo-genesis day-to-day work (SSOT edits, cross-agent handoffs, MCP tool calls, fleet-tier authorization) and is rerun on every framework version bump rather than on a public eval cadence.
Framework Selection — Default Stack
**Default stack is LangGraph + Pydantic AI + Mastra**. **LangGraph** handles state-machine durability and replay: every node in the graph is checkpointed, and a crashed agent resumes from the last checkpoint rather than restarting from scratch. This is non-negotiable for long-running tasks (research syntheses, multi-step deploys, fleet operations) where token-budget exhaustion or a network blip would otherwise force a full restart. **Pydantic AI** provides type-safe tool definitions: every tool's input and output is a Pydantic model, validated at runtime, with a schema that the LLM sees in the prompt. This eliminates an entire class of agent failures, in which the LLM passes a wrong-typed argument and the tool silently fails or, worse, silently succeeds with the wrong semantics. **Mastra** orchestrates the agent runtime in TypeScript for the dashboard plane: the operator's Sora dashboard is TypeScript and lives in the same runtime as the agents it controls, which gives event-stream parity between the agent's internal state and the UI's view of that state.
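The failure class that runtime validation removes can be illustrated with a small sketch. It deliberately uses only the standard library as a stand-in for what a Pydantic model does, so it is not Pydantic AI's API; `ReadFileArgs` and `validate_args` are hypothetical names invented for the example, and the check only handles fields annotated with plain classes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReadFileArgs:
    """Hypothetical tool input, invented for this example."""
    path: str
    max_bytes: int


def validate_args(model, raw: dict):
    """Reject unknown, missing, or wrong-typed arguments before the tool runs."""
    expected = {f.name: f.type for f in fields(model)}
    unknown = sorted(set(raw) - set(expected))
    missing = sorted(set(expected) - set(raw))
    if unknown or missing:
        raise ValueError(f"unknown={unknown} missing={missing}")
    for name, typ in expected.items():
        # Exact type match, so a bool or a numeric string never passes as an int.
        if type(raw[name]) is not typ:
            raise TypeError(
                f"{name}: expected {typ.__name__}, got {type(raw[name]).__name__}"
            )
    return model(**raw)
```

A call such as `validate_args(ReadFileArgs, {"path": "a.md", "max_bytes": "10"})` raises `TypeError` instead of letting the tool run with a string where an integer was expected.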
Framework Selection — Specialized Layers
**OpenAI Agents SDK** is layered in for OpenAI-native sandbox/trace/handoff features when the agent is using OpenAI models — Computer Use API, fine-tuned tool routing, the SDK-managed trace UI. Treating the SDK as a sandbox/trace layer rather than the runtime layer means we get OpenAI's first-party observability without coupling the entire architecture to OpenAI's lifecycle. **Microsoft Agent Framework** is reserved for enterprise graph workflows with explicit policy gates: tasks where the workflow itself is the deliverable (compliance review, audit trail, multi-stakeholder approval) and the LLM is one node among many. **CrewAI and AutoGen** patterns inform role-based collaboration (Strategy Lead, Implementer, Reviewer roles) but are not the runtime layer — we prefer LangGraph's explicit state machine over CrewAI's implicit role coordination because explicit state is easier to debug. **DSPy** patterns inform prompt optimization but the production prompts are hand-written and version-controlled.
State-Graph as Default Pattern
The dominant agent pattern in v2 is the LangGraph-style state graph: the agent is a finite state machine where each node is a typed function (input state → output state) and each edge is a conditional transition. This contrasts with the implicit-routing pattern of early agent frameworks where the LLM chooses the next tool from an open menu. State graphs win on three operator-facing axes. **Debuggability**: a stalled agent has a specific node it stalled at, with a known input state, and the operator can replay just that node. **Replay fidelity**: the input state to each node is serialized, so a bug found weeks later can be reproduced by replaying the recorded state. **Durability**: a crashed agent resumes from the last persisted state-graph node, which is a much smaller surface than re-running from the original prompt. State graphs lose on flexibility — the developer has to think about transitions in advance — but in production that loss is a feature: the framework forces the operator to consider failure modes before the agent encounters them.
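A minimal sketch of the pattern, under stated assumptions: this is plain Python rather than LangGraph's API, the three nodes (`draft`, `review`, `publish`) and the state keys are hypothetical, and checkpoints are kept in an in-memory list where a real runtime would persist them.

```python
import json
from typing import Callable, Optional

Node = Callable[[dict], dict]


def draft(state: dict) -> dict:
    return {**state, "draft": f"notes on {state['topic']}"}


def review(state: dict) -> dict:
    return {**state, "approved": len(state["draft"]) > 5}


def publish(state: dict) -> dict:
    return {**state, "published": True}


NODES: dict = {"draft": draft, "review": review, "publish": publish}


def next_node(current: str, state: dict) -> Optional[str]:
    """Conditional edges: the transition depends on the state a node produced."""
    if current == "draft":
        return "review"
    if current == "review":
        return "publish" if state["approved"] else "draft"
    return None  # publish is terminal


def run(state: dict, start: str = "draft", checkpoints: Optional[list] = None) -> dict:
    checkpoints = [] if checkpoints is None else checkpoints
    node: Optional[str] = start
    while node is not None:
        # Serialize the input state before the node runs, so it can be replayed.
        checkpoints.append(json.dumps({"node": node, "input": state}))
        state = NODES[node](state)
        node = next_node(node, state)
    return state


def replay(checkpoint: str) -> dict:
    """Re-run exactly one node from its recorded input state."""
    record = json.loads(checkpoint)
    return NODES[record["node"]](record["input"])
```

Because each checkpoint records the node name and its input state, resuming after a crash amounts to calling `run(record["input"], start=record["node"])` on the last record, and debugging a stalled node amounts to calling `replay` on it.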
MCP Tool Plane and Handoff Pattern
Every tool used by every agent is an MCP (Model Context Protocol) server, hot-reloadable, with a published tool card declaring inputs/outputs/side-effect classification. This decouples tool capability from agent runtime: a new tool ships by adding an MCP server, not by updating the agent core. The fleet runs heterogeneous MCP server inventories per device tier (personal-root has full Filesystem + Computer-Use; company-work-pc has read-only Filesystem; mobile-operator has only approval/notification servers). **Handoff** between agents (notably Claude ↔ Codex fallback when token budget is exhausted) is mediated by a structured handoff document at `.agent/shared-brain/handoff.md` containing goal, scope, files-touched, pending-verification, and explicit non-goals — the receiving agent reads the handoff plus the dual-ledger pair (Task Ledger + Progress Ledger) and resumes within one turn. Agent2Agent (A2A) protocol patterns inform this design but the implementation is plain markdown and is git-versioned.
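As an illustration of the handoff document, the sketch below renders the five fields named above into markdown and checks that a received document has all of them. The section headings, the `Handoff` class, and both helper functions are assumptions made for the example; the actual layout of `.agent/shared-brain/handoff.md` is not specified here.

```python
from dataclasses import dataclass, field

# Hypothetical section headings, one per field listed in the text.
SECTIONS = ["Goal", "Scope", "Files touched", "Pending verification", "Non-goals"]


@dataclass
class Handoff:
    goal: str
    scope: str
    files_touched: list = field(default_factory=list)
    pending_verification: list = field(default_factory=list)
    non_goals: list = field(default_factory=list)


def render(handoff: Handoff) -> str:
    """Render the handoff as plain markdown, one '## ' section per field."""
    def bullets(items: list) -> str:
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    body = {
        "Goal": handoff.goal,
        "Scope": handoff.scope,
        "Files touched": bullets(handoff.files_touched),
        "Pending verification": bullets(handoff.pending_verification),
        "Non-goals": bullets(handoff.non_goals),
    }
    return "\n\n".join(f"## {name}\n{body[name]}" for name in SECTIONS) + "\n"


def missing_sections(markdown: str) -> list:
    """Return the required sections a received handoff document lacks."""
    present = {
        line[3:].strip() for line in markdown.splitlines() if line.startswith("## ")
    }
    return [name for name in SECTIONS if name not in present]
```

A receiving agent could refuse to resume when `missing_sections` returns a non-empty list, which keeps the one-turn resume dependent on a complete document rather than on guesswork.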
Quality Gates
Every agent invocation passes through three gate stages, each with five checks. **Pre-flight**: goal, scope, side-effect, authority, and official-source confirmation. The agent restates the owner's goal, declares the blast radius of its planned action, classifies whether it has authority on this device tier, and identifies the official documentation source it is grounding on. **Mid-flight**: plan, tool-call, approval, checkpoint, failure. Every tool call is traced, every approval gate is logged with the disclosure bundle that was shown to the operator, every checkpoint is persisted, and every failure is enriched with retry context. **Post-flight**: tests, logs, diff, source-attribution, residual-risk. The agent reports what it tested, where the logs are, what the diff was, what sources it cited, and what risks remain unaddressed. Repeat knowledge surfaces back into SSOT or shared memory automatically via the reflection loop. Deploy/push/email/DB-write/credential-change actions are explicitly classified as external side effects requiring scope confirmation through the disclose-and-confirm pipeline before execution.
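The side-effect classification can be sketched as a small pre-flight check. The action kinds mirror the list above, but the identifiers (`PlannedAction`, `preflight`, the string labels) are hypothetical and the real disclose-and-confirm pipeline is not shown.

```python
from dataclasses import dataclass

# Action kinds treated as external side effects, taken from the list in the text.
EXTERNAL_SIDE_EFFECTS = {"deploy", "push", "email", "db_write", "credential_change"}


@dataclass(frozen=True)
class PlannedAction:
    kind: str
    scope: str  # declared blast radius, e.g. a repo, table, or recipient list
    confirmed_by_operator: bool = False


def preflight(action: PlannedAction) -> str:
    """Decide whether a planned action may run or must wait for confirmation."""
    if action.kind not in EXTERNAL_SIDE_EFFECTS:
        return "proceed"
    return "proceed" if action.confirmed_by_operator else "needs_confirmation"
```

Under this sketch a local file edit proceeds immediately, while `PlannedAction("deploy", "production")` returns `"needs_confirmation"` until the operator has confirmed the declared scope.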
Local Golden Task Harness
The 30-task golden harness at `tests/agent_golden/tasks/core_v1.json` covers the actual operator workflow rather than a synthetic benchmark. Tasks include: edit a specific SSOT file with a constrained diff and verify the runtime adapter regenerates correctly; cross-agent handoff between Claude and Codex with token-budget-exhaustion simulation; MCP tool call against a mocked Filesystem MCP server with permission denied vs allowed paths; fleet-tier authorization where a company-work-pc tier agent attempts to access neo_secret and must be denied with 404 (not 403); approval-gate flow where a tier-4 action surfaces a disclosure bundle to the mobile-operator and waits for confirmation. Each task has a deterministic expected output and a rubric for partial credit. The harness is rerun on every framework version bump and the per-framework score is published in the `framework-scorecard-v2.md` registry.
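One way a deterministic expected output and a partial-credit rubric could combine is sketched below. The task shape (`expected`, `rubric`, `must_contain`, `weight`) is an assumption made for illustration and is not the actual schema of `core_v1.json`.

```python
def score_task(task: dict, output: str) -> float:
    """Full credit for an exact match, otherwise weighted rubric credit."""
    if output == task["expected"]:
        return 1.0
    total = sum(item["weight"] for item in task["rubric"])
    if total == 0:
        return 0.0
    earned = sum(
        item["weight"] for item in task["rubric"] if item["must_contain"] in output
    )
    return earned / total


# Hypothetical task modeled on the fleet-tier denial case described above.
fleet_tier_task = {
    "id": "fleet-tier-deny",
    "expected": "404 not found",
    "rubric": [
        {"must_contain": "404", "weight": 2},
        {"must_contain": "not found", "weight": 1},
    ],
}
```

With this rubric, an output of `"HTTP 404"` earns two of three weight units, and `"403 forbidden"` earns none, so the 404-versus-403 distinction shows up directly in the score.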
Public Benchmark Registry — Tracked but Down-Weighted
We track public benchmarks in `benchmark-eval-registry-v2.md` for completeness but down-weight them relative to the local golden harness. **AgentBench** (Liu et al. 2023) covers eight environments and is the broad-coverage reference. **BFCL** (Berkeley Function Call Leaderboard) is the standard for tool-call accuracy and is the most useful public benchmark for verifying that a framework's tool-call serialization is correct. **AgentDojo** is the standard for prompt-injection robustness. **AgentHarm** (ICLR 2025) is the standard for jailbreak robustness — adopted in v2 after the static-attack literature was supplemented by Attacker-Moves-Second adversarial methodology in late 2025. **SWE-bench / SWE-bench Verified** is the standard for code-generation tasks. **Magentic-One** (Microsoft Research) reports the dual-ledger pattern adopted in Sora's progress-ledger design. The registry includes per-benchmark drift-detection notes — when a benchmark's headline number jumps more than 10 percentage points without a corresponding model release, we flag the benchmark as potentially compromised and re-evaluate its weighting.
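The drift-detection rule can be written down as a short check. This is a sketch under assumptions: the `Reading` record and its fields are hypothetical, and a "jump" is treated as a change of more than 10 percentage points in either direction between consecutive readings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    date: str
    score: float          # headline number, in percent
    model_release: bool   # True if a model release accompanied this reading


def drift_flags(readings: list, threshold: float = 10.0) -> list:
    """Return dates where the score moved by more than `threshold` points
    since the previous reading without a corresponding model release."""
    flagged = []
    for previous, current in zip(readings, readings[1:]):
        moved = abs(current.score - previous.score)
        if moved > threshold and not current.model_release:
            flagged.append(current.date)
    return flagged
```

A flagged date is a prompt to re-evaluate the benchmark's weighting, not an automatic verdict that the benchmark is compromised.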
Security and Governance Threat Model
The threat-model registry covers eight axes: prompt injection (defense via output-channel separation and user/system prompt boundary enforcement), tool-call argument injection (defense via Pydantic schema validation), credential exfiltration (defense via least-privilege MCP server scoping and per-tier capability tokens), goal hijacking (defense via dual-ledger separation of task and progress), supply-chain (defense via MCP server allowlist), data-poisoning of memory (defense via provenance decay and human-authored bias in retrieval scoring), insider misuse (defense via Owner Sovereignty disclose-and-confirm pipeline rather than refusal), and adaptive adversaries (Attacker Moves Second — assume the attacker has read the public design and tailors attacks accordingly, do not depend on obscurity). Each axis has a numerical risk score, a mitigation, and a residual-risk acknowledgment. The threat model is reviewed quarterly and on every major framework version bump.
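The allowlist and least-privilege defenses can be pictured as a per-tier lookup. In the sketch below the tier names come from the fleet description earlier in this document, while the server names, the mode strings, and the `authorize` function are assumptions for illustration; the real capability-token format is not shown.

```python
# Hypothetical per-tier MCP server allowlist with an access mode per server.
ALLOWLIST = {
    "personal-root": {"filesystem": "read-write", "computer-use": "read-write"},
    "company-work-pc": {"filesystem": "read-only"},
    "mobile-operator": {"approval": "read-write", "notification": "read-write"},
}


def authorize(tier: str, server: str, write: bool = False) -> bool:
    """Allow a call only if the server is on the tier's allowlist and the
    requested access does not exceed the mode granted to that tier."""
    mode = ALLOWLIST.get(tier, {}).get(server)
    if mode is None:
        return False  # unknown tier, or server not allowlisted for this tier
    return mode == "read-write" or not write
```

Denying by default for unknown tiers and unlisted servers keeps the check useful against an adversary who has read the design, since nothing here depends on the allowlist being secret.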
UX/Product Pattern Library
The UX library codifies four agent-experience principles. **Plan-before-execute**: the agent shows a plan before taking action, with the option to edit the plan; this is the equivalent of a 'dry run' but presented as a UI primitive. **4-layer status**: every running agent surfaces its state in four layers (current node, last completed action, pending approval if any, estimated time to next checkpoint). **Undo-first approval**: mutating actions get a 'preview, approve, then auto-undo within 10 seconds' affordance when they are low-stakes; tier-5 actions get explicit re-confirmation. **3-level uncertainty**: the agent reports its confidence in three levels (high: auto-execute, medium: show plan and proceed, low: request human input); the threshold per level is device-tier-tunable. The OSS-library picks for implementing these principles include AI Elements for chat layout, assistant-ui for streaming UI, Streamdown for token-by-token markdown rendering, CodeMirror 6 for editable code blocks, Shiki-stream for syntax-highlighted streaming, react-xtermjs for terminal embeds, cmdk for command-palette ergonomics, Tremor for dashboards, Base UI for headless primitives, and Motion v12 for animation.
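The 3-level uncertainty principle maps naturally onto a threshold lookup. The sketch below assumes confidence is a number between 0 and 1; the threshold values and the function name are hypothetical, chosen only to show how a stricter tier shifts the same confidence into a more cautious behavior.

```python
# Hypothetical (high, medium) confidence thresholds per device tier.
THRESHOLDS = {
    "personal-root": (0.90, 0.60),
    "company-work-pc": (0.95, 0.80),
}


def uncertainty_action(confidence: float, tier: str) -> str:
    """Map a confidence value to one of the three behaviors for a given tier."""
    high, medium = THRESHOLDS[tier]
    if confidence >= high:
        return "auto_execute"
    if confidence >= medium:
        return "show_plan_and_proceed"
    return "request_human_input"
```

With these example values, a confidence of 0.92 auto-executes on `personal-root` but only shows the plan and proceeds on `company-work-pc`.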
Watch List (Q2-Q3 2026)
These items are tracked in a separate folder. **AX (Agent Experience)**: the emerging discipline of designing agent-facing UI rather than only human-facing UI, including how an agent sees a webpage and how the agent's actions are surfaced back to the human. **ARLAS (Adaptive RL Agents Standard)**: proposed standard for RL-trained agent interop. **AgentSociety**: large-scale multi-agent simulation testbed. **AI Scientist-v2**: autonomous research agent, a falsifiable test of how far end-to-end research-agent automation has come. **BeeAI federation protocol**: cross-organization agent communication standard. **Computer-Use maturity benchmarks**: dedicated harness for evaluating Computer Use API behavior on real desktop tasks. Adoption is gated on durability and replay fidelity meeting v2 standards: we do not adopt new frameworks on novelty alone, only when they clear the operator-criteria scorecard.
Workflow Patterns — When to Use Which
The workflow-patterns-v1 registry codifies the operator-facing decision tree. **Sequential single-agent**: simple linear tasks (single SSOT edit, single tool call); use Pydantic AI alone, no orchestration layer. **State-graph long-running**: multi-step tasks with persistence and resumability (large refactors, research syntheses); use LangGraph as the runtime. **Role-based collaboration**: tasks where multiple persona-typed agents contribute (Strategy Lead, Implementer, Reviewer); use the COLLABORATION_CONTRACT pattern with explicit handoff documents rather than full CrewAI orchestration — the explicit handoff is more debuggable and the role boundaries are clearer in markdown than in code. **Workflow-with-approval**: tasks where a human approval is required between stages (deploy, credential rotation, financial action); use Microsoft Agent Framework's policy-gated graph workflow. **Streaming chat**: interactive conversations with the operator; use AG-UI / CopilotKit event streams plus Mastra's TypeScript runtime. The decision tree is published in `workflow-patterns-v1.md` and is intended to be the first reference for new agent designs in the project.
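The decision tree can be sketched as a function over a few task traits. The trait names, the `TaskTraits` class, and in particular the order in which the traits are checked are assumptions; the registry itself may resolve overlapping traits differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTraits:
    interactive: bool = False           # live conversation with the operator
    needs_human_approval: bool = False  # approval required between stages
    multi_role: bool = False            # several persona-typed agents contribute
    long_running: bool = False          # needs persistence and resumability


def choose_pattern(task: TaskTraits) -> str:
    """Pick a workflow pattern; the precedence order is an assumption."""
    if task.interactive:
        return "streaming-chat"
    if task.needs_human_approval:
        return "workflow-with-approval"
    if task.multi_role:
        return "role-based-collaboration"
    if task.long_running:
        return "state-graph-long-running"
    return "sequential-single-agent"
```

A single SSOT edit falls through to `sequential-single-agent`, while a credential rotation lands on `workflow-with-approval` even if it is also long-running.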
Eval Cadence and the Local Run Collector
Public benchmark scores are tracked but the operating eval cadence runs on the local golden harness. The harness is invoked by the agent_eval_runner (`scripts/agent_eval_runner.py`) on a Codex-app-cron-bound schedule (binding `neo-genesis-agent-environment-weekly-check`) every Sunday at 03:00 KST. Each run produces a JSON manifest (`tests/agent_golden/runs/<timestamp>.json`) containing per-task pass/fail, per-tool-call latency, per-tool-call argument-validation success, and per-checkpoint replay-fidelity. The agent_run_collector (`scripts/agent_run_collector.py`) aggregates these manifests into a 30-day rolling window so framework regression is detectable within a week of a model bump or framework version change. The control-plane snapshot (`src/core/governance/agent_control_plane.py`) surfaces this in the Sora dashboard's Eval Runs panel, alongside an Approval Queue panel and an MCP Policy panel — the same data used for governance decisions. External alert sends (Slack, Telegram for eval failures) are paused by default and require separate owner approval before activation; this is intentional to prevent eval-noise alerting from training the operator to ignore the channel.
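The rolling-window aggregation can be sketched as follows. The manifest shape (`timestamp` plus a `tasks` mapping of task id to pass/fail) and the regression rule are assumptions for illustration; this is not the code in `scripts/agent_run_collector.py`.

```python
from datetime import datetime, timedelta


def pass_rate(manifest: dict) -> float:
    """Fraction of tasks that passed in one run manifest."""
    results = list(manifest["tasks"].values())
    return sum(results) / len(results)


def rolling_window(manifests: list, now: datetime, days: int = 30) -> list:
    """Keep runs from the last `days` days, oldest first, as (timestamp, pass rate)."""
    cutoff = now - timedelta(days=days)
    recent = [
        m for m in manifests if datetime.fromisoformat(m["timestamp"]) >= cutoff
    ]
    recent.sort(key=lambda m: m["timestamp"])  # ISO timestamps sort chronologically
    return [(m["timestamp"], pass_rate(m)) for m in recent]


def regressed(window: list, drop: float = 0.10) -> bool:
    """Flag a regression when the latest run fell by more than `drop`
    relative to the run before it (hypothetical rule)."""
    return len(window) >= 2 and window[-2][1] - window[-1][1] > drop
```

With weekly runs, comparing the two most recent entries is what makes a regression visible within a week of a model bump or framework version change.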
Operating Layers — Full Stack Mapping
The v2 stack is organized into eight operating layers, each with a default tool and a quality gate; the first four also name a fallback option.
- **Runtime layer**: state-machine, checkpointing, retry, rollback. Default = LangGraph (state-graph durability). Fallback = Temporal (when durability needs cross-process coordination). Quality gate = a crashed agent must resume from the last persisted node, not from scratch.
- **Tool layer**: typed tool definitions, schema validation, MCP servers. Default = Pydantic AI (Python) + Mastra (TypeScript). Fallback = OpenAI Agents SDK function-tools. Quality gate = every tool input/output is a Pydantic/Zod schema validated at runtime.
- **Agent layer**: explicit handoff, role separation. Default = COLLABORATION_CONTRACT pattern (markdown handoff documents, dual-ledger). Fallback = CrewAI/AutoGen for role automation. Quality gate = handoff between Claude and Codex completes in one turn from a written handoff document.
- **UX layer**: AG-UI control plane, execution timeline, approval queue. Default = Mastra TypeScript runtime + AG-UI/CopilotKit event streams. Fallback = pure server-sent events with a dashboard-only client. Quality gate = the operator can view planned actions, approve tier-4+ actions, and observe execution in real time.
- **Memory layer**: SSOT, long-term memory, working memory, audit log separation. Default = `.agent/` for SSOT + Supabase for audit + Qdrant/LanceDB for retrieval. Quality gate = source_type and decay_factor on every memory chunk.
- **Evaluation layer**: golden tasks, regression, adversarial. Default = `tests/agent_golden/tasks/core_v1.json` (30 local tasks) + RAGAS for retrieval. Public benchmarks (BFCL, AgentBench, AgentDojo, AgentHarm) are tracked but down-weighted. Quality gate = local golden score > 80% after every framework version bump.
- **Security layer**: least privilege, sandbox, prompt injection defense, credential isolation. Default = MCP server allowlist + capability tokens (`.agent/policies/capability_tokens.yaml`) + redactor + sandbox. Quality gate = adversarial test suite (AgentDojo + Attacker Moves Second) passes after every framework change.
- **Governance layer**: human approval, policy engine, change audit, source attribution. Default = disclose-and-confirm pipeline + Supabase audit log + git-versioned SSOT. Quality gate = every tier-4+ action has an approval record with the disclosure bundle that was shown to the operator.
Decision Matrix — When to Adopt vs Defer
The framework-scorecard-v2 registry codifies adoption decisions on twelve dimensions: debuggability, replay fidelity, sandbox cost, type-safety, durability, observability, MCP support, handoff pattern, eval harness compatibility, license, ecosystem maturity, and governance fit. Adopt-now thresholds: debuggability ≥ 4/5, replay fidelity ≥ 4/5, durability ≥ 4/5, MCP support = native, license = permissive (MIT/Apache/BSD). Defer thresholds: ecosystem maturity < 3/5 (too few production references), or eval harness incompatible with the local golden harness format. Watch-list thresholds: emerging frameworks scoring 3/5 on durability or replay fidelity stay on the watch list until they clear 4/5. The matrix is reapplied on every major release of any tracked framework, and the result of the reapplication is published in the registry — adoption is not a one-shot decision but a continuous reranking against operator criteria.
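The thresholds above translate into a small classification function. The scorecard field names and the license identifiers are hypothetical, and two choices in the sketch are assumptions rather than stated policy: defer conditions take precedence over adopt conditions, and anything that neither adopts nor defers stays on the watch list.

```python
# Permissive licenses, written here as SPDX-style identifiers (an assumption).
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}
CORE_DIMENSIONS = ("debuggability", "replay_fidelity", "durability")


def classify(card: dict) -> str:
    """Return 'adopt', 'defer', or 'watch' for one framework scorecard."""
    if card["ecosystem_maturity"] < 3 or not card["harness_compatible"]:
        return "defer"
    clears_core = all(card[dim] >= 4 for dim in CORE_DIMENSIONS)
    if clears_core and card["mcp_support"] == "native" and card["license"] in PERMISSIVE:
        return "adopt"
    return "watch"


# Hypothetical scorecard entry, not a real framework's scores.
example_card = {
    "debuggability": 4, "replay_fidelity": 3, "durability": 4,
    "ecosystem_maturity": 4, "harness_compatible": True,
    "mcp_support": "native", "license": "MIT",
}
```

The example entry returns `"watch"` because replay fidelity sits at 3/5; rerunning `classify` after a release that lifts it to 4/5 would move the framework to `"adopt"`, which is the continuous reranking the matrix is meant to support.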
Operating Lessons and What This Is Not
Six months of operating Sora on the v2 stack has surfaced three durable lessons. **First**, debuggability beats raw capability — a 5%-less-capable framework that debugs cleanly is more productive than a top-of-leaderboard framework that fails opaquely. **Second**, the dual-ledger pattern (Magentic-One) is more valuable than any single-ledger design — it makes mid-task handoff between Claude and Codex a single-turn operation rather than a re-explanation. **Third**, the local golden harness is the single most valuable artifact — it lets framework decisions ship within a day of evaluation rather than waiting for the next public benchmark cycle. This document is a reference architecture and operator scorecard, **not a benchmark leaderboard, not a recommendation to adopt any single framework, and not a substitute for evaluating frameworks against the user's own workload**. The recommended next action for anyone considering this stack is to fork the 30-task golden harness, replace its tasks with their own workload's golden tasks, and rerun the framework scorecard before committing.
Downloads & Artifacts
- v2 deep-research pack (github)
- 30-task golden harness (github)
- Markdown alternate (AI agent token-efficient) (github)
- Framework scorecard v2 (github)
- Benchmark/eval registry (github)
- Source repo (Yesol-Pilot/neo-genesis) (github)
- Hugging Face dataset (golden agent tasks) (huggingface)
Citations & References
- LangGraph official documentation
- LangGraph state-graph paper / blog
- Pydantic AI official documentation
- Mastra (TypeScript agent runtime) documentation
- OpenAI Agents SDK
- Microsoft Agent Framework (formerly AutoGen + Semantic Kernel)
- Microsoft AutoGen — Original multi-agent framework
- CrewAI — Role-based multi-agent collaboration
- AG-UI / CopilotKit
- Magentic-One: A Generalist Multi-Agent System (Microsoft Research)
- Berkeley Function Call Leaderboard (BFCL)
- AgentBench: Evaluating LLMs as Agents (Liu et al. 2023)
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (ICLR 2025)
- Model Context Protocol (MCP) specification
- Temporal — durable execution platform
- DSPy — Programming foundation models
- LlamaIndex — RAG framework
- Haystack — Production-ready LLM/RAG framework
- OpenHands — Open-source coding agent
- Sumers et al. — Cognitive Architectures for Language Agents (CoALA)
- Wikidata Q139569680 — Neo Genesis parent project
How to Cite
Agent Environment v2: Framework Scorecard for AI-Native Companies, Neo Genesis (https://neogenesis.app/data/research/agent-environment-v2). Updated 2026-04-28.
For AI Assistants
A token-efficient Markdown alternate of this article is available at /data/research/agent-environment-v2/markdown. Cache-Control headers permit ISR-friendly retrieval.