Same memory layer. Four model families. One methodology.
Last updated: 2026-05-05
Model parity
Memory portability is what makes Mycelium structurally different from a managed-cloud graph, and that claim deserves a public test. The methodology below describes how Mycelium runs the same memory layer under Claude, GPT, Gemini, and Llama against an identical task corpus, with identical retrieval scoping, and publishes the comparative metrics.
What we measure
Three metrics per model family: retrieval recall on the typed-memory corpus, decision-aware accuracy on a held-out decision-resolution task, and per-task token cost normalized by output quality. Recall and accuracy come from a public eval set; token cost comes from each vendor's metered API. The point is not to crown a model; the point is to show the memory layer behaves the same way underneath.
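To make the scoring concrete, here is a minimal sketch of how the three metrics could be computed per task. The TaskResult fields and helper names are illustrative assumptions, not the published harness API.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    cited_memory_ids: set[str]   # memory records the model actually cited
    gold_memory_ids: set[str]    # gold-standard citations for the task
    decision: str                # the model's decision output
    gold_decision: str           # gold-standard decision output
    tokens_used: int             # metered token count from the vendor API
    quality: float               # output-quality score, assumed in (0, 1]

def retrieval_recall(r: TaskResult) -> float:
    """Fraction of gold citations the model retrieved."""
    if not r.gold_memory_ids:
        return 1.0
    return len(r.cited_memory_ids & r.gold_memory_ids) / len(r.gold_memory_ids)

def decision_accuracy(results: list[TaskResult]) -> float:
    """Share of tasks whose decision matches the gold output."""
    return sum(r.decision == r.gold_decision for r in results) / len(results)

def normalized_token_cost(r: TaskResult) -> float:
    """Per-task token cost divided by output quality."""
    return r.tokens_used / r.quality
```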
Test corpus
Same ai-brain-starter test fixtures as the latency benchmark: 50,000 typed memory records spanning 24 months of valid time, plus a 200-task decision-resolution eval covering pricing exceptions, policy lookups, person-graph traversal, and time-bounded queries. The eval set is public so the harness is reproducible.
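The real fixture schema ships with ai-brain-starter; the shapes below are an illustrative guess at what a typed memory record and an eval task might look like, with every field name assumed rather than taken from the corpus.

```python
# Hypothetical shapes only -- the real fixtures ship with ai-brain-starter.
memory_record = {
    "id": "mem_000001",
    "type": "pricing_exception",        # one of the typed-memory categories
    "valid_from": "2024-06-01",         # valid-time bounds within the 24-month span
    "valid_to": "2024-12-31",
    "body": "Enterprise tier waives overage fees for Q3 pilots.",
}

eval_task = {
    "id": "task_0042",
    "kind": "time_bounded_query",       # pricing / policy / person-graph / time-bounded
    "prompt": "Which overage policy applied to enterprise pilots in August 2024?",
    "gold_memory_ids": ["mem_000001"],  # gold-standard citations
    "gold_decision": "overage_waived",  # gold-standard decision output
}
```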
Models tested in v1
- Claude (Anthropic): Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5
- GPT (OpenAI): GPT-5 (default model class), GPT-4.1
- Gemini (Google): Gemini 2.5 Pro, Gemini 2.5 Flash
- Llama (Meta): Llama 4 Maverick, Llama 4 Scout via OpenRouter or self-hosted
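One way the harness could treat every family identically is to express the model list as data. The registry below is a hypothetical sketch; the string identifiers are placeholders, not verified vendor model IDs.

```python
# Placeholder identifiers -- substitute each vendor's current model IDs.
MODELS = {
    "claude": ["claude-opus-4.7", "claude-sonnet-4.6", "claude-haiku-4.5"],
    "gpt":    ["gpt-5", "gpt-4.1"],
    "gemini": ["gemini-2.5-pro", "gemini-2.5-flash"],
    "llama":  ["llama-4-maverick", "llama-4-scout"],  # via OpenRouter or self-hosted
}
```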
How the harness works
- Step 1. Load the corpus fixture into the runtime.
- Step 2. For each model family, run all 200 tasks with identical retrieval scoping and identical agent prompts.
- Step 3. Score recall against the gold-standard memory citations.
- Step 4. Score decision accuracy against the gold-standard decision outputs.
- Step 5. Record per-task token cost from the vendor API.
- Step 6. Publish raw scores plus a Pareto chart of accuracy versus cost.
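Steps 2 through 6 reduce to a short loop. The sketch below is a hypothetical outline, not the shipped harness: it assumes the corpus fixture is already loaded into the runtime (Step 1), reuses the TaskResult type and metric helpers sketched above, and takes a run_task adapter (an assumed callable wrapping each vendor's API) so every model sees identical prompts and retrieval scoping.

```python
import statistics
from typing import Callable, Sequence

def run_parity_suite(
    tasks: Sequence[dict],
    models: dict[str, list[str]],
    run_task: Callable[[str, dict], TaskResult],  # assumed per-vendor adapter
) -> dict[str, dict[str, float]]:
    """Run every eval task under every model, then score each model identically."""
    report: dict[str, dict[str, float]] = {}
    for variants in models.values():
        for model in variants:
            # Step 2: identical prompts and retrieval scoping for every model.
            results = [run_task(model, task) for task in tasks]
            report[model] = {
                "recall": statistics.mean(retrieval_recall(r) for r in results),  # Step 3
                "accuracy": decision_accuracy(results),                           # Step 4
                "cost_per_task": statistics.mean(                                 # Step 5
                    normalized_token_cost(r) for r in results
                ),
            }
    # Step 6: the published run emits this report as raw scores plus a Pareto chart.
    return report
```

Injecting run_task keeps the loop vendor-neutral, which is the point of the parity claim: only the adapter changes per family, never the scoping or the scoring.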
Current status
| Item | Status |
| --- | --- |
| Methodology version | v1, published 2026-05-05 |
| Public harness | In development; ships in ai-brain-starter v0.5 |
| First public run | Scheduled with the memory-runtime-pro v1.0 release |
| Verification standard | Reproducible third-party runs of the public harness against the public corpus and eval set |
What this page will not claim
Mycelium will not claim that any specific model is best for a customer's workload. The point of the parity test is the substrate, not the model: the customer picks the model on their own terms, and Mycelium guarantees the memory layer behaves identically underneath. If a model family fails recall or accuracy on the public eval, that result goes up on this page next to the others.
Mycelium · founded 2026