Recursive learning for AI agents. No fine-tuning, no GPUs. Your agents get smarter with every run. Automatically.
→ Read how it works

We were founding engineers at Emergent.sh, where we helped scale from zero to $100M ARR. We deployed AI agents across every surface: customer support, code generation, document processing, data pipelines.
And we watched the same failure pattern, every time.
We realized the problem wasn't the model. The problem was that nobody was training the system around the model: the prompts, tools, skills, and routing logic that determine whether an agent succeeds or fails.
So we built recursive learning for AI agents.
Every time your agent runs, our system watches, learns, and improves the next run. Success or failure, it doesn't matter. No human in the loop required.
1. **Detect.** Flag the 4-8 decision points in each run that actually mattered. Zero cost - pure code heuristics.
2. **Diagnose.** One fast LLM call per decision point: "What went wrong? What should have happened instead?"
3. **Generate.** Produce concrete artifacts - skills, routing rules, checklists, code templates - at ~$0.02 each.
4. **Validate.** Only fixes that actually improve performance on held-out problems get promoted. Everything else dies.
5. **Inject.** Deliver the right knowledge at the right moment - not buried in a 10K-token system prompt.
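The five steps above can be sketched as one loop. This is a toy illustration in Python, not our actual implementation - the trace format, the template-based diagnosis, and the `evaluate` stand-in are all simplifications of what production LLM calls would do:

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    kind: str       # e.g. "skill", "routing_rule", "checklist"
    content: str

def learn_from_run(trace, knowledge, evaluate):
    """One pass of the loop: detect -> diagnose -> generate -> validate."""
    # 1. Detect: cheap heuristics flag the decisions that mattered.
    #    Here: any step the trace marked as a failure.
    decisions = [step for step in trace if step["failed"]]

    for step in decisions:
        # 2. Diagnose + 3. Generate: in production these are fast LLM
        #    calls; here we stub them with a template.
        artifact = Artifact(
            kind="checklist",
            content=f"Before '{step['action']}': {step['lesson']}",
        )
        # 4. Validate: promote only if held-out performance improves.
        if evaluate(knowledge + [artifact]) > evaluate(knowledge):
            knowledge.append(artifact)
        # else: the fix dies here
    return knowledge

# 5. Inject: surface only the artifacts relevant to the task at hand.
def inject(task, knowledge, top_k=3):
    relevant = [a for a in knowledge if task in a.content][:top_k]
    return "\n".join(a.content for a in relevant)

# Toy demo: one failed step, and an evaluator that rewards any
# knowledge at all (a stand-in for real held-out accuracy).
trace = [{"action": "parse_invoice", "failed": True,
          "lesson": "verify the currency field is present"}]
kb = learn_from_run(trace, [], lambda k: len(k))
print(inject("parse_invoice", kb))
```

The key property the sketch preserves: an artifact enters the knowledge base only if it beats the current knowledge base on held-out evaluation, so the loop can only accumulate fixes that demonstrably help.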
The output isn't weights - it's readable files.
Skills. Checklists. Routing rules. Code templates. A human can read, edit, approve, or reject every piece of learned knowledge. When a new model drops, the knowledge transfers. Nothing is lost.
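To make "readable files" concrete: a promoted routing rule might look like the hypothetical JSON below. The field names and values here are illustrative, not our actual schema:

```python
import json

# A learned artifact is just a readable file. Hypothetical example of a
# routing rule - any engineer can read, edit, approve, or delete it.
rule = json.loads("""
{
  "kind": "routing_rule",
  "when": "user message mentions a refund",
  "then": "route to the billing agent, not general support",
  "learned_from": "run-0042",
  "heldout_delta": "+6.2% task success"
}
""")

# Because it's plain data rather than weights, auditing a piece of
# learned knowledge is a one-line check:
assert rule["kind"] == "routing_rule"
print(rule["then"])
```

Nothing about this file is tied to a particular model, which is why the knowledge survives a model swap.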
Reinforcement learning is a powerful tool. It's how OpenAI trained ChatGPT, how DeepMind beat Go, how frontier labs push the boundaries of reasoning.
It's also:
Most companies don't need to change the model's weights. They need to change what's around the model - the prompts, tools, context, and routing logic that determine whether an agent succeeds or fails on their specific tasks.
That's what we do.
| | Fine-tuning / RL | Recursive Learning |
|---|---|---|
| Cost | $100K+ per run | $500-1K in API calls |
| Time | Weeks | Hours |
| Output | Opaque weights | Readable files you can edit |
| New model drops | Start over | Knowledge auto-transfers |
| Infrastructure | GPU clusters | Just API calls |
| Team required | ML engineers | Any engineer |
| Auditability | None | Every artifact inspectable |
ARC-AGI-2 is one of the hardest AI benchmarks in the world - abstract reasoning puzzles designed to test genuine intelligence, not pattern matching. Most frontier models score below 40% without specialized scaffolding.
We used recursive learning to take a small, cheap model (Claude Haiku) and make it dramatically outperform its weight class.
Model: Claude Haiku (cheapest Claude model)
Injection overhead: ~$0.02-0.15 per run
Total learning phase: ~$600 in API calls across 70 training runs
| Learning phase | Tasks learned from | Accuracy (28 eval tasks) |
|---|---|---|
| Baseline (no learning) | 0 | 3.6% (1/28) |
| After 35 tasks | 35 | 35.7% (10/28) |
| After 70 tasks | 70 | 57.1% (16/28) |
The cheapest Claude model + recursive learning achieved results that frontier models typically can't match without specialized systems. The learning is worth more than the model upgrade.
Work in progress. We're actively iterating on these results. This is an early proof of concept. We'll publish a more robust submission with full methodology soon.
Every run makes the next run better. Wins teach what works. Losses teach what to avoid. The system extracts the difference automatically.
After 100 runs of accumulated knowledge, run 101 is dramatically better than run 1. A competitor starting fresh can't match your system's accumulated intelligence.
This is the moat. It's not our code. It's your data.
RAG retrieves similar past context. Your agent still has to figure out what to do with it. We generate prescriptive fixes - concrete skills, rules, and checklists that tell the agent exactly what to do differently.
Prompt optimizers tune prompt wording. We optimize full agent behavior: tools, skills, routing, mid-run interventions, and code templates. Prompt wording is maybe 20% of the problem.
Observability tools give you dashboards. You still need a human to review traces, spot patterns, and write fixes. We close the loop automatically - from failure detection to fix generation to validation to injection.
Fine-tuning and RL change the model's weights (expensive, opaque, model-specific). We change what's around the model (cheap, readable, model-portable). See the comparison table above.
We helped build and scale a company from 0 to $100M ARR. We saw firsthand how AI agent failures compound at scale, and how the current solutions (prompt engineering, fine-tuning, hope) don't work.
We're building the infrastructure layer that makes AI agents learn from experience. Not by changing the model - by changing everything around it.
Early access - we're working with a small number of teams.
→ Book a 15-min call

Or reach out: kaushik@bentolabs.ai