From the founding engineers of Emergent.sh

AI agents make the same mistakes.
We make them learn.

Recursive learning for AI agents. No fine-tuning, no GPUs. Your agents get smarter with every run. Automatically.

→ Read how it works

See our ARC-AGI-2 results ↓

We've seen this problem before

We were founding engineers at Emergent.sh, where we helped scale from zero to $100M ARR. We deployed AI agents across every surface: customer support, code generation, document processing, data pipelines.

And we watched the same failure pattern, every time:

  1. Ship an agent. It works 70% of the time. Everyone's excited.
  2. Edge cases pile up. The agent hallucinates on the same patterns. Support tickets roll in.
  3. Prompt engineering treadmill. You patch one failure, break two others. Three engineers now babysit prompts full-time.
  4. New model drops. Half your patches stop working. Start over.
  5. Fine-tuning? $100K and weeks of work for opaque weights you can't inspect. Then the next model makes it obsolete.

We realized the problem wasn't the model. The problem was that nobody was training the system around the model: the prompts, tools, skills, and routing logic that determine whether an agent succeeds or fails.

So we built recursive learning for AI agents.

How recursive learning works

Every time your agent runs, our system watches, learns, and improves the next run. Success or failure, it doesn't matter. No human in the loop required.

1. Watch & Tag: Detect the 4-8 decision points that actually mattered. Zero cost, pure code heuristics.

2. Analyze Each Decision: One fast LLM call per decision asks: "What went wrong? What should have happened instead?"

3. Generate Fixes: Concrete artifacts (skills, routing rules, checklists, code templates), ~$0.02 each.

4. Validate (A/B Tested): Only fixes that actually improve performance on held-out problems get promoted. Everything else dies.

5. Inject Into Next Run: Right knowledge, right moment. Not buried in a 10K-token system prompt.
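The five steps above can be sketched as a single loop. This is a minimal illustration, not our actual implementation: every function name and schema here (`detect_decision_points`, `Fix`, the hard-coded eval scores) is an assumption made for the example.

```python
# Illustrative sketch of the recursive-learning loop; names and schemas
# are hypothetical, not a real API.
from dataclasses import dataclass

@dataclass
class Fix:
    kind: str          # "skill" | "routing_rule" | "checklist" | "template"
    content: str
    validated: bool = False

def detect_decision_points(trace):
    """Step 1: zero-cost code heuristics, e.g. tool switches, retries, errors."""
    return [step for step in trace if step.get("kind") in ("tool_call", "retry", "error")]

def analyze(decision):
    """Step 2: one fast LLM call per decision (stubbed out here)."""
    return {"what_went_wrong": "...", "should_have": "..."}

def generate_fix(analysis):
    """Step 3: turn the analysis into a concrete, readable artifact."""
    return Fix(kind="routing_rule", content=analysis["should_have"])

def validate(fix, score_with_fix, score_without_fix):
    """Step 4: A/B test on held-out problems; only improvements survive."""
    fix.validated = score_with_fix > score_without_fix
    return fix.validated

def learn_from_run(trace, knowledge_base):
    """Step 5: validated fixes join the knowledge injected into the next run."""
    for decision in detect_decision_points(trace):
        fix = generate_fix(analyze(decision))
        if validate(fix, 0.6, 0.5):  # scores would come from real held-out evals
            knowledge_base.append(fix)
    return knowledge_base
```

In a real system the scores passed to `validate` would come from actual held-out evaluations; here they are fixed placeholders.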

The output isn't weights; it's readable files.

Skills. Checklists. Routing rules. Code templates. A human can read, edit, approve, or reject every piece of learned knowledge. When a new model drops, the knowledge transfers. Nothing is lost.
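To make "readable files" concrete, here is a hypothetical example of what a single learned routing rule might look like once written to disk. The schema is invented for illustration and is not the actual artifact format.

```python
# Hypothetical learned artifact; the field names and values are
# illustrative, not the real file format.
learned_rule = {
    "type": "routing_rule",
    "trigger": "user asks for a refund AND the order is older than 30 days",
    "action": "route to the policy-exceptions tool instead of auto-replying",
    "provenance": "promoted after A/B validation on held-out problems",
    "status": "approved",  # a human can read, edit, approve, or reject this
}
```

Because the artifact is plain data rather than model weights, it survives a model swap unchanged and can be diffed, reviewed, and version-controlled like any other file.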

RL is great. You probably don't need it.

Reinforcement learning is a powerful tool. It's how OpenAI trained ChatGPT, how DeepMind's AlphaGo beat the best human Go players, and how frontier labs push the boundaries of reasoning.

It's also expensive, slow, and opaque: GPU clusters, weeks of training, and weights nobody can inspect.

Most companies don't need to change the model's weights. They need to change what's around the model: the prompts, tools, context, and routing logic that determine whether an agent succeeds or fails on their specific tasks.

That's what we do.

|                 | Fine-tuning / RL | Recursive Learning            |
|-----------------|------------------|-------------------------------|
| Cost            | $100K+ per run   | $500-1K in API calls          |
| Time            | Weeks            | Hours                         |
| Output          | Opaque weights   | Readable files you can edit   |
| New model drops | Start over       | Knowledge auto-transfers      |
| Infrastructure  | GPU clusters     | Just API calls                |
| Team required   | ML engineers     | Any engineer                  |
| Auditability    | None             | Every artifact inspectable    |

Proof: ARC-AGI-2 benchmark

ARC-AGI-2 is one of the hardest AI benchmarks in the world - abstract reasoning puzzles designed to test genuine intelligence, not pattern matching. Most frontier models score below 40% without specialized scaffolding.

We used recursive learning to take a small, cheap model (Claude Haiku) and make it dramatically outperform its weight class.

3.6% → 57.1%
Baseline → After 70 tasks of learning

Model: Claude Haiku (cheapest Claude model)

Injection overhead: ~$0.02-0.15 per run

Total learning phase: ~$600 in API calls across 70 training runs

| Learning phase         | Tasks learned from | Accuracy (28 eval tasks) |
|------------------------|--------------------|--------------------------|
| Baseline (no learning) | 0                  | 3.6% (1/28)              |
| After 35 tasks         | 35                 | 35.7% (10/28)            |
| After 70 tasks         | 70                 | 57.1% (16/28)            |

The cheapest Claude model + recursive learning achieved results that frontier models typically can't match without specialized systems. The learning is worth more than the model upgrade.

Work in progress. We're actively iterating on these results. This is an early proof of concept. We'll publish a more robust submission with full methodology soon.

Learning compounds

Every run makes the next run better. Wins teach what works. Losses teach what to avoid. The system extracts the difference automatically.

Solve → Learn → Inject → Solve better

After 100 runs of accumulated knowledge, run 101 is dramatically better than run 1. A competitor starting fresh can't match your system's accumulated intelligence.

This is the moat. It's not our code. It's your data.

How this compares

vs. RAG / Vector databases

RAG retrieves similar past context. Your agent still has to figure out what to do with it. We generate prescriptive fixes: concrete skills, rules, and checklists that tell the agent exactly what to do differently.

vs. Prompt optimization (DSPy, OPRO)

These optimize prompt wording. We optimize full agent behavior: tools, skills, routing, mid-run interventions, and code templates. Prompt wording is maybe 20% of the problem.

vs. LangSmith / trace analytics

They give you dashboards. You still need a human to review traces, spot patterns, and write fixes. We close the loop automatically, from failure detection to fix generation to validation to injection.

vs. Fine-tuning / RLHF

They change the model's weights (expensive, opaque, model-specific). We change what's around the model (cheap, readable, model-portable). See comparison table above.

Built for teams running agents at scale

Who we are

We helped build and scale a company from 0 to $100M ARR. We saw firsthand how AI agent failures compound at scale, and how the current solutions (prompt engineering, fine-tuning, hope) don't work.

We're building the infrastructure layer that makes AI agents learn from experience. Not by changing the model, but by changing everything around it.

Make your agents learn.

Early access: we're working with a small number of teams.

→ Book a 15-min call

Or reach out: kaushik@bentolabs.ai