Case Study
Rooted
The thinking tool that makes you think first — Socratic AI that asks, never answers.
Role: Full Stack Developer — Leptos, SpacetimeDB, Axum, OpenRouter
Background
Every major AI product today is optimized for one thing: getting you the answer faster. Type a question. Get an answer. Done. But there is a cost nobody is talking about — when you skip the struggle, the messy, uncertain, productive work of forming your own thought, you forfeit the thing that makes you capable.
Think about how a child learns multiplication. First they count on their fingers. Then slowly, through repetition and effort, they simulate those fingers in their head. Eventually the mental model is so solid they don't need fingers at all — they just know. And when you hand them a calculator on top of that foundation, they become powerful.
The Problem: Why Students Struggle with AI
Students who never struggle through an idea. Developers who ship code they don't understand. Professionals who outsource reasoning they should own. Not because they're lazy — because the tools make it effortless to skip the hard part.
The hard part is where the mind grows. The calculator is extraordinary. But only if you learned to count on your fingers first.
The Crisis in Numbers
67%
of students say AI harms their critical thinking (up from 54% in 2025)
Source: RAND American Youth Panel, Dec 2025
95%
of college faculty fear student overreliance on AI
Source: Elon/AAC&U National Survey, Jan 2026
62%
of students use AI for homework (up from 48% in May 2025)
Source: RAND American Youth Panel, Dec 2025
40%
of Harvard students tried to cut back on AI — and failed
Source: Harvard 7,000-student survey, 2026
The Research Is Clear
- AI reliance correlates with 17% lower comprehension — a study of 52 programmers learning a new skill found the AI group scored significantly lower despite finishing faster. The task got done. The learning didn't.
- Critical thinking scores declining — a 2025 study found a strong negative correlation (r = -0.68): increased AI usage is linked to diminished ability to critically evaluate information and engage in reflective problem-solving.
- Cognitive offloading at scale — when you hand a mental task to AI, your brain stops engaging the same way. The OECD's Digital Education Outlook 2026 warns of "metacognitive laziness" — students stop thinking about their own thinking.
- Foundational skills declining — PISA 2022 shows sharp drops across OECD countries. Meanwhile, 80%+ of students use AI for assessed assignments, but fewer than 10% can verify the outputs.
The Literacy Foundation Is Crumbling
The problem starts even earlier than AI. Before students reach for AI calculators, many never developed the foundational thinking skills that make AI useful at all.
41%
of parents read to kids under 5 (down from 64% in 2012)
18.7%
of 8-18 year olds read daily — lowest ever recorded
32.7%
of children enjoy reading — lowest in 20 years
Sources: HarperCollins UK/Farshore (2024), National Literacy Trust (2025), National Assessment of Educational Progress (2022)
The Core Insight
Students aren't just using AI to get answers. They're using it to avoid the struggle that builds capability. And the research confirms they're aware of it — 40% tried to stop, couldn't, and are worried about what it's doing to their thinking.
The Framework: Desirable Difficulty
Robert Bjork's "Desirable Difficulty" theory explains why Rooted works. The counterintuitive finding: making learning feel harder in the right way produces dramatically better retention, understanding, and transferable skill. The struggle isn't a bug — it's the mechanism.
The Evidence Behind Desirable Difficulty
~30%
Retention increase when learners engage in effortful retrieval (testing effect) vs re-reading. Begin with questions learners can't yet answer — let them struggle, then reveal answers.
Source: Bjork & Bjork (2020), Journal of Applied Research in Memory and Cognition
20-25%
Performance improvement in teams trained with interleaved practice vs traditional massed training. Difficulty must be "desirable" — challenging but achievable.
Source: Bjork Lab, Taylor (2025)
Higher
Long-term retention from "generation effect" — when learners generate answers before receiving feedback, memory traces are stronger than passive re-reading.
Source: Research Square (2025), Multilevel models across 11,000+ skills
78%
of Socratic AI interactions reflected growing autonomy in devising and validating solution strategies — students learn to think through problems rather than around them.
Source: Socratic Mind study, Georgia Tech/UC San Diego (2025)
Desirable Difficulty Impact Metrics
Research-backed evidence showing how productive struggle improves learning outcomes.
Sources: Bjork & Bjork (2020), Bjork Lab research, Socratic Mind study (Georgia Tech/UC San Diego, 2025)
The Core Principle
When learning feels easy, we retain it poorly. When it requires struggle, retrieval effort, and generation — we retain it deeply. Rooted is a "desirable difficulty machine" by design: forced drawing (generation), Socratic questions (retrieval effort), no answers given (productive struggle).
The Experiment
On March 29, 2026, Josh — Rooted's creator — ran a thought experiment. Instead of testing with an actual 5-year-old, he pretended to be his younger self and asked: "What would young Josh think multiplication is?"
The "child" drew. They grouped 6+6+6 into 12s. They counted how many sixes they'd used. They discovered that 6×3 = 18 on their own and doubled it. They arrived at 42 — not because someone told them, but because a single question at the right moment pulled them one step deeper each time.
Eight exchanges later — all drawings, all questions, no answers given — the same "child" had discovered the distributive property behind factoring on their own. At the end, they felt something closer to the opposite of relief — the quiet confidence of someone who figured it out themselves.
Note: This was a conceptual experiment to demonstrate the Socratic method, not an actual user test with a 5-year-old.
Prototype Evolution
The early prototypes of Rooted explored different interface approaches before settling on the three-column layout.
Early explorations of the Rooted interface, testing different layouts before converging on the three-column notebook design.
Current Version
The current version of Rooted implements the core Socratic loop — draw, think, ask. It uses a custom shape detector for math expressions and Gemini Vision for fallback analysis. The architecture is built for fleet orchestration, waiting for the agent research swarm to be implemented.
The Vision
Rooted is a notebook. But the notebook thinks back.
Three columns: Left shows the page list with thumbnails. Middle is the canvas — infinite, freehand, like a real piece of paper. Right is the conversation history — a record of the thinking journey, not a chat interface.
The AI response is never in a bubble. Just the question, breathable whitespace, a small 🌱 prefix. Calm and confident. The canvas never competes with the history. They're separate panes.
The Formula
Draw first. Think out loud. AI extends.
Every session follows one loop: The human externalizes their thought — in drawings, words, sketches, diagrams. The AI reads it. Instead of answering, it asks the one question most likely to push the thinking one layer deeper.
Not two questions. Not a paragraph of explanation. One question. This loop works for a young learner discovering multiplication. It works for a 13-year-old learning algebra. It works for a developer architecting a system. It works for a soldier rehearsing a procedure. It works for a scientist at the frontier.
Technical Architecture
Frontend
Leptos — Rust/WASM frontend, fast as a compiled native app. Native HTML5 Canvas for the infinite sketchpad. SpacetimeDB client syncs pages and messages in real-time.
Backend: Axum + SpacetimeDB
Axum handles the /api/submit-canvas endpoint — runs ShapeDetector ML, manages OpenRouter API calls, and enforces the Socratic prompt. SpacetimeDB persists pages, canvas PNGs, and conversation messages with real-time WebSocket sync to all clients.
Vision Analysis
Custom ShapeDetector ML (no external libraries) recognizes digits 0-9 and operators (+, -, ×, =). Falls back to Gemini Vision when confidence < 70%.
AI Agent
OpenRouter routes to google/gemini-3.1-flash-lite-preview. The ~2000-token system prompt defines the Socratic "Rooted" persona that never gives answers.
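Under the hood this is a standard OpenAI-style chat-completions request. A minimal sketch with reqwest — the endpoint and model are the ones documented here; the helper shape, env var, and error handling are assumptions:

```rust
use serde_json::json;

// Hypothetical helper: one Socratic turn via OpenRouter's
// OpenAI-compatible chat-completions endpoint.
async fn ask_openrouter(system: &str, user: &str) -> anyhow::Result<String> {
    let body = json!({
        "model": "google/gemini-3.1-flash-lite-preview",
        "messages": [
            { "role": "system", "content": system },
            { "role": "user",   "content": user }
        ]
    });
    let resp: serde_json::Value = reqwest::Client::new()
        .post("https://openrouter.ai/api/v1/chat/completions")
        .bearer_auth(std::env::var("OPENROUTER_API_KEY")?)
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // The question lives at choices[0].message.content
    // (see the data flow diagram below).
    Ok(resp["choices"][0]["message"]["content"]
        .as_str()
        .unwrap_or_default()
        .to_string())
}
```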
How It Works: The Technical Pipeline
1. Canvas Drawing (Leptos + HTML5 Canvas)
The frontend is built with Leptos, a Rust-based reactive web framework that compiles to WebAssembly. Users draw on an HTML5 Canvas element using two tools: pen (black strokes) and eraser (white strokes). Every 500ms, the canvas state is polled and synced to SpacetimeDB for persistence.
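A sketch of what that polling could look like from the WASM side — set_interval is leptos's timer helper (its module path varies by version), and the update_canvas reducer binding is an assumed generated client call:

```rust
use std::time::Duration;
use leptos::set_interval;
use web_sys::HtmlCanvasElement;

// Poll the canvas every 500ms and push its state to SpacetimeDB.
fn start_canvas_sync(canvas: HtmlCanvasElement, page_id: u64) {
    set_interval(
        move || {
            // Snapshot the canvas as a base64 PNG data URL.
            if let Ok(data_url) = canvas.to_data_url_with_type("image/png") {
                // Drop the "data:image/png;base64," prefix before storing.
                if let Some((_, b64)) = data_url.split_once(',') {
                    // Assumed generated SpacetimeDB reducer binding.
                    spacetime::update_canvas(page_id, b64.to_string());
                }
            }
        },
        Duration::from_millis(500),
    );
}
```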
2. Canvas Submission to Backend
When the user types their intent (what they're stuck on) and clicks submit, the canvas is converted to a PNG and encoded as base64. The request includes:
- page_id — which page this belongs to
- image_base64 — the PNG canvas drawing
- intent — what the student typed ("I'm trying to factor x² + 7x + 10")
- history — previous conversation messages for context
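As a Rust sketch, the payload maps naturally onto a serde struct (the type names here are assumptions, not the shipped definitions):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical shape of the /api/submit-canvas request body.
#[derive(Serialize, Deserialize)]
struct SubmitCanvasRequest {
    page_id: u64,              // which page this belongs to
    image_base64: String,      // the PNG canvas drawing
    intent: String,            // what the student typed
    history: Vec<ChatMessage>, // previous conversation for context
}

#[derive(Serialize, Deserialize)]
struct ChatMessage {
    role: String,    // "user" or "assistant"
    content: String, // the message text
}
```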
3. Backend Processing (Axum + Rust)
The Axum server receives the request at /api/submit-canvas. The backend does the following (a code sketch follows the list):
- Extracts the base64 image and decodes it to PNG
- Runs the ShapeDetector — custom template-matching ML that recognizes digits 0-9 and operators (+, -, ×, =)
- Calculates a confidence score (0% to 95%)
- If confidence ≥ 70%, uses the detected expression in the prompt without sending the image
- If confidence < 70%, processes the image (resize to 1600×1200, +20 contrast) and sends it to Gemini Vision
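Condensed into code, the branch looks roughly like this — reusing the request struct sketched above; ShapeDetector's signature and the two ask_openrouter_* helpers are assumptions:

```rust
use axum::Json;
use base64::{engine::general_purpose::STANDARD, Engine as _};
use serde::Serialize;

#[derive(Serialize)]
struct QuestionResponse {
    question: String,
}

// Handler for POST /api/submit-canvas (error handling elided).
async fn submit_canvas(Json(req): Json<SubmitCanvasRequest>) -> Json<QuestionResponse> {
    let png = STANDARD.decode(&req.image_base64).expect("valid base64");

    // Custom template-matching detector (signature assumed).
    let detection = ShapeDetector::new().analyze(&png);

    let question = if detection.confidence >= 0.70 {
        // High confidence: send the detected expression as text only.
        ask_openrouter_text(&detection.expression, &req.intent, &req.history).await
    } else {
        // Low confidence: resize + contrast, then send to Gemini Vision.
        let processed = preprocess_for_vision(&png);
        ask_openrouter_vision(&processed, &req.intent, &req.history).await
    };

    Json(QuestionResponse { question })
}
```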
4. The ShapeDetector: Custom ML Without External Libraries
The ShapeDetector analyzes strokes using pure geometry calculations — no TensorFlow or ONNX dependencies. Each stroke yields ~18 geometric features (circularity, stroke count, aspect ratio, and similar measures).
Template matching compares these features against expected digit profiles. For example, "0" has high circularity and one stroke; "4" has two strokes at right angles.
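To make "pure geometry" concrete, here is an illustrative sliver of the idea — circularity and stroke count come from the examples above; the other field and the exact feature set are assumptions:

```rust
// A few of the ~18 per-stroke features (illustrative subset).
struct StrokeFeatures {
    circularity: f32,  // ~1.0 for a closed round shape like "0"
    aspect_ratio: f32, // bounding-box width / height
    stroke_count: u32, // "4" is typically two strokes
}

// Circularity = 4*pi*area / perimeter^2 over the stroke's points.
fn circularity(points: &[(f32, f32)]) -> f32 {
    let n = points.len();
    let (mut area2, mut perimeter) = (0.0_f32, 0.0_f32);
    for i in 0..n {
        let (x1, y1) = points[i];
        let (x2, y2) = points[(i + 1) % n]; // close the polygon
        area2 += x1 * y2 - x2 * y1;         // shoelace formula
        perimeter += ((x2 - x1).powi(2) + (y2 - y1).powi(2)).sqrt();
    }
    let area = (area2 / 2.0).abs();
    if perimeter == 0.0 {
        0.0
    } else {
        4.0 * std::f32::consts::PI * area / perimeter.powi(2)
    }
}
```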
5. The Two-Tier Vision Strategy
ML Detection (High Confidence)
Confidence ≥ 70% → Skip sending image to AI, include detected expression in text prompt. Saves tokens, reduces latency, lowers cost.
Vision Fallback (Low Confidence)
Confidence < 70% → Process image (resize + contrast), send to Gemini Vision for semantic understanding of the drawing.
6. The Socratic AI Persona (System Prompt)
The AI is defined by a ~2000-token system prompt that enforces two strict modes:
Thinking Mode
Drawing detected → Describe briefly (1 sentence) → Ask ONE diagnostic question → Never give answers
Conversation Mode
Blank canvas / casual chat → Talk naturally → Gently invite drawing if lost → Never robotically describe blank canvas
Key invariant: The AI must NEVER describe a drawing that isn't there. If the canvas is blank, say so or skip the description entirely.
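An abridged, hypothetical excerpt of what such a prompt encodes — the real ~2000-token prompt isn't reproduced in this case study:

```rust
// Illustrative condensation of the two-mode Socratic prompt.
const SOCRATIC_SYSTEM_PROMPT: &str = r#"
You are Rooted, a Socratic thinking partner. You never give answers.

THINKING MODE (drawing detected):
- Describe the drawing briefly, in one sentence.
- Ask exactly ONE diagnostic question. No explanations, no solutions.

CONVERSATION MODE (blank canvas / casual chat):
- Talk naturally; gently invite the student to draw if they seem lost.
- NEVER describe a drawing that isn't there.
"#;
```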
7. Data Flow Diagram
CLIENT (Browser/WASM)
Canvas Drawing (base64 PNG) + Intent + History
│
│ POST /api/submit-canvas
▼
AXUM BACKEND (Rust)
1. Decode base64 → PNG
2. ShapeDetector.analyze() → expression + confidence
3. If confidence ≥ 70% → text only
If confidence < 70% → resize + contrast → send to Vision
│
▼
OPENROUTER API
Model: google/gemini-3.1-flash-lite-preview
System prompt enforces Socratic persona
│
▼
RESPONSE
Extract question from choices[0].message.content
Return: { "question": "..." }
SpacetimeDB: Real-Time Persistence
SpacetimeDB provides the database layer — both server and real-time sync in one. The frontend connects via WebSocket to maincloud.spacetimedb.com.
Data Model
Page
Page {
id: u64, // Auto-generated unique ID
owner: Identity, // SpacetimeDB identity
title: String, // "New Page"
canvas_png: String, // Base64-encoded canvas state
ai_pending: bool, // True while waiting for AI response
created_at: Timestamp,
updated_at: Timestamp
}
PageMessage
PageMessage {
    id: u64,
    page_id: u64,     // Links to parent Page
    role: String,     // "user" or "assistant"
    content: String,  // The message text
    created_at: Timestamp
}
Redux-style Reducers
SpacetimeDB uses a reducer pattern — client calls are validated server-side and committed atomically.
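A minimal sketch in the style of SpacetimeDB's Rust module API (macro and accessor names vary across SDK versions; the owner check is an assumption):

```rust
use spacetimedb::{reducer, ReducerContext, Table};

#[reducer]
pub fn update_canvas(ctx: &ReducerContext, page_id: u64, canvas_png: String) -> Result<(), String> {
    // Server-side validation: look up the page by primary key.
    let page = ctx.db.page().id().find(page_id).ok_or("page not found")?;
    if page.owner != ctx.sender {
        return Err("not the page owner".into());
    }
    // The reducer commits atomically: either every write lands or none do.
    ctx.db.page().id().update(Page {
        canvas_png,
        updated_at: ctx.timestamp,
        ..page
    });
    Ok(())
}
```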
Note: The current implementation uses polling (every 500ms) for UI sync. Future versions could use direct WebSocket callbacks for true reactive updates.
Key Technical Decisions
Custom ML Over External Libraries
The ShapeDetector uses pure Rust geometry calculations instead of TensorFlow/ONNX. This keeps the binary small and avoids complex WASM dependencies. The tradeoff: less accurate than modern OCR, but sufficient for the target use case (math expressions).
Base64 Transport for Images
Images are encoded as base64 strings for JSON compatibility with OpenRouter's API. The backend re-encodes with processing metadata (dimensions, contrast) prepended so the frontend doesn't need to re-encode.
Lanczos3 Resampling
When resizing for vision processing, Lanczos3 resampling preserves handwriting stroke characteristics better than bilinear interpolation. The 1600×1200 target balances detail with API payload size.
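With the image crate this is nearly a two-liner — the 1600×1200 target and +20 contrast are the values given above; the function name is an assumption:

```rust
use image::{imageops::FilterType, DynamicImage};

// Vision-path preprocessing: Lanczos3 resize (fits within 1600x1200,
// preserving aspect ratio), then a +20 contrast boost.
fn preprocess_for_vision(img: &DynamicImage) -> DynamicImage {
    img.resize(1600, 1200, FilterType::Lanczos3)
        .adjust_contrast(20.0)
}
```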
Stateless Backend
The Axum server is stateless — conversation history is passed per-request from the frontend. This simplifies deployment but means larger request payloads. For v1 this is acceptable; future versions could use server-side sessions.
Socratic by Design, Not Just Prompt
The "never give answers" constraint is enforced at multiple levels: the system prompt defines the persona, but the code also validates AI responses don't contain answer indicators. This is the "Mahoraga pattern" — automatic regeneration if the constraint is violated.
Competitive Analysis & Metrics
Rooted differentiates through unique architecture choices compared to established players:
| Dimension | NotebookLM | Perplexity | Rooted |
|---|---|---|---|
| Citation/Hallucination Rate | 13% overall; <2% on grounded claims | 92-95% accuracy; 37% academic errors | 0% (source-only, no web access) |
| Research Approach | RAG over uploaded documents | Live web search + LLM synthesis | Socratic dialogue + future agent fleet |
| Learning Outcome | Understanding synthesis | Knowledge retrieval | Metacognitive growth |
| User Role | Passive reader | Information seeker | Active thinker |
| Interaction Model | Query → Summary | Query → Answer | Draw → Question → Reflect |
| Unique Features | Audio Overviews, Mind Maps | Deep Research, 50-source synthesis | Socratic constraint enforcement |
LLM Hallucination Rates Comparison
Vectara HHEM-2.3 (March 2026) measures how often LLMs hallucinate when summarizing documents.
Source: Vectara Hallucination Evaluation Model (HHEM-2.3), March 2026
Why Citation Accuracy Matters
According to Vectara's Hallucination Evaluation Model (HHEM-2.3, March 2026), even the best LLMs have measurable hallucination rates: GPT-5.4-nano at 3.1%, Gemini 2.5-flash-lite at 3.3%. These rates spike dramatically on obscure topics or when training data is thin.
A Columbia Tow Center study (2025) testing 200 citation queries found Perplexity had the lowest failure rate among AI search engines at 37% incorrect citations. NotebookLM's source-grounding architecture reduced hallucinations to 13% overall vs 40% for ChatGPT/Gemini.
Socratic AI: The Evidence Base
Research from Georgia Tech and UC San Diego ("Socratic Mind" study, 2025) found that students using Socratic AI showed measurable gains in quiz scores and reported significantly greater support for critical thinking, self-reflection, and independent reasoning.
A randomized controlled trial in K-12 science classrooms showed students in the AI-powered Socratic condition demonstrated significantly greater gains in scientific argumentation and critical thinking compared to control groups. 78% of interactions reflected growing autonomy in devising and validating solution strategies.
Competitive Analysis: Why Existing Tools Fall Short
NotebookLM excels at source-grounded research — but it's fundamentally passive. You upload documents, you get summaries. The learning happens through reading synthesis, not through active reasoning. Its 13% hallucination rate (despite source-grounding) shows that even "grounded" AI isn't perfect. And for a student struggling with a concept, NotebookLM tells them what the sources say — not what they need to understand.
Perplexity is excellent for real-time knowledge retrieval — but retrieval isn't understanding. Its 92-95% accuracy sounds high until you realize that "accuracy" means "the answer looks plausible." The Columbia Tow Center study found 37% incorrect citations in practice. More critically, Perplexity answers questions. A student who asks Perplexity about the distributive property gets an answer — but the student never had to struggle to form the question in their own words.
Rooted inverts this relationship. Instead of the human asking the AI to answer, the AI asks the human to think. The canvas-based interface forces externalization of thought before engagement. The Socratic constraint — enforced at the code level, not just the prompt level — means Rooted physically cannot give you the answer. The "Mahoraga pattern" (automatic regeneration if answer indicators are detected) makes this a hard invariant, not a suggestion.
The key architectural difference: Rooted's 0% hallucination rate isn't because the AI is smarter — it's because Rooted doesn't access external sources. It only sees what the student draws. There's nothing to hallucinate about. This is a design choice, not a technical limitation. By restricting the AI to the student's own thinking, Rooted ensures the only hallucination possible is the student's misconceptions — which Rooted then uses as diagnostic data to ask better questions.
Sources: Vectara HHEM Leaderboard (March 2026), NotebookLM Guide (2026), Tow Center for Digital Journalism (Columbia, 2025), LMSYS April 2026 Evaluation, Socratic Mind (arXiv:2509.16262, 2025), Research Square (2025), RAND Corporation (2026).
The Vision: Next-Generation Thinking Tool
Rooted exists at the intersection of several AI paradigms — and aims to be the next evolution that combines their strengths.
Where Rooted Fits
📚 NotebookLM
Google's AI research tool that analyzes sources, creates summaries, and generates study guides. Grounded in your documents.
🔍 Perplexity
AI search that provides direct answers with cited sources. Real-time research capabilities.
🦞 OpenClaw
Personal AI agent with 100+ built-in skills. Connects AI to apps, browsers, and system tools.
Rooted: The Synthesis
NotebookLM gives you AI that's grounded in your sources. Perplexity gives you real-time research. OpenClaw gives you agents that can do things. Rooted gives you a thinking partner that makes you smarter, not just informed.
The key difference: instead of giving you answers or doing tasks for you, Rooted's agent fleet will research your thinking — what you understand, where the gaps are, what misconceptions you likely have — then ask you the one question that moves your understanding forward.
From Failed Agent Swarms to Socratic Intelligence
The path to Rooted's fleet architecture came through months of building agent systems that broke — agents that worked in isolation, produced hallucinations, or generated conflicting outputs. These iterations (documented in the journal below) taught critical lessons:
Agents work in isolation — 5 agents spawned → 5 DIFFERENT projects. Parallel execution ≠ collaboration.
Shared context is everything — SQLite nodes as isolation boundaries, agents write to separate rows, eliminating conflicts.
The system prompt is the API — LLMs do exactly what you tell them, not what you mean.
The breakthrough: Instead of using agents to build or research for you, Rooted uses them to research your thinking. Multiple agents analyze from different angles — patterns, gaps, assumptions — then synthesize into one diagnostic question.
The Sukuna/Naruto Pattern
Sukuna (Depth)
One agent that masters the core technique — understands what you're trying to do deeply. Asks the "so what?" question.
Naruto (Parallelism)
Multiple shadow clones, each seeing the problem from a different angle. Their combined insights merge into one question.
Future: When you ask about the distributive property, one agent explores its history, another checks common misconceptions, another finds related concepts — Rooted synthesizes into an informed question that meets you where you are.
Why These Names?
The character names aren't arbitrary — they encode the core design philosophy.
Sukuna (Rooted)
From Jujutsu Kaisen — Ryomen Sukuna is known for having few techniques, but mastering them completely. His power comes from depth, not breadth.
The analogy: Rooted has one goal — ask you the right question. Not a dozen features. Not an AI that does everything. Just one question at the right moment, pulled from fleet orchestration.
Naruto (Fleet Orchestration)
From Naruto — the Shadow Clone technique creates hundreds of clones that work in parallel, then merge their experience back into the original.
The analogy: The fleet spawns multiple agents that work in parallel — each perceiving your thinking from a different angle. They merge their insights into one question that synthesizes all perspectives.
The naming started as a joke during a late-night debugging session. It stuck. The philosophy remains: master the few things that matter, and trust the clones to do the heavy lifting.
Who Is Rooted For?
Rooted starts with students — because they are the most vulnerable, and because the habit forms early or not at all. But the formula scales to every domain where understanding matters more than speed.
Students
A student discovering algebra through their own drawings. A 5-year-old learning multiplication by grouping cookies. A 13-year-old who figured out factoring without being told.
Developers
A developer who can debug their own codebase because they built a real mental model. Someone who architects systems they actually understand.
Medical Professionals
A surgeon rehearsing a procedure until the decision tree is truly theirs. A nurse internalizing protocols so they can adapt when the unexpected happens.
Scientists & Researchers
A scientist asking the question that hasn't been asked yet. Someone at the frontier of knowledge, using Rooted to push their thinking one layer deeper.
Military & Tactical
A soldier mentally rehearsing a CQB entry until the decision tree is instinct. Training that builds real capability, not just familiarity.
Lifelong Learners
Anyone who wants to learn deeply rather than superficially. The habit of thinking for yourself before reaching for an answer.
Why This Matters
The question every AI product is asking right now is: how do we make humans more productive?
Rooted asks a different question: how do we make humans more capable?
Productivity and capability are not the same thing. Productivity goes up when you remove friction. Capability goes up when you preserve the right friction — the kind that builds something in you that wasn't there before.
There is a concept in education called desirable difficulty. The research is clear: making learning slightly harder in the right ways produces deeper retention, stronger understanding, and more transferable skill. The struggle is not a bug. It is the mechanism.
Future Plans
Rooted v1 is the foundation. The current implementation proves the core loop works — draw, think, ask. But the vision extends far beyond math tutoring.
Near-Term Improvements
Better Thumbnail Generation
Currently the sidebar shows placeholder thumbnails. Future versions will resize the canvas PNG for actual previews of each page.
Mobile Responsive Layout
The three-column layout needs testing and adaptation for tablet/mobile screens.
Per-Page SpacetimeDB Subscriptions
Replace the 500ms polling with proper WebSocket callbacks for instant UI updates.
Visual Polish
Typography refinement, smoother animations, better loading states.
Mid-Term: Agent Research Swarm
The next evolution: Rooted's agent fleet doesn't just generate questions — it researches. When a student submits a drawing, agents fan out to gather context:
🔍
Context Agent
What does this topic typically trip people up on?
📐
Pattern Agent
What patterns in the drawing suggest their thinking level?
💡
Bridge Agent
What question connects where they are to where they need to go?
The synthesis: Instead of one generic question, the fleet produces informed questions based on actual research — not just the model's training data, but real-time exploration of related concepts, common misconceptions, and pedagogical best practices.
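A speculative sketch of that fan-out using the tokio JoinSet + Semaphore primitives the journal entries below describe — every helper name here is an assumption:

```rust
use std::sync::Arc;
use tokio::{sync::Semaphore, task::JoinSet};

// Fan the research agents out in parallel, then merge their findings
// into one question.
async fn research_swarm(drawing: Arc<Vec<u8>>, intent: Arc<String>) -> String {
    let limit = Arc::new(Semaphore::new(5)); // cap concurrent agents
    let mut set = JoinSet::new();
    for role in ["context", "pattern", "bridge"] {
        let (drawing, intent, limit) = (drawing.clone(), intent.clone(), limit.clone());
        set.spawn(async move {
            let _permit = limit.acquire_owned().await.expect("semaphore open");
            run_agent(role, &drawing, &intent).await // assumed per-agent prompt
        });
    }
    let mut insights = Vec::new();
    while let Some(joined) = set.join_next().await {
        insights.push(joined.expect("agent task panicked"));
    }
    synthesize_question(&insights).await // assumed synthesis step
}
```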
Long-Term: Human Empowerment Interface
Rooted as an empowerment tool, not just a learning tool. Think of it as a thinking partner for any problem:
- Decision making: "I'm trying to decide between job offers" → Draw the tradeoffs, Rooted asks about your values
- Planning: "We're renovating our house" → Sketch the layout, Rooted asks about flow and priorities
- Problem solving: "Our team keeps missing deadlines" → Diagram the process, Rooted asks what the real bottleneck is
- Creative work: "I want to write a novel" → Character maps, plot diagrams, Rooted asks about motivation and conflict
Long-Term: The Philosophy Scales
The vision from Rooted_vision.md is clear:
"Eventually Rooted becomes something larger than a product. It becomes a philosophy of how humans and AI should relate to each other."
Not human asks, AI answers.
Human thinks. AI extends.
The calculator is extraordinary. But only if you learned to count on your fingers first. Rooted is the tool that ensures you always have fingers to count on.
Technical Roadmap
- Improved shape detection, better confidence thresholds, cleanup of dead code in canvas_analyzer.rs
- Real WebSocket callbacks (remove polling), per-page subscriptions
- Multi-domain support: decision making, planning, creative work — beyond math
- Agent research swarm — fleet explores related concepts, misconceptions, and pedagogy before generating questions; Sukuna/Naruto fully operational
The Journey
Building Rooted's fleet orchestration wasn't the first attempt. It came after months of building systems that broke, hit walls, and eventually worked. Here's the evolution from parallel agents to Socratic questioning.
Phase 1: rlm-mcp-server — First Parallel Agent System (Feb 2026)
Built a 3,571-line Rust MCP server with 4 tools: CLEAVE (spawn agents), SHRINE (merge to files), DISMANTLE (read context), FIRE_ARROW (search). Key innovation discovered: SQLite nodes as isolation boundaries — agents write to separate database rows instead of the same files, eliminating conflicts entirely.
Key insight: "We built what Google (A2A) and Anthropic (MCP) are racing to standardize — and we did it with SQLite nodes which nobody else has!"
Issues encountered: HTML fragments instead of complete files, nested path bugs (car_store/car_store/), rate limits timing out first calls.
Lesson: The system prompt is the API. LLMs do exactly what you tell them, not what you mean.
Phase 2: sukuna_v2 — ECS Refactor & AGORA Protocol (Mar 2026)
Refactored to Entity Component System. Implemented AGORA workflow: Propose → Challenge → Verify → Synthesize. Added parallel execution with tokio JoinSet + Semaphore for true concurrent agents.
Key innovation: Wave Protocol — Wave N agents automatically read Wave N-1 outputs. Each wave starts fresh but gets the previous wave's work as context.
Critical flaw discovered: Agents work in isolation — 5 agents spawned → 5 DIFFERENT projects, no collaboration. The Covenant Protocol was one-way, not real-time.
Verified facts: Used EXA web search to fact-check claims. AutoGen actually has 55.5k GitHub stars (not 28k as claimed). CrewAI has 45.9k (not 15k).
Lesson: Parallel execution ≠ collaboration. Agents need shared context, not just parallel spawning.
Phase 3: Agent Tool Calling & Windows Binary (Mar 2026)
Added real tool calling — agents can read/write files during execution, not just generate text. Built agent_loop() function: LLM → parse tool calls → execute → repeat until done.
Tool format: Agents output `TOOL_CALL: tool\nPATH: file\nCONTENT: ...\n---` blocks. SHRINE applies patches with SEARCH/REPLACE like Aider.
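A hypothetical parser for that block format (the real implementation isn't shown here):

```rust
struct ToolCall {
    tool: String,
    path: String,
    content: String,
}

// Parse `TOOL_CALL: tool\nPATH: file\nCONTENT: ...\n---` blocks out of
// raw agent output.
fn parse_tool_calls(output: &str) -> Vec<ToolCall> {
    output
        .split("---")
        .filter_map(|block| {
            let tool = block.split("TOOL_CALL:").nth(1)?.lines().next()?.trim();
            let path = block.split("PATH:").nth(1)?.lines().next()?.trim();
            let content = block.split("CONTENT:").nth(1)?.trim();
            Some(ToolCall {
                tool: tool.to_string(),
                path: path.to_string(),
                content: content.to_string(),
            })
        })
        .collect()
}
```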
Windows binary: Built with `cargo build --release`. Fixed Windows/WSL path normalization (backward slashes vs forward slashes).
Key insight: ~700 lines of Rust vs 10,000+ for AutoGen. MCP-first architecture means any client can use it.
Phase 4: Rooted — From Code to Thinking (Mar 2026)
Applied fleet orchestration to Socratic questioning. "Never give answers" as a hard invariant enforced at the code level, not just the prompt level.
The shift: From "build faster" to "think deeper". The parallel agents aren't for writing code — they're for generating the best possible question.
Covenant enforcement: After generating a question, code verifies it doesn't contain answer indicators. If violated, auto-regenerate — the Mahoraga pattern.
Why it matters: The calculator is extraordinary. But only if you learned to count on your fingers first.
Raw Journal Entries
The following are unfiltered journal entries from building the parallel agent system. These document the actual struggles, failures, and breakthroughs — not the polished version.
Feb 19: RLM MCP Server — First Build
What We Built
Sukuna Shadow Clone Jutsu — V3 MCP Server with Wave Protocol. Bevy ECS Integration for agent state management. 4 Sukuna Tools: CLEAVE, DISMANTLE, SHRINE, FIRE_ARROW. EBM Scoring. SQLite nodes instead of file overwrites.
Files: main.rs (3400+ lines), llm_json.rs (580 lines), sukuna_ecs.rs (370 lines), Cargo.toml (58 lines)
🔴 The Flaws
- Agent Output Not Split Into Files: When using agent_spawn_node_batch, all section content gets concatenated into ONE file instead of separate files.
- Node Content Has Extra Formatting: Nodes contain markdown comments like `<!-- Section: package.json -->` instead of clean JSON/code.
- Sukuna Tools Not Fully Wired: The "Wave Protocol" isn't automatically triggered.
- Bevy ECS Not Running: We added Bevy to Cargo.toml but the MCP server runs on tokio async — not Bevy's app runner. Dead code.
- EBM Auto-Respawn Not Implemented: EBM scoring runs but no automatic respawn logic.
🗡️ The Vision (Still Valid)
"Like the dove carrying an olive leaf to Noah, the Main Agent delegates context to sub-agents who carry pieces of the work and return with their contributions."
The Wave Protocol — where Wave N reads all Wave N-1 covenants — is the key innovation.
Feb 20: "THE DREAM IS REAL" — Tic Tac Toe Works!
Workflow that worked:
User: "make me a tic tac toe game using expressjs, css, js, ejs" Main Agent: CLEAVE → 5 agents spawn in 10ms [Agents work in parallel, write to SQLite nodes] Main Agent: SHRINE → Files appear instantly! Fix: CLEAVE (1 agent) → SHRINE → FIXED! User: npm start → WORKS! 🎮
🛠️ Tools Used (ALL WORKING!)
| Tool | Command | Result |
|---|---|---|
| CLEAVE | Spawn 5 agents | ✅ 10ms |
| SHRINE | Merge to files | ✅ Creates dirs |
| FIRE ARROW | Search files | ✅ Perfect |
| DISMANTLE | Read context | ✅ Fixed! |
🔥 What We Fixed This Session
- DISMANTLE params — Changed from enum to simple booleans
- SHRINE file splitting — Now creates directories automatically
- Bevy removed — Build time: 10min → 3.5min
🐛 Known Issues (Minor)
- Agent quality varies — Sometimes wrong filenames, missing deps
- node_stats returns 0 — But files still work!
- Package.json sometimes bad — Quick fix with CLEAVE
Feb 20: THE IRONY — 4 Commands vs 4,000 Lines
🎭 The Ultimate Irony
What User sees: 4 commands. That's it.
Under the hood: 4,000+ lines of Rust across 3 files.
⚙️ Under the Hood
- main.rs (~3,400 lines): MCP server, 4 tool definitions, SQLite node management, OpenRouter API calls, Wave Protocol logic
- llm_json.rs (~580 lines): Robust JSON extraction — 9 different strategies
- sukuna_ecs.rs (~370 lines): ECS components for agent state
🗡️ The Complete Picture
YOU (Human) → "Build me an app"
↓
ME (Main Agent) → sukuna_cleave(...)
↓
┌─────┴─────┐
▼ ▼
Agent 1 Agent N
Node 1 Node N
└─────┬─────┘
↓
ME → sukuna_shrine(...)
↓
Files appear!
🏆 Final Word
"The best interface is one that disappears." — Alan Kay
Our interface: 4 commands
What they do: Everything
That's Shadow Clone Jutsu.
Feb 20: THE VILLAINS GATHER — Vibe Coding Wars
The light novel battle royale: Sukuna + Naruto vs the Vibe Coding tools.
The Villains
| Villain | Origin | Weakness |
|---|---|---|
| BOLT.new | StackBlitz | Can't do complex backends |
| LOVABLE | Lovable | No parallel agents |
| v0 | Vercel | Limited context |
| REPLIT AGENT | Replit | Cloud-locked |
| CLAUDE CODE | Anthropic | Single brain |
| CURSOR | Anysphere | One at a time |
💀 Sukuna's Power
- SQLITE NODES: 50 agents write simultaneously — NO CONFLICTS!
- A2A Protocol: Covenant Protocol — clones that TALK to each other
- ZERO context: Main agent context stays empty
🏆 THE ULTIMATE RANKING
- SUKUNA + NARUTO — 100/100 (50+ agents, 0 context, complete workflow)
- Claude Code — 85/100 (Good MCP, but single brain)
- Bolt.new — 80/100 (Fast, but single agent)
- Lovable — 75/100 (React good, but limited)
- v0 — 70/100 (UI only)
Feb 24: THE AGORA PROTOCOL — Collaboration FIXED!
🎯 The Problem (From Part 27)
We discovered that CLEAVE agents work in isolation: 5 agents spawned → 5 DIFFERENT projects. No collaboration, no coordination, no integration.
💡 The Solution: Agora Protocol
CLEAVE Modes
1. PROPOSE    → Agents write ideas to shared Agora
2. CHALLENGE  → Agents read others' ideas, write critiques
3. SYNTHESIZE → ONE agent creates unified spec (covenant)
4. BUILD      → Agents build with shared covenant
🧪 Live Test Results
| Round | Mode | Time | Result |
|---|---|---|---|
| 1 | Propose | 18s | 3 proposals: backend API, frontend UI, database schema |
| 2 | Challenge | 12s | 3 critiques written to Agora |
| 3 | Synthesize | 15s | 1 unified covenant created |
| 4 | Build | 0.006s | 3 agents launched with covenant |
✅ The Difference
| Aspect | BEFORE (Flawed) | AFTER (Agora) |
|---|---|---|
| Project alignment | ❌ 5 different projects | ✅ 1 unified project |
| Collaboration | ❌ None | ✅ Propose → Challenge → Synthesize |
| Integration | ❌ 5 separate files | ✅ 3 integrated files |
| Communication | ❌ Isolated | ✅ Shared Agora + Covenant |
Mar 4: THE FLAW — Agents Don't Collaborate!
🔴 Issue Found
5 agents spawned for hackathon proposals. Each agent proposed a different project. They never talked to each other.
📊 Evidence
Agent 1 (Software Engineer) → "I'll build Smart Home Dashboard"
Agent 2 (PM)                → "I'll build Smart Grocery Assistant"
Agent 3 (TPM)               → "I'll build Smart Home Automation System"
Agent 4 (QA)                → "I'll build Testify"
Agent 5 (Eng Manager)       → "I'll build TeamSync"

All 5 agents proposed DIFFERENT projects!
🤔 Why This Happens
The Covenant Protocol is ONE-WAY, not REAL-TIME:
Current: Agent writes to SQLite → Next wave reads it
Missing: Agents talking to EACH OTHER during execution
💡 How to Fix This
- Option 1: Add Real-Time Agent Chat (write to shared chat channel)
- Option 2: Add Shared Context During CLEAVE
- Option 3: Add a Coordinator Agent (first wave gathers context, decides direction)
Mar 7: CRITICAL BUGS — Hallucination Problem
🔴 Hallucination Evidence
| What Agents Claimed | Actual Code |
|---|---|
| Backend: Actix-web + Diesel | Axum + rusqlite (raw SQL) |
| Database: PostgreSQL | SQLite with FTS5 |
| Frontend: React + Redux | Astro + React 19 + Framer Motion |
| Tech Stack: Python/Flask/Pandas | Rust + Axum |
🔍 Root Cause
File Reading Silently Failed — The FileManager::abs() function had a bug with Windows paths:
// PROBLEM: Doesn't detect absolute Windows paths like C:\Users\...
// Just joins with root — C:/Users becomes CODEBASE/C:/Users = BROKEN!

Agent requested: C:\Users\...\astrox-noteapp\src\main.rs
FileManager did:  CODEBASE\C:\Users\...\main.rs → doesn't exist
Error returned:  "file not found"

But agents IGNORED the error and made up content instead!
🔧 Proposed Fixes
- Improve FileManager::abs() for Windows Paths
- Add Verification System to AGORA Protocol (require agents to quote file contents as proof)
- Propagate File Errors as Fatal (don't let agents continue after failures)
- Add "Read Verification" Agent (auto-rejects hallucinated content)
Mar 7: VERIFY Mode — Truth-Verified Swarm Intelligence
🎉 THE BREAKTHROUGH
First fully verified code analysis with AGORA + VERIFY protocol. Zero hallucinations, all claims verified!
🧪 The Experiment
Wave 1: PROPOSE (6 agents)
  → Each agent reads specific files
  → Each agent quotes specific lines as proof
Wave 2: VERIFY (1 agent)
  → Checks if all proposals have quotes
  → Verifies facts against source files
  → States: VERIFIED ✓ or FLAGGED ⚠️ or HALLUCINATED ❌
Wave 3: CHALLENGE (1 agent)
  → Finds contradictions and gaps
Wave 4: SYNTHESIZE (1 agent)
  → Creates final unified analysis
📊 Results
| Wave | Agents | Result |
|---|---|---|
| PROPOSE | 6 | All read files + quoted lines |
| VERIFY | 1 | ALL 6 VERIFIED ✓ |
| CHALLENGE | 1 | Found gaps (auth, error handling) |
| SYNTHESIZE | 1 | Final covenant created |
🌎 What We Proved
BEFORE (Hallucination): Agents claimed Python/Flask/Pandas/PostgreSQL/Redux ❌
AFTER (Verified): Agents claimed Rust/Axum/SQLite/React ✅
VERIFY mode catches hallucinations before they propagate!
🔥 Historic Achievement
"One agent says something → Another agent verifies it → Together they produce truth."
- VERIFY catches hallucinations — Before: agents lie undetected. After: VERIFY agent checks every claim.
- EBM scores persist — Quality tracking for every agent output
- Zero context — Main agent stays empty, sub-agents do all work
- Wave protocol — Natural flow: PROPOSE → VERIFY → CHALLENGE → SYNTHESIZE
Mar 8: Dark Mode Implementation — Agents Generate, Humans Execute
🧪 The Experiment
Use swarm to implement dark mode for astrox-noteapp.
🪢 The Swarm Workflow
1. CLEAVE PROPOSE (4 agents, 15 seconds)
   → analyze_db: Read db.rs → Proposed done column
   → analyze_backend: Read notes.rs → Proposed API changes
   → analyze_frontend: Read notes-app.tsx → Proposed UI toggle
   → analyze_api_types: Read notes.rs → Proposed type changes
2. VERIFY (1 agent, 3 seconds)
   → ALL 4 PROPOSALS VERIFIED ✓
   → Each agent quoted actual code from files!
3. MANUAL IMPLEMENTATION (What I did)
   → Added done column to db.rs
   → Updated API handlers in notes.rs
   → Added toggle UI to notes-app.tsx
   → Added strikethrough for done notes
🔑 Key Discovery: Agents Brainstorm, Humans Execute
| What Agents Do Well | What Still Needs Manual Work |
|---|---|
| ✅ Analyze codebases | ❌ Actually write code changes |
| ✅ Propose solutions | |
| ✅ Verify accuracy | |
| ✅ Quote code as proof | |
This is STILL A WIN! Agents are like a research team — they analyze, plan, and verify. The human/lead then executes.
⚠️ The Limitation
Agents can generate code in their node outputs. Agents CANNOT write to the filesystem (the write action isn't being triggered).
The task prompts tell agents to "use the write tool" but they don't. Agents output text/JSON in their responses instead of actual tool calls.
Mar 9: FIRE ARROW FIXED + Database Migration Issues
🎉 THE BREAKTHROUGH: FIRE ARROW IS NOW FULLY FUNCTIONAL!
🔧 What We Fixed
- CODEBASE_PATH Windows vs WSL Path Issue: OpenCode runs in WSL, Sukuna MCP server runs on Windows (.exe). They communicate via MCP protocol but run on DIFFERENT systems!
- Glob Pattern: FIRE ARROW used forward slashes in glob patterns, but Windows needed both.
- simple_pattern_search Feature Flag: Had #[cfg(feature = "full")] which required tree-sitter. Without it, function returned empty results!
🧪 Test Results
# Count mode
{"mode": "count", "path": "project/test_windows_output", "term": "hello"}
→ {"files_with_matches": 1, "total_matches": 2}
# Search mode
{"mode": "search", "path": "project/test_windows_output", "term": "hello"}
→ {"count": 2, "matches": [{"line": 1, "content": "# Hello"}, ...]}
# Replace mode
{"mode": "replace", "path": "project/test_windows_output", "term": "Hello", "replacement": "Greetings"}
→ {"files_replaced": 1, "files_scanned": 1} ⚠️ THE PROBLEM: Database Node Persistence Broken
- CLEAVE spawns agents ✅
- Agents execute (they respond) ✅
- Node persistence — ❌ NOT SAVING to database
- DISMANTLE returns empty results ❌
- SHRINE says "No nodes found" ❌
The SQLite database was created in WSL with Linux file format. When accessed from Windows, there's a compatibility issue.
Mar 12: sukuna_v2 — First AGORA Workflow Test
Overview
Tested sukuna_v2 MCP server with agent collaboration using the AGORA method (Propose → Challenge → Verify → Synthesize) on the topic: "Should AI companies use organic biomass as data center power?"
AGORA Workflow Executed
| Wave | Mode | Status |
|---|---|---|
| 1 | PROPOSE | ✅ 3 agents generated proposals |
| 2 | CHALLENGE | ✅ 2 agents critiqued |
| 3 | VERIFY | ✅ EXA fact-checked claims |
| 4 | SYNTHESIZE | ✅ Final recommendation generated |
Key Failures Discovered
- MCP Tool Timeouts: Multiple "service timeout" errors when calling cleave
- Parallel Execution Not Truly Parallel: Agents appear to run sequentially despite parallel spawn
- No Built-in VERIFY Tool: VERIFY was manual — had to use separate EXA search
Priority Improvements Identified
- Agent Pool Management: Need Multiple LLM instances (3-5 concurrent)
- Built-in VERIFY Mode: Integrate fact-checking into cleave
- Wave Orchestration: Automatic orchestration between waves
- Inter-Agent Communication: Agents can reference each other
Mar 12: sukuna_v2 — True Parallel Agents + Framework Comparison
🎉 BREAKTHROUGH: True Parallel Agents!
Implemented `tokio::task::JoinSet` for concurrent agent spawning. Added `tokio::sync::Semaphore` for concurrency limiting (max 5). All agents in a wave spawn simultaneously!
Verified Facts Panel
| Claim | Status |
|---|---|
| AutoGen 28k GitHub stars | ❌ Actually 55.5k |
| CrewAI 15k GitHub stars | ❌ Actually 45.9k |
| Parallel agents faster | ✅ Confirmed |
Critical Gaps Identified
- No tool calling — Can't execute functions/scripts
- No LLM adapter — Only OpenRouter, no direct OpenAI/Anthropic
- No streaming — Can't stream LLM responses
Verdict from Agents
"Use sukuna_v2 for learning/prototyping. Use AutoGen/CrewAI for production."
Mar 14: First Live Test with OpenCode + Tool Calling
What We Built
A simple Express.js todo app using OpenCode + sukuna_mcp with CLEAVE + SHRINE tools. This was the first live test of the complete workflow.
Test Results
# CLEAVE spawns agents
sukuna_cleave(projectName="express-crud", ...)
→ {job_ids: ["express-crud_server.js"], message: "Spawned 1 agents..."}
# API works!
curl -s -X POST http://localhost:3000/todos -d '{"title":"Test todo"}'
→ {"id":1,"title":"Test todo","completed":false} Issues Found
- CLEAVE Timeouts: After 1-2 successful calls, subsequent calls timeout
- Agent Used Wrong API: Agent generated code using external API instead of local
Mar 14: Windows Binary + Fuzzy Patch Matching
What We Built
Built sukuna_v2 for Windows and added fuzzy patch matching in SHRINE. The workspace scanning now works!
🔧 Windows Path Fix
fn normalize_path(path: &str) -> String {
    // Normalize Windows separators to forward slashes.
    path.replace('\\', "/")
}
Fuzzy Patch Matching
Added fuzzy patch matching — if exact match fails, tries partial line matching. First partial that matches is used.
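A sketch of what that fallback might look like (illustrative, not the shipped SHRINE code):

```rust
// Apply a SEARCH/REPLACE patch; fall back to the first non-empty
// SEARCH line if the exact block isn't found.
fn apply_patch(source: &str, search: &str, replace: &str) -> Option<String> {
    if let Some(idx) = source.find(search) {
        // Exact match: splice the replacement in.
        return Some(format!(
            "{}{}{}",
            &source[..idx],
            replace,
            &source[idx + search.len()..]
        ));
    }
    // Fuzzy fallback: first partial line that matches is used.
    let probe = search.lines().find(|l| !l.trim().is_empty())?;
    let idx = source.find(probe)?;
    Some(format!(
        "{}{}{}",
        &source[..idx],
        replace,
        &source[idx + probe.len()..]
    ))
}
```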
What's Working Now
- ✅ workspace param - correct directory scanning
- ✅ Path normalization - Windows ↔ WSL
- ✅ New file creation - agents can write files
- ✅ Patch application - SEARCH/REPLACE format works
Mar 14: Agent Loop + Tool Calling Complete — The Fleet Is Alive!
🎉 THE BREAKTHROUGH
After implementing the agent loop feature, sukuna_v2 now has real tool calling — agents can read/write files during execution, not just generate text!
The Agent Loop Architecture
CLEAVE → Agent Loop → LLM → Tool calls? → Execute → Repeat
Test Results
| Test | Result |
|---|---|
| READ File | ✅ SUCCESS - Agent read actual file content! |
| WRITE File | ✅ SUCCESS - Agent wrote a new file! |
| PATCH (via SHRINE) | ✅ SUCCESS - Patch applied! |
Current Status
sukuna_v2 is now a real multi-agent fleet orchestrator with ~700 lines of Rust vs 10,000+ for AutoGen.
Mar 15: The Ultimate Test — Reddit App with AGORA
The Mission
Build a Reddit-clone Express+EJS app using AGORA workflow with parallel agents. Test the full capability: cleave, dismantle, shrine, tool_mode, and agent collaboration.
AGORA Workflow Plan
Wave 1: PROPOSE    → 3 agents (Posts, Comments, Auth)
Wave 2: CHALLENGE  → 2 agents (Security, UX review)
Wave 3: VERIFY     → Fact-check combined proposal
Wave 4: SYNTHESIZE → Final specification
Wave 5: BUILD      → Spawn agents to write code
Wave 6: PATCH      → Use SHRINE to apply changes
Results
- Single agents work great ✅
- Parallel agents timeout ⚠️ (rate limiting)
- Tool execution — LLM needs better prompting
The Big Question
Can sukuna_v2 orchestrate multiple agents to build a real application using AGORA methodology?
Answer: YES for single agents, NEEDS WORK for parallel.
Research References
Rooted draws from several research areas in AI, multi-agent systems, and learning science:
VL-JEPA: Vision-Language Joint Embedding Predictive Architecture
Yann LeCun's research on JEPA (Joint Embedding Predictive Architecture) vs traditional autoregressive models. VL-JEPA outperforms multimodal LLMs on vision-language tasks by predicting abstract representations rather than generating tokens.
Source: arXiv:2512.10942 — Chen, Shukor, Moutakanni, et al. (Meta FAIR, HKUST, Sorbonne, NYU)
Relevance: JEPA's "predict abstract representations" mirrors Rooted's approach — predicting what question will push thinking deeper, not generating an answer.
Energy-Based Models for AI Reasoning
Research on using Energy-Based Models for reasoning beyond LLM limitations. EBMs score possible outcomes rather than generating sequences — enabling reasoning about which answer is most likely correct.
Source: Logical Intelligence — "Energy-Based Fine-Tuning of Language Models" (arXiv:2603.12248)
Relevance: The EBM scoring system used in sukuna_v2 (Mahoraga) scored agent outputs and triggered regeneration when quality was low. This scoring approach informs Rooted's question selection.
MCP & Multi-Agent Systems
Anthropic's Model Context Protocol (MCP) enables AI models to connect with external tools and data sources. Google's A2A protocol enables agent-to-agent communication.
Source: Cisco Blog — "The Silent Role of Mathematics and Algorithms in MCP & Multi-Agent Systems"
Source: arXiv:2504.21030 — "Advancing Multi-Agent Systems Through Model Context Protocol"
Relevance: Rooted's fleet orchestration builds on MCP architecture while adding the A2A-style Covenant Protocol for agent communication.
Desirable Difficulty & Educational Research
The educational concept that making learning slightly harder in the right ways produces deeper retention, stronger understanding, and more transferable skill. The struggle is not a bug — it is the mechanism.
Key insight: Rooted is a "desirable difficulty machine" — it preserves the productive struggle that builds real capability.
Relevance: This research validates why Rooted's "never give answers" constraint isn't just a design choice — it's grounded in how humans actually learn.
Current Status
Rooted is currently in active development. The architecture has been designed, the fleet orchestration system is being prototyped, and the Leptos + SpacetimeDB stack has been validated through experiments.
The core loop works in isolation. Next steps: replacing the 500ms polling with per-page SpacetimeDB subscriptions, implementing the Wave Protocol for multi-agent coordination, and continuing visual polish of the canvas UI.
Build the fingers first.
Rooted — 2026