Multi-LLM Coordinator — Idea
Cost-aware agent harness that runs on the command line. A Claude Code clone written in Rust.
Update: this became Pantheon, a fun little project written in Rust. I dropped the original idea once I found out about OpenRouter Auto, and Pantheon now lives on as a Claude Code “clone” that supports multiple models.
Status: Built, pivoted from original idea.
Core angle: Personal LLM proxy with cost-aware routing
The Problem
Claude Pro ($20/mo) hits token limits too fast. Upgrading to $100/mo is wasteful when most prompts don’t need that level of intelligence. Most LLM usage is routine — summarizing, rephrasing, casual Q&A, simple lookups — and free/cheap models handle that just fine.
The Idea
A unified chat frontend that routes each request to the cheapest model capable of handling it well. You chat normally; the system decides what goes to Gemini Flash vs Claude Haiku vs Claude Sonnet behind the scenes.
This isn't cheating: products like Cursor and Perplexity appear to route between models internally in much the same way.
Architecture
        You → Chat UI / CLI
                ↓
Task Classifier (cheap model or heuristic)
                ↓
      ┌─────────┼──────────┐
      ↓         ↓          ↓
    Flash     Haiku      Sonnet
    (free)    (cheap)    (when needed)
                ↓
      Response streamed back
Routing Tiers
| Tier | Model | Use When |
|---|---|---|
| Free | Gemini 2.0 Flash, Groq/Llama 3 | Casual chat, rephrasing, summarizing, factual Q&A |
| Cheap | Claude Haiku, GPT-4o-mini | Light coding, structured tasks, short writing |
| Full | Claude Sonnet | Complex reasoning, nuanced writing, hard coding problems |
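A minimal sketch of how the tier table could live in code, assuming tiers are tried cheapest first. The `Tier` class, the `route` function, and the idea of escalating to the next tier when a model is unavailable (say, a rate-limited free tier) are illustrative assumptions, not something Pantheon is known to do. The model strings are the table's display names, not real provider API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    models: tuple[str, ...]
    use_when: str


# Ordered cheapest first, mirroring the routing table.
# Model names are display labels, not provider API model IDs.
TIERS = (
    Tier("free", ("Gemini 2.0 Flash", "Groq/Llama 3"),
         "casual chat, rephrasing, summarizing, factual Q&A"),
    Tier("cheap", ("Claude Haiku", "GPT-4o-mini"),
         "light coding, structured tasks, short writing"),
    Tier("full", ("Claude Sonnet",),
         "complex reasoning, nuanced writing, hard coding problems"),
)


def route(tier_name: str, unavailable: frozenset[str] = frozenset()) -> str:
    """Return the first available model in the requested tier.

    If every model in that tier is unavailable, escalate to the next
    more expensive tier (an assumed policy, not one from the notes).
    """
    names = [tier.name for tier in TIERS]
    if tier_name not in names:
        raise ValueError(f"unknown tier: {tier_name!r}")
    for tier in TIERS[names.index(tier_name):]:
        for model in tier.models:
            if model not in unavailable:
                return model
    raise RuntimeError(f"no model available at or above tier {tier_name!r}")
```

Keeping the tiers as ordered data means adding a model or reshuffling tiers is a one-line change, and the escalation order falls out of the tuple order.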
Classifier Options
- Rule-based heuristics (prompt length, keywords) — simplest
- Small cheap model (Haiku or Flash) to classify — smarter but adds latency
- Let the user tag prompts explicitly with @fast, @heavy, etc. (most transparent)
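The first and third options combine naturally: honor an explicit tag if one is present, otherwise fall back to heuristics. The sketch below is one rough way to do that. The keyword lists, the 2000-character threshold, and the mapping of `@fast` and `@heavy` onto tiers are all made-up assumptions that would need tuning against real prompts.

```python
import re

# @fast and @heavy come from the notes; mapping them to these tiers is assumed.
TAG_TO_TIER = {"@fast": "free", "@heavy": "full"}

# Hypothetical keyword lists, matched on whole words only.
HEAVY_KEYWORDS = frozenset({"prove", "refactor", "debug", "architecture", "trade-off"})
CHEAP_KEYWORDS = frozenset({"code", "function", "json", "regex", "sql"})
LONG_PROMPT_CHARS = 2000  # assumed cutoff for "long enough to need the full tier"


def classify(prompt: str) -> tuple[str, str]:
    """Return (tier, prompt) with any recognized leading tag stripped."""
    match = re.match(r"\s*(@\w+)\s*", prompt)
    if match and match.group(1).lower() in TAG_TO_TIER:
        return TAG_TO_TIER[match.group(1).lower()], prompt[match.end():]

    # Whole-word matching, so "improve" does not trigger "prove".
    words = set(re.findall(r"[a-z0-9'-]+", prompt.lower()))
    if len(prompt) > LONG_PROMPT_CHARS or words & HEAVY_KEYWORDS:
        return "full", prompt
    if words & CHEAP_KEYWORDS:
        return "cheap", prompt
    return "free", prompt
```

Defaulting to the free tier keeps costs down but will sometimes under-serve a hard prompt, which is where the explicit tags earn their keep.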
Interesting Extensions (later)
- Debate mode: Model A answers → Model B critiques → A revises (for high-stakes questions)
- Ensemble: Multiple models answer independently, show or aggregate results
- Cost dashboard: Track spend per model, per week
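Debate mode is simple enough to sketch as a loop. This assumes each model is wrapped as a plain function from prompt text to response text; the `Model` alias, the `debate` function, and the prompt wording are hypothetical placeholders rather than a tested prompting strategy.

```python
from typing import Callable

# Hypothetical wrapper type: takes a prompt, returns the model's text response.
Model = Callable[[str], str]


def debate(question: str, author: Model, critic: Model, rounds: int = 1) -> str:
    """Model A answers, Model B critiques, A revises; repeat `rounds` times."""
    answer = author(question)
    for _ in range(rounds):
        critique = critic(
            f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
            "List concrete flaws in this answer."
        )
        answer = author(
            f"Question:\n{question}\n\nYour previous answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRevise your answer."
        )
    return answer
```

Each round costs one extra call per model, so a single round roughly triples the request count. That fits the notes' framing of this as something for high-stakes questions only.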
Prior Art to Study
- RouteLLM: router paper and open-source framework from LMSYS (UC Berkeley), routes based on query difficulty
- LiteLLM — unified API layer across all providers (good building block)
- AutoGen — multi-agent, heavier
- Mixture of Agents paper — MoA from TogetherAI
Free/Cheap Model Options
- Gemini 2.0 Flash — generous free tier, fast, surprisingly capable
- Groq + Llama 3 — extremely fast inference, free tier available
- GPT-4o-mini — cheap, good for structured output
- Claude Haiku: cheapest Claude model, still good quality
- Mistral: free API tier, a good option if you want an EU-based provider
If Building: Start Small
- CLI that accepts a prompt + optional --tier flag
- LiteLLM as the backend abstraction layer (handles all provider APIs)
- Simple keyword classifier to start
- Add cost tracking (log model used + token count per request)
- Build toward automatic routing once you have data on what you actually ask
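The CLI and cost-tracking steps above can be sketched with the standard library alone. This assumes tier names `free`, `cheap`, and `full`, and a JSON-lines usage log. The function names and the record fields are illustrative, and the actual model call (via LiteLLM or anything else) is deliberately left out.

```python
import argparse
import json
import time


def parse_args(argv=None):
    """Parse a prompt plus an optional --tier override."""
    parser = argparse.ArgumentParser(description="Send one prompt to a model tier.")
    parser.add_argument("prompt")
    parser.add_argument(
        "--tier",
        choices=["free", "cheap", "full"],
        default=None,
        help="force a tier; omit to let the classifier decide",
    )
    return parser.parse_args(argv)


def usage_record(model, prompt_tokens, completion_tokens, ts=None):
    """Build one JSON log line: model used plus token counts for a request."""
    return json.dumps({
        "ts": time.time() if ts is None else ts,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    })


def append_record(path, line):
    """Append one record to a JSON-lines log file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def tokens_by_model(lines):
    """Sum total tokens per model from an iterable of JSON log lines."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        used = record["prompt_tokens"] + record["completion_tokens"]
        totals[record["model"]] = totals.get(record["model"], 0) + used
    return totals
```

Logging raw token counts rather than dollar amounts keeps the log valid when prices change; cost can be computed later from whatever rate table is current. The same log is also the data you would mine before switching on automatic routing.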
Open Questions
- CLI first or chat UI? (CLI is faster to build and test)
Solved / Non-Issues
- Conversation history across model switches — just pass the full messages array; LiteLLM normalizes format differences. Only real edge case: if conversation exceeds a model’s context window, truncate or summarize. Not a blocker.
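For the context-window edge case, here is a rough sketch of the "truncate" option: keep any system messages, then keep as many of the most recent turns as fit. It assumes OpenAI-style message dicts with string `content`, and it estimates tokens at about 4 characters each, which is only a ballpark. A real implementation would use the target model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate, assuming roughly 4 characters per token."""
    return max(1, len(text) // 4)


def fit_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so they come out of the budget first.
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    return system + kept[::-1]  # restore chronological order
```

Calling this with the target model's limit just before each request means a switch from a large-context model to a smaller one degrades gracefully instead of failing.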