Cost-aware agent harness that runs on the command line. A Claude Code clone written in Rust.

Multi-LLM Coordinator — Idea

Update: this became Pantheon, a fun little project written in Rust. I punted on the original idea once I found out about OpenRouter Auto, and Pantheon now lives on as a Claude Code “clone” that supports multiple models.

Status: Built, pivoted from original idea.

Core angle: Personal LLM proxy with cost-aware routing

The Problem

Claude Pro ($20/mo) hits token limits too fast. Upgrading to $100/mo is wasteful when most prompts don’t need that level of intelligence. Most LLM usage is routine — summarizing, rephrasing, casual Q&A, simple lookups — and free/cheap models handle that just fine.

The Idea

A unified chat frontend that routes each request to the cheapest model capable of handling it well. You chat normally; the system decides what goes to Gemini Flash vs Claude Haiku vs Claude Sonnet behind the scenes.

This is not cheating: Cursor, Perplexity, and similar products reportedly route between models internally in much the same way.

Architecture

You → Chat UI / CLI
         ↓
    Task Classifier (cheap model or heuristic)
         ↓
  ┌──────┼──────────┐
  ↓      ↓          ↓
Flash  Haiku     Sonnet
(free) (cheap)  (when needed)
         ↓
    Response streamed back

Routing Tiers

Tier   Model                            Use When
Free   Gemini 2.0 Flash, Groq/Llama 3   Casual chat, rephrasing, summarizing, factual Q&A
Cheap  Claude Haiku, GPT-4o-mini        Light coding, structured tasks, short writing
Full   Claude Sonnet                    Complex reasoning, nuanced writing, hard coding problems
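
As a rough illustration, the tiers above could be modeled as a small Rust enum. This is a sketch only: the type and method names (Tier, models, default_model) are hypothetical, and the model strings are the display names from the table, not verified provider model IDs.

```rust
/// Routing tiers from the table, declared cheapest first so that the
/// derived ordering gives `Tier::Free < Tier::Cheap < Tier::Full`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Free,
    Cheap,
    Full,
}

impl Tier {
    /// Candidate models for this tier. These are display names copied from
    /// the table, not real provider model IDs (assumption for the sketch).
    pub fn models(self) -> &'static [&'static str] {
        match self {
            Tier::Free => &["Gemini 2.0 Flash", "Groq/Llama 3"],
            Tier::Cheap => &["Claude Haiku", "GPT-4o-mini"],
            Tier::Full => &["Claude Sonnet"],
        }
    }

    /// First listed model, used when nothing more specific is requested.
    pub fn default_model(self) -> &'static str {
        self.models()[0]
    }
}
```

Deriving the ordering keeps "is this tier at least as capable as that one" a plain comparison, which may be handy if routing ever needs to bump a request up a tier.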

Classifier Options

Rule-based heuristics (prompt length, keywords) — simplest
Small cheap model (Haiku or Flash) to classify — smarter but adds latency
Let the user explicitly tag with @fast, @heavy, etc. — most transparent
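
A minimal sketch of how the first and third options might combine: an explicit tag wins, otherwise a keyword and length heuristic decides. The @fast and @heavy tags come from the notes above, but the tiers they map to, the keyword lists, and the 2000-byte threshold are all placeholder assumptions to be tuned against real usage.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Free,
    Cheap,
    Full,
}

/// Explicit user tag at the start of the prompt. Mapping `@fast` to Free and
/// `@heavy` to Full is an assumption. Returns the tier and the remaining prompt.
fn tag_override(prompt: &str) -> Option<(Tier, &str)> {
    let trimmed = prompt.trim_start();
    for (tag, tier) in [("@fast", Tier::Free), ("@heavy", Tier::Full)] {
        if let Some(rest) = trimmed.strip_prefix(tag) {
            // Require whitespace or end of input after the tag,
            // so that "@faster" is not treated as "@fast".
            if rest.is_empty() || rest.starts_with(char::is_whitespace) {
                return Some((tier, rest.trim_start()));
            }
        }
    }
    None
}

/// Rule-based heuristic on keywords and prompt length (in bytes).
/// The keyword lists and the threshold are placeholders.
fn heuristic(prompt: &str) -> Tier {
    let lower = prompt.to_lowercase();
    let hard = ["prove", "architecture", "debug", "refactor"];
    let light = ["code", "function", "json", "regex"];
    if hard.iter().any(|k| lower.contains(*k)) || prompt.len() > 2000 {
        Tier::Full
    } else if light.iter().any(|k| lower.contains(*k)) {
        Tier::Cheap
    } else {
        Tier::Free
    }
}

/// Returns the chosen tier plus the prompt with any tag stripped.
pub fn classify(prompt: &str) -> (Tier, &str) {
    match tag_override(prompt) {
        Some(hit) => hit,
        None => (heuristic(prompt), prompt),
    }
}
```

Checking the tag before the heuristic keeps the user in control: the heuristic only runs when no tag is present, so a wrong guess can always be overridden by hand.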

Interesting Extensions (later)

Debate mode: Model A answers → Model B critiques → A revises (for high-stakes questions)
Ensemble: Multiple models answer independently, show or aggregate results
Cost dashboard: Track spend per model, per week
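
The debate flow could be sketched roughly as below. The models are plain closures standing in for real provider calls, and the prompt wording is made up for illustration; nothing here reflects how Pantheon actually does it.

```rust
/// Debate mode: model A answers, model B critiques, A revises.
/// `a` and `b` are stand-ins for provider calls (prompt in, reply out).
/// Returns the transcript as [draft, critique, revision].
pub fn debate(
    question: &str,
    a: impl Fn(&str) -> String,
    b: impl Fn(&str) -> String,
) -> [String; 3] {
    // Step 1: A produces a first draft.
    let draft = a(question);

    // Step 2: B sees the question and the draft and critiques it.
    let critique = b(&format!(
        "Question: {}\nAnswer: {}\nPoint out errors and gaps in this answer.",
        question, draft
    ));

    // Step 3: A sees its own draft plus the critique and revises.
    let revision = a(&format!(
        "Question: {}\nYour answer: {}\nCritique: {}\nRevise your answer.",
        question, draft, critique
    ));

    [draft, critique, revision]
}
```

Each debate costs three model calls instead of one, which is why it is probably best reserved for the high-stakes questions mentioned above.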

Prior Art to Study

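RouteLLM: LMSYS router paper and framework; routes between a stronger and a cheaper model based on query difficulty
LiteLLM: unified API layer across many providers (good building block)
AutoGen: Microsoft's multi-agent framework; heavier than this project needs
Mixture-of-Agents (MoA): paper from Together AI on combining the outputs of multiple models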

Free/Cheap Model Options

Gemini 2.0 Flash: generous free tier, fast, surprisingly capable
Groq + Llama 3: extremely fast inference, free tier available
GPT-4o-mini: cheap, good for structured output
Claude Haiku: cheapest Claude model, still good quality
Mistral: free API tier, a reasonable choice when EU hosting matters

If Building: Start Small

CLI that accepts a prompt + optional --tier flag
LiteLLM as the backend abstraction layer (handles all provider APIs)
Simple keyword classifier to start
Add cost tracking (log model used + token count per request)
Build toward automatic routing once you have data on what you actually ask
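
The first and fourth steps might look roughly like this in Rust, using only the standard library. Everything here is a sketch under assumptions: the struct names, the hand-rolled parser, and the field names are hypothetical, and token counts are assumed to come from whatever the provider reports.

```rust
use std::collections::BTreeMap;

/// Parsed command line: the prompt plus an optional `--tier` value.
#[derive(Debug, PartialEq)]
pub struct Cli {
    pub tier: Option<String>,
    pub prompt: String,
}

/// Hand-rolled parser for `[--tier <name>] <prompt words...>`.
/// Pass it `std::env::args().skip(1)` in a real binary.
pub fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Cli, String> {
    let mut tier = None;
    let mut words = Vec::new();
    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        if arg == "--tier" {
            tier = Some(it.next().ok_or("--tier needs a value")?);
        } else {
            words.push(arg);
        }
    }
    if words.is_empty() {
        return Err("missing prompt".to_string());
    }
    Ok(Cli { tier, prompt: words.join(" ") })
}

/// One log entry per request: which model ran and how many tokens it used.
#[derive(Debug, Clone)]
pub struct Usage {
    pub model: String,
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Total tokens (input + output) per model across the whole log.
pub fn tokens_per_model(log: &[Usage]) -> BTreeMap<String, u64> {
    let mut totals = BTreeMap::new();
    for u in log {
        *totals.entry(u.model.clone()).or_insert(0) += u.input_tokens + u.output_tokens;
    }
    totals
}
```

Logging raw token counts rather than dollar amounts keeps the log valid when prices change; cost can be computed later by multiplying by whatever the current per-token rates are.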

Open Questions

CLI first or chat UI? (CLI is faster to build and test)

Solved / Non-Issues

Conversation history across model switches: just pass the full messages array; LiteLLM normalizes format differences. The only real edge case is a conversation that exceeds a model’s context window, in which case truncate or summarize. Not a blocker.
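
For the truncation case, one possible approach is to keep the system message and drop the oldest turns until the rest fits. The sketch below assumes a crude estimate of about 4 characters per token, which is only a rule of thumb; a real tokenizer would be more accurate. The Message struct and function names are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Very rough token estimate: about 4 characters per token, plus 1.
/// This is an assumption for the sketch, not a real tokenizer.
fn estimate_tokens(m: &Message) -> usize {
    m.content.chars().count() / 4 + 1
}

/// Keep a leading system message (if any) and as many of the most recent
/// messages as fit in `budget` tokens, dropping the oldest turns first.
pub fn truncate_history(messages: &[Message], budget: usize) -> Vec<Message> {
    // Split off the system message so it always survives truncation.
    let (system, rest) = match messages.first() {
        Some(m) if m.role == "system" => (Some(m), &messages[1..]),
        _ => (None, messages),
    };

    let mut used = system.map_or(0, estimate_tokens);
    let mut kept = Vec::new();

    // Walk backwards from the newest message, stopping at the first
    // message that would push the total over the budget.
    for m in rest.iter().rev() {
        let cost = estimate_tokens(m);
        if used + cost > budget {
            break;
        }
        used += cost;
        kept.push(m.clone());
    }
    kept.reverse(); // restore chronological order

    system.cloned().into_iter().chain(kept).collect()
}
```

The budget would be set per model, so the same history can be trimmed harder when a request is routed to a model with a smaller context window.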