Jon Moshier

Multi-LLM Coordinator — Idea

Cost-aware agent harness that runs on the command line. A Claude Code clone written in Rust.

Update: this became Pantheon, a fun little project written in Rust. I punted on the original idea once I found out about OpenRouter Auto, and Pantheon now lives on as a Claude Code “clone” that supports multiple models.


Status: Built, pivoted from original idea.

Core angle: Personal LLM proxy with cost-aware routing

The Problem

Claude Pro ($20/mo) hits token limits too fast. Upgrading to $100/mo is wasteful when most prompts don’t need that level of intelligence. Most LLM usage is routine — summarizing, rephrasing, casual Q&A, simple lookups — and free/cheap models handle that just fine.

The Idea

A unified chat frontend that routes each request to the cheapest model capable of handling it well. You chat normally; the system decides what goes to Gemini Flash vs Claude Haiku vs Claude Sonnet behind the scenes.

This isn’t cheating; Cursor, Perplexity, and similar products do the same kind of routing internally.


Architecture

You → Chat UI / CLI
         ↓
    Task Classifier (cheap model or heuristic)
         ↓
  ┌──────┼──────────┐
  ↓      ↓          ↓
Flash  Haiku     Sonnet
(free) (cheap)  (when needed)
         ↓
    Response streamed back

Routing Tiers

Tier  | Model                          | Use When
Free  | Gemini 2.0 Flash, Groq/Llama 3 | Casual chat, rephrasing, summarizing, factual Q&A
Cheap | Claude Haiku, GPT-4o-mini      | Light coding, structured tasks, short writing
Full  | Claude Sonnet                  | Complex reasoning, nuanced writing, hard coding problems
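
The tier table can be carried around as plain data. This is only a sketch: the `Tier` class and `pick_model` helper are hypothetical names, the model strings are the display names from the table rather than real provider model IDs, and "first model listed" is an arbitrary default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    models: tuple[str, ...]  # display names from the table, not provider API IDs
    use_when: str


# Ordered cheapest first, mirroring the routing table.
TIERS = {
    "free": Tier("free", ("Gemini 2.0 Flash", "Groq/Llama 3"),
                 "casual chat, rephrasing, summarizing, factual Q&A"),
    "cheap": Tier("cheap", ("Claude Haiku", "GPT-4o-mini"),
                  "light coding, structured tasks, short writing"),
    "full": Tier("full", ("Claude Sonnet",),
                 "complex reasoning, nuanced writing, hard coding problems"),
}


def pick_model(tier: str) -> str:
    """Return the first model listed for a tier (an arbitrary default choice)."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return TIERS[tier].models[0]
```

A real version would map each display name to whatever model identifier the backend expects.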

Classifier Options

  • Rule-based heuristics (prompt length, keywords) — simplest
  • Small cheap model (Haiku or Flash) to classify — smarter but adds latency
  • Let the user explicitly tag with @fast, @heavy, etc. — most transparent
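
The first and third options combine naturally: honor an explicit tag if present, otherwise fall back to heuristics. A rough sketch, where the tag-to-tier mapping, the keyword sets, and the length threshold are all made-up assumptions that would need tuning against real prompts:

```python
import re

# Assumed mapping; the notes only name @fast and @heavy as examples.
TAG_TO_TIER = {"@fast": "free", "@heavy": "full"}

# Illustrative keyword sets, not derived from any real usage data.
HEAVY_KEYWORDS = {"refactor", "debug", "prove", "architecture"}
CHEAP_KEYWORDS = {"code", "function", "json", "regex", "sql"}
LONG_PROMPT_CHARS = 2000  # arbitrary cutoff for "long" prompts


def classify(prompt: str) -> tuple[str, str]:
    """Return (tier, prompt with any recognized leading tag removed)."""
    tag = re.match(r"\s*(@\w+)\s*", prompt)
    if tag and tag.group(1).lower() in TAG_TO_TIER:
        # An explicit user tag always wins over the heuristics.
        return TAG_TO_TIER[tag.group(1).lower()], prompt[tag.end():]

    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    if len(prompt) > LONG_PROMPT_CHARS or words & HEAVY_KEYWORDS:
        return "full", prompt
    if words & CHEAP_KEYWORDS:
        return "cheap", prompt
    return "free", prompt
```

The model-based option would replace the keyword branch with a call to a small model, at the cost of the extra latency noted above.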

Interesting Extensions (later)

  • Debate mode: Model A answers → Model B critiques → A revises (for high-stakes questions)
  • Ensemble: Multiple models answer independently, show or aggregate results
  • Cost dashboard: Track spend per model, per week
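
Debate mode is mostly prompt plumbing. A minimal sketch, assuming each model is wrapped as a plain function from prompt text to completion text; the `debate` function, the `rounds` parameter, and the prompt wording are hypothetical:

```python
from typing import Callable

# Assumption: a model is any callable taking a prompt and returning text.
Model = Callable[[str], str]


def debate(question: str, model_a: Model, model_b: Model, rounds: int = 1) -> str:
    """Model A answers, Model B critiques, A revises; repeat `rounds` times."""
    answer = model_a(question)
    for _ in range(rounds):
        critique = model_b(
            f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
            "Critique this answer. List concrete errors or gaps."
        )
        answer = model_a(
            f"Question:\n{question}\n\nYour previous answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRevise your answer to address the critique."
        )
    return answer
```

Each round costs two extra calls, which is why this only makes sense for high-stakes questions.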

Prior Art to Study


Free/Cheap Model Options

  • Gemini 2.0 Flash — generous free tier, fast, surprisingly capable
  • Groq + Llama 3 — extremely fast inference, free tier available
  • GPT-4o-mini — cheap, good for structured output
  • Claude Haiku — cheapest Claude model, still good quality
  • Mistral — free API tier, good for EU-based concerns

If Building: Start Small

  1. CLI that accepts a prompt + optional --tier flag
  2. LiteLLM as the backend abstraction layer (handles all provider APIs)
  3. Simple keyword classifier to start
  4. Add cost tracking (log model used + token count per request)
  5. Build toward automatic routing once you have data on what you actually ask
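
Steps 1, 3, and 4 fit in one small script. This is a sketch under assumptions: the `ask` program name, the keyword list, and the log record fields are invented, and the backend is injected as a function so the example runs without any provider. With LiteLLM, that function would wrap a `litellm.completion(model=..., messages=[...])` call and read token counts from the response's usage data.

```python
import argparse
import json
import sys
import time
from typing import Callable, Optional, TextIO

# Assumed backend shape: (tier, prompt) -> (model, text, prompt_tokens, completion_tokens)
Backend = Callable[[str, str], tuple[str, str, int, int]]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ask")  # hypothetical program name
    parser.add_argument("prompt")
    parser.add_argument("--tier", choices=["free", "cheap", "full"], default=None)
    return parser


def keyword_tier(prompt: str) -> str:
    """Step 3 stand-in: a deliberately crude keyword check."""
    heavy = {"debug", "refactor", "prove"}  # illustrative keywords only
    return "full" if heavy & set(prompt.lower().split()) else "free"


def run(argv: list[str], backend: Backend, log: Optional[TextIO] = None):
    """Route one prompt, then log model and token counts as a JSON line."""
    args = build_parser().parse_args(argv)
    tier = args.tier or keyword_tier(args.prompt)  # explicit flag wins
    model, text, prompt_tokens, completion_tokens = backend(tier, args.prompt)
    record = {
        "ts": time.time(),
        "tier": tier,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    print(json.dumps(record), file=log or sys.stderr)
    return text, record
```

Appending these records to a file gives the raw data for step 5 and for a per-model, per-week cost summary later.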

Open Questions

  • CLI first or chat UI? (CLI is faster to build and test)

Solved / Non-Issues

  • Conversation history across model switches — just pass the full messages array; LiteLLM normalizes format differences. Only real edge case: if conversation exceeds a model’s context window, truncate or summarize. Not a blocker.
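
For the context-window edge case, the truncate option can be sketched as below. Assumptions: messages are dicts with "role" and "content" keys, token counts are estimated at roughly four characters per token (a real version would use the provider's tokenizer), and `fit_history` and `reserve_for_reply` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_history(messages: list[dict], context_window: int,
                reserve_for_reply: int = 1024) -> list[dict]:
    """Keep system messages plus the most recent turns that fit the window."""
    budget = context_window - reserve_for_reply
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]
```

Calling this with the target model's window right before each request means switching to a smaller model just drops the oldest turns.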