Engineering · 9 min read

The Architecture Behind ZeroLimitAI: Multi-Provider AI Routing

A technical deep-dive into how ZeroLimitAI routes requests across multiple AI providers with automatic fallback, cost tracking, and tier enforcement.


The Problem: AI Provider Fragmentation

Every major AI provider — OpenAI, Anthropic, Google, Meta — has a different API, different pricing, different availability characteristics, and different strengths. Building an application that uses more than one means writing adapter code for each, handling different error formats, and managing multiple API keys.

ZeroLimitAI solves this with a unified AI router: a single interface that abstracts across all providers and routes requests intelligently.

The Unified Interface

Every provider is wrapped in a common interface:

interface AIProvider {
  id: string;
  chat(
    messages: ChatMessage[],
    options: ChatOptions
  ): Promise<ChatResult>;
  stream(
    messages: ChatMessage[],
    options: ChatOptions
  ): AsyncIterable<string>;
}

This means the rest of the application never knows which provider it's talking to. Swapping providers or adding new ones is a matter of implementing this interface — no changes needed upstream.
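As a concrete sketch, here is a stub provider implementing the interface. The ChatMessage, ChatOptions, and ChatResult shapes are illustrative assumptions, restated so the snippet is self-contained; the real definitions may differ.

```typescript
// Illustrative supporting types — assumed shapes, not ZeroLimitAI's actual ones.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }
interface ChatOptions { model?: string; maxTokens?: number; }
interface ChatResult { text: string; tokensIn: number; tokensOut: number; }

interface AIProvider {
  id: string;
  chat(messages: ChatMessage[], options: ChatOptions): Promise<ChatResult>;
  stream(messages: ChatMessage[], options: ChatOptions): AsyncIterable<string>;
}

// A trivial stub provider. Upstream code depends only on AIProvider, so a
// real adapter (OpenAI, Anthropic, ...) slots in exactly the same way.
class EchoProvider implements AIProvider {
  id = "echo";
  async chat(messages: ChatMessage[], _options: ChatOptions): Promise<ChatResult> {
    const last = messages[messages.length - 1]?.content ?? "";
    return { text: last, tokensIn: 0, tokensOut: 0 };
  }
  async *stream(messages: ChatMessage[], _options: ChatOptions): AsyncIterable<string> {
    const last = messages[messages.length - 1]?.content ?? "";
    for (const word of last.split(" ")) yield word + " ";
  }
}
```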

Tier Enforcement

Each AI model is tagged with a tier (FREE, ECONOMY, OPTIMIZED, PREMIUM). When a request comes in, the router checks the user's plan and filters to only eligible models:

const eligible = MODELS.filter(m =>
  TIER_ORDER[m.tier] <= TIER_ORDER[user.tier]
);

If the user requests a specific model above their tier, they receive an upgrade prompt rather than a silent downgrade. Transparency matters.
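A minimal sketch of how that check might look end to end. TIER_ORDER's numeric values and the resolveModels helper are illustrative assumptions based on the filter above; the upgrade prompt is modelled here as a thrown error that the API layer would translate.

```typescript
type Tier = "FREE" | "ECONOMY" | "OPTIMIZED" | "PREMIUM";

// Assumed ordering — lower numbers are lower tiers.
const TIER_ORDER: Record<Tier, number> = {
  FREE: 0, ECONOMY: 1, OPTIMIZED: 2, PREMIUM: 3,
};

interface Model { id: string; tier: Tier; }

// Returns the models a user may call. If a specific model was requested
// and sits above the user's tier, throw instead of silently downgrading;
// the API layer turns this into an upgrade prompt.
function resolveModels(models: Model[], userTier: Tier, requested?: string): Model[] {
  const eligible = models.filter(m => TIER_ORDER[m.tier] <= TIER_ORDER[userTier]);
  if (requested && !eligible.some(m => m.id === requested)) {
    throw new Error(`Model ${requested} requires a higher plan`);
  }
  return requested ? eligible.filter(m => m.id === requested) : eligible;
}
```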

Fallback Chain

AI providers go down. Rate limits get hit. Models get deprecated. The router handles this with a priority-ordered fallback chain per tier:

const fallbackChain = [
  "claude-sonnet-4-6",      // primary
  "gpt-4o",                 // fallback 1
  "x-ai/grok-3",            // fallback 2
];

If the primary model returns a 5xx error or a rate-limit response, the router automatically retries the request with the next model in the chain. The X-Model-Used response header tells the client which model actually served the request.
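The retry walk can be sketched as a simple loop. The RetryableError marker and the chat callback are illustrative; in practice retryability would be derived from the HTTP status (5xx or 429), and modelUsed would be surfaced via X-Model-Used.

```typescript
// Marker for errors worth retrying on the next model in the chain.
class RetryableError extends Error {}

async function chatWithFallback(
  chain: string[],
  chat: (model: string) => Promise<string>,
): Promise<{ text: string; modelUsed: string }> {
  let lastError: unknown;
  for (const model of chain) {
    try {
      // modelUsed is reported back to the client (X-Model-Used header).
      return { text: await chat(model), modelUsed: model };
    } catch (err) {
      // Non-retryable errors (bad request, auth) should fail fast.
      if (!(err instanceof RetryableError)) throw err;
      lastError = err;
    }
  }
  // Every model in the chain failed.
  throw lastError;
}
```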

Streaming Architecture

All providers support streaming, but their formats differ. OpenAI uses Server-Sent Events with data: {choices: [{delta: {content: "..."}}]}. Anthropic uses data: {type: "content_block_delta", delta: {text: "..."}}.

The router normalises these into a unified stream of data: {"text": "..."} events, which the frontend consumes identically regardless of provider.
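A sketch of that normalisation step. The two parse branches mirror the OpenAI and Anthropic delta shapes described above; the field access is an illustrative simplification, not a full handling of either provider's event stream.

```typescript
// Pull the text delta out of a parsed SSE payload, whichever provider
// shape it arrived in. Returns null for non-content events.
function extractText(payload: any): string | null {
  // OpenAI-style: {choices: [{delta: {content: "..."}}]}
  const openai = payload?.choices?.[0]?.delta?.content;
  if (typeof openai === "string") return openai;
  // Anthropic-style: {type: "content_block_delta", delta: {text: "..."}}
  if (payload?.type === "content_block_delta" && typeof payload?.delta?.text === "string") {
    return payload.delta.text;
  }
  return null;
}

// Emit the unified event the frontend consumes regardless of provider.
function toUnifiedEvent(payload: any): string | null {
  const text = extractText(payload);
  return text === null ? null : `data: ${JSON.stringify({ text })}`;
}
```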

Cost Tracking

Every response includes token counts (input + output). The router multiplies these by per-model rates stored in the database and writes a UsageRecord row. This powers the usage dashboard and daily limit enforcement.

const cost =
  (tokensIn * model.costPer1kIn / 1000) +
  (tokensOut * model.costPer1kOut / 1000);
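The same formula wrapped as a function, with a worked example. The rates used here are illustrative, not ZeroLimitAI's actual pricing.

```typescript
// Per-1k-token rates, as stored per model in the database.
interface ModelRates { costPer1kIn: number; costPer1kOut: number; }

function computeCost(tokensIn: number, tokensOut: number, model: ModelRates): number {
  return (tokensIn * model.costPer1kIn) / 1000 + (tokensOut * model.costPer1kOut) / 1000;
}

// Example: 2,000 input + 1,000 output tokens at $0.003/$0.015 per 1k
// comes to roughly $0.021.
const example = computeCost(2000, 1000, { costPer1kIn: 0.003, costPer1kOut: 0.015 });
```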

OpenRouter Integration

For models we don't integrate directly (Grok, Llama, Mixtral), we route through OpenRouter, which provides a single OpenAI-compatible API for 200+ models. This gives us immediate access to new models without writing new adapters.
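Because OpenRouter speaks the OpenAI chat completions dialect, one request builder covers every model it hosts. A minimal sketch, assuming OpenRouter's standard chat completions endpoint; buildOpenRouterRequest is an illustrative helper, not ZeroLimitAI's code.

```typescript
interface ORMessage { role: string; content: string; }

// Build a fetch-ready request for OpenRouter's OpenAI-compatible API.
// The same shape works for any hosted model; only the model ID changes.
function buildOpenRouterRequest(model: string, messages: ORMessage[], apiKey: string) {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

In use, `fetch(req.url, req.init)` is identical whether the model is Grok, Llama, or Mixtral, which is what makes new-model support an adapter-free change.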

Results

The router has been handling production traffic since launch. Key metrics:

  • Average fallback rate: ~2% (providers are reliable, but not perfect)
  • P99 latency overhead from routing logic: <5ms
  • Zero user-visible errors from provider outages in the last 60 days

The abstraction layer has also let us add new providers (Grok, Gemini Flash) in under 2 hours each — writing the adapter, adding the models to the database, and deploying.
