Implementing an LLM Adapter Layer: Handle Vendor Deals Like Apple–Google in Your Stack
Build an LLM adapter that unifies Gemini, OpenAI and local models—consistent prompts, telemetry, rate limiting and cost controls.
Stop juggling vendors: unify Gemini, OpenAI and local models with one adapter
If you manage production LLM workloads in 2026, you face a predictable set of headaches: different APIs, inconsistent prompt behavior, invisible costs, and fragile observability when functions are short-lived. After the Apple–Google Gemini deal in early 2026, many teams realised vendor partnerships can change overnight, but your stack doesn't have to. This tutorial shows how to implement an LLM adapter layer (a service mesh for models) that gives consistent prompts, unified telemetry, and per-request cost and rate controls across Gemini, OpenAI and local models.
Executive summary
Build a lightweight adapter service that sits between your application and multiple LLM providers. It should:
- Present a single, versioned API for prompts and completions.
- Normalize prompts via template & schema enforcement.
- Collect telemetry & traces with OpenTelemetry for each provider call.
- Estimate token cost and enforce budgets at request or tenant level.
- Select providers using policy (price, latency, capability) with fallbacks.
Below you'll find a practical architecture, code examples (TypeScript/Node), telemetry setup, rate-limiting and cost-control algorithms, and CI/CD recommendations for safe rollouts.
Why an LLM adapter matters in 2026
Two trends made adapters essential this year:
- Vendor dynamics: Large platforms are consolidating. The Apple–Google Gemini partnership in early 2026 highlighted why teams must plan for provider shifts and interoperability changes.
- Edge & local inference growth: Many orgs run smaller or privacy-sensitive models on-prem or at the edge, requiring unified routing between cloud-hosted giants and local containers.
Without an adapter you reimplement prompt logic, cost controls and tracing for each provider. That increases bugs, cost surprises, and operational toil.
High-level architecture
The adapter is a small stateless service (or set of services) deployed in your serverless platform or k8s cluster. Key components:
- API gateway: handles auth, tenant routing, rate limiting and short-circuiting.
- Adapter core: normalize requests, select provider policy, emit telemetry, aggregate responses.
- Provider drivers: thin modules that translate our canonical request/response to provider-specific APIs (Gemini, OpenAI, local endpoints like Triton or text-generation-webui).
- Telemetry & billing: collector that records token estimates, latency, error rates and cost charged per tenant.
- Policy engine: decides provider based on SLA, cost, latency or explicit tenant preference.
Simple diagram (ASCII)
Client --> API Gateway --> Adapter Core --> {Provider Driver: OpenAI | Gemini | Local}
                                        \--> Telemetry & Billing
                                        \--> Policy Engine (decisions)
Designing a canonical request schema
Define a minimal canonical payload so drivers implement a conversion layer instead of app code knowing multiple formats. Example JSON schema fields:
- tenant_id
- request_id
- prompt_template_id (or inline prompt)
- inputs: object (structured data for templates)
- mode: {completion, chat, embedding}
- max_tokens, temperature, top_p
- budget_hint: numeric (USD cents) — used for cost caps
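A minimal sketch of the canonical schema as a TypeScript type, with a lightweight runtime check. The field names mirror the list above; the `Mode` union, the optional `model` override, and the `validateCanonical` helper are this article's conventions, not a standard:

```typescript
// Canonical request every provider driver consumes. A sketch, not a standard.
export type Mode = 'completion' | 'chat' | 'embedding'

export interface CanonicalRequest {
  tenant_id: string
  request_id: string
  prompt_template_id?: string   // either a managed template...
  prompt?: string               // ...or an inline prompt
  inputs: Record<string, unknown>  // structured data for templates
  mode: Mode
  max_tokens?: number
  temperature?: number
  top_p?: number
  budget_hint?: number          // USD cents the caller is willing to spend
  model?: string                // optional explicit model override
}

// Lightweight runtime check so drivers can trust the payload shape.
export function validateCanonical(req: Partial<CanonicalRequest>): string[] {
  const errors: string[] = []
  if (!req.tenant_id) errors.push('tenant_id is required')
  if (!req.request_id) errors.push('request_id is required')
  if (!req.prompt_template_id && !req.prompt) errors.push('prompt_template_id or prompt is required')
  if (!['completion', 'chat', 'embedding'].includes(req.mode as string)) errors.push('invalid mode')
  return errors
}
```

Keeping validation in the adapter (rather than in each driver) means a malformed request is rejected once, before any provider is billed.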
Why templates + schema?
Templates let you enforce consistent system messages and persona, while schema validation prevents prompt injection across providers. Versioned templates make behavior reproducible across model switches.
Provider drivers: translate, don't duplicate
Drivers should be tiny. They receive canonical payloads and return a canonical response. Example driver responsibilities:
- Translate prompt + inputs into provider payload (chat structure, messages, or prompt field)
- Map streaming vs non-streaming options
- Extract token usage and costs from provider response or headers
- Retry on transient errors using provider-specific recommendations
TypeScript driver sketch
/* driver/openai.ts */
// Node 18+ ships a global fetch, so no node-fetch dependency is needed.
export async function callOpenAI(canonical: CanonicalRequest): Promise<{ text: string; tokens: number }> {
  const payload = {
    model: canonical.model || 'gpt-4o',
    messages: buildMessagesFromTemplate(canonical), // renders the versioned template
    max_tokens: canonical.max_tokens,
    temperature: canonical.temperature
  }
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(payload)
  })
  if (!res.ok) throw new Error(`OpenAI error ${res.status}: ${await res.text()}`)
  const body = await res.json()
  // Prefer provider-reported usage; fall back to a local estimate.
  const tokens = body.usage?.total_tokens ?? estimateTokens(payload)
  return { text: body.choices?.[0]?.message?.content ?? '', tokens }
}
Prompt normalization & templates
Create a managed templates store (file-based or DB) and a renderer that fills structured inputs. Enforce a system prompt per template to ensure behavior parity across providers.
// Example handlebars-like template
System: You are a helpful assistant for billing support.
User: {{customer_prompt}}
When you change system prompts, bump the template version. Associate templates with test cases (input -> expected behavior) so CI can detect behavioral regressions when you swap providers.
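A minimal sketch of such a versioned template store and renderer for the `{{placeholder}}` syntax shown above. The names (`registerTemplate`, `render`, the `id@version` key format) are illustrative, not a library API:

```typescript
// Illustrative versioned-template store; a real deployment would back this
// with a DB or files, but the contract is the same.
interface PromptTemplate {
  id: string
  version: number
  system: string    // enforced system prompt for behavior parity across providers
  user: string      // contains {{placeholders}} filled from structured inputs
}

const templates = new Map<string, PromptTemplate>()

function registerTemplate(t: PromptTemplate): void {
  templates.set(`${t.id}@${t.version}`, t)
}

function render(id: string, version: number, inputs: Record<string, string>) {
  const t = templates.get(`${id}@${version}`)
  if (!t) throw new Error(`unknown template ${id}@${version}`)
  // Fail loudly on missing inputs instead of sending a half-rendered prompt.
  const fill = (s: string) =>
    s.replace(/\{\{(\w+)\}\}/g, (_, key) => {
      if (!(key in inputs)) throw new Error(`missing input: ${key}`)
      return inputs[key]
    })
  return { system: fill(t.system), user: fill(t.user), template: `${t.id}@${t.version}` }
}
```

The rendered result carries its `id@version` tag so telemetry and golden tests can attribute behavior to an exact template revision.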
Telemetry: traces, metrics and logs
Short-lived functions and serverless invocations make tracing critical. Instrument the adapter core and drivers with OpenTelemetry spans. Capture these fields per request:
- request_id, tenant_id
- provider, model, model_version
- tokens_in, tokens_out, estimated_cost
- latency_ms (adapter and provider)
- response_status & error codes
// simplified tracing pseudocode
const span = tracer.startSpan('adapter.request', { attributes: { tenant: tenant_id } })
span.setAttribute('provider', selectedProvider)
// then call driver which creates child spans for network calls
Forward metrics to your observability backend (Prometheus, Honeycomb, Datadog). Create dashboards for:
- Cost per tenant per day
- 95th percentile latency per provider
- Token consumption trends
- Provider error rates & fallback events
Estimating cost and enforcing cost controls
Most cloud LLM providers price by tokens and per-request features. Token accounting is imperfect — you should both read provider usage headers and run a local token-estimator so you can enforce budgets before sending requests to expensive models.
Token & cost estimation strategy
- Use a fast tokenizer (tiktoken or equivalent) locally to estimate tokens for prompt and expected completion.
- Convert tokens to cost using provider pricing table fetched at runtime (cache prices, refresh daily).
- Include overhead for provider-specific prompt wrapping (system prompts add tokens).
// cost estimate pseudocode
const promptTokens = tokenizer.count(prompt)
const expectedResponseTokens = canonical.max_tokens || 256
const totalTokens = promptTokens + expectedResponseTokens
const pricePer1k = priceTable[selectedProvider][model]
const estimatedUsd = totalTokens / 1000 * pricePer1k
Enforce cost caps in the adapter core:
- Reject requests that exceed tenant budget_hint
- Fallback to cheaper model or lower max_tokens when cost > cap
- Support pre-spend reservation for high-value flows (hold budget until completion)
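The estimate-then-enforce flow can be sketched as below. The price-table values are placeholders, not real provider pricing (fetch and cache real prices at runtime), and the ~4-characters-per-token heuristic stands in for a real tokenizer such as tiktoken:

```typescript
// Hypothetical per-model pricing; refresh from the provider's price sheet daily.
const pricePer1kTokensUsd: Record<string, number> = {
  'premium-model': 0.01,   // placeholder value
  'cheap-model': 0.001     // placeholder value
}

// Crude token estimate (~4 chars per token). Good enough for budget
// pre-checks; never use it for billing, prefer provider-reported usage.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

function estimateCostUsd(prompt: string, maxTokens: number, model: string): number {
  const totalTokens = estimateTokens(prompt) + maxTokens
  return (totalTokens / 1000) * pricePer1kTokensUsd[model]
}

// budgetHintCents follows the canonical schema: USD cents.
function enforceBudget(prompt: string, maxTokens: number, model: string, budgetHintCents: number):
  { allowed: boolean; estimatedCents: number } {
  const estimatedCents = estimateCostUsd(prompt, maxTokens, model) * 100
  return { allowed: estimatedCents <= budgetHintCents, estimatedCents }
}
```

When `allowed` comes back false, the adapter can retry the check against a cheaper model or a lower `max_tokens` before rejecting outright.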
Rate limiting & concurrency controls
Rate limits protect both provider budgets and stability. Implement two layers:
- API Gateway-level steady-state rate limiting (per-tenant, per-IP)
- Adapter-level token bucket & concurrency caps for provider-specific pools
// token bucket: refills lazily based on elapsed time, capped at capacity
class TokenBucket {
  private tokens: number
  private lastRefill = Date.now()
  constructor(private capacity: number, private refillPerSec: number) { this.tokens = capacity }
  tryTake(n: number): boolean {
    const elapsed = (Date.now() - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec)
    this.lastRefill = Date.now()
    if (this.tokens < n) return false
    this.tokens -= n
    return true
  }
}
// usage
if (!bucket.tryTake(requiredTokens)) {
  return { status: 429, message: 'rate limit exceeded' }
}
Additionally implement adaptive throttling: if a provider's 95th latency spikes or error rate increases, introduce temporary client-side backoff and route new requests to alternatives.
Provider selection policy
Your policy engine should be configurable and dynamic. Typical rules:
- Prefer low-cost providers if estimated latency is within SLA
- Use high-capability providers (Gemini Pro / OpenAI GPT-4o) for tasks requiring reasoning or grounding
- Respect tenant's explicit choices (e.g., a tenant with privacy requirements uses local models)
- Route to fallback if primary fails or exceeds latency/cost thresholds
// policy pseudocode
if (tenant.prefersLocal) return 'local'
if (estimatedCost < tenant.budgetHint && latencyEstimate < SLA.ms) return 'cheap-provider'
return 'premium-provider'
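The pseudocode above can be made concrete as a pure function. The provider names and threshold fields here are illustrative; in production they would come from per-tenant configuration:

```typescript
// Runnable version of the policy sketch; inputs are assumed field names,
// not an existing API.
interface PolicyInput {
  prefersLocal: boolean
  budgetHintCents: number
  estimatedCents: number
  latencyEstimateMs: number
  slaMs: number
}

type Provider = 'local' | 'cheap-provider' | 'premium-provider'

function selectProvider(p: PolicyInput): Provider {
  // Hard tenant preference (e.g. privacy requirements) wins outright.
  if (p.prefersLocal) return 'local'
  // Cheap provider only when it fits both the budget and the latency SLA.
  if (p.estimatedCents < p.budgetHintCents && p.latencyEstimateMs < p.slaMs) {
    return 'cheap-provider'
  }
  // Otherwise pay for capability and reliability.
  return 'premium-provider'
}
```

Keeping the policy a pure function of explicit inputs makes it trivial to unit-test and to replay past routing decisions from telemetry.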
Streaming & partial responses
Streaming support differs between providers. The adapter should expose both streaming and non-streaming endpoints and translate provider streams into a consistent SSE/WebSocket format to clients. Ensure telemetry captures partial-stream metrics (first-byte latency, total duration).
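A sketch of that translation, modeling any provider's stream as an async iterable of text chunks and emitting SSE-framed events. The event shape (`delta`, `done`, the timing fields) is this article's convention, not a provider format:

```typescript
// Translate a provider's chunk stream into SSE frames, capturing the
// first-byte latency and total duration the telemetry section calls for.
async function* toSSE(chunks: AsyncIterable<string>): AsyncGenerator<string> {
  const start = Date.now()
  let firstByteAt: number | null = null
  for await (const chunk of chunks) {
    if (firstByteAt === null) firstByteAt = Date.now()  // first-byte latency
    yield `data: ${JSON.stringify({ delta: chunk })}\n\n`
  }
  // Terminal event carries the stream-level metrics alongside the done marker.
  yield `data: ${JSON.stringify({
    done: true,
    firstByteMs: (firstByteAt ?? Date.now()) - start,
    totalMs: Date.now() - start
  })}\n\n`
}
```

Because every driver is adapted to the same `AsyncIterable<string>` shape, clients only ever parse one stream format regardless of which provider served the request.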
Security, privacy & compliance
2026 expectations increased for data governance. Add features:
- Mask or strip PII before sending prompts to external providers.
- Tenant opt-in to local-only inference (for regulated data).
- Key management: use per-tenant provider keys stored in KMS; avoid environment-wide keys.
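To make the masking step concrete, here is an illustrative pre-send scrubber. The regexes only catch obvious emails and card-like digit runs; a real deployment should use a vetted PII-detection library rather than hand-rolled patterns:

```typescript
// Illustrative PII scrubber run before a prompt leaves for an external
// provider. Patterns are deliberately simple; do not treat them as complete.
const EMAIL_RE = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g
const CARDLIKE_RE = /\b(?:\d[ -]?){13,16}\b/g

function maskPII(prompt: string): { masked: string; redactions: number } {
  let redactions = 0
  const masked = prompt
    .replace(EMAIL_RE, () => { redactions++; return '[EMAIL]' })
    .replace(CARDLIKE_RE, () => { redactions++; return '[NUMBER]' })
  return { masked, redactions }
}
```

Recording the redaction count in telemetry (never the redacted values) lets you audit how often sensitive data nearly left the boundary.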
Testing & CI/CD for adapter upgrades
Because the adapter mediates behavior, tests must cover functional, behavioral and cost aspects:
- Unit tests for drivers and template rendering.
- Contract tests that simulate provider responses (mock OpenAI/Gemini endpoints).
- Behavioral tests that ensure template version changes do not regress critical outputs—compare against golden outputs.
- Cost tests to detect outlier token usage in changes.
- Integration canaries where 1-5% of traffic is routed to a new provider or model.
// example CI job steps
- run unit tests
- run contract tests with provider mocks
- run behavior tests against golden set
- if a new model or template version is introduced: run a canary deployment and monitor metrics for 24h
Observability-driven rollouts
Automate rollouts with metrics gates. For example, when switching primary provider:
- Start at 1% traffic
- Monitor error rate, latency and token usage for 30 minutes
- If metrics exceed thresholds, automatically rollback
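The metrics gate in the steps above can be expressed as a pure decision function. Threshold values here are examples; in practice `current` is populated from your observability backend:

```typescript
// Metrics-gated canary decision: breach any threshold and the rollout
// rolls back automatically. Field names and thresholds are illustrative.
interface CanaryMetrics {
  errorRate: number       // fraction of failed requests, 0..1
  p95LatencyMs: number
  tokensPerRequest: number
}

interface Gate {
  maxErrorRate: number
  maxP95LatencyMs: number
  maxTokensPerRequest: number
}

type Decision = 'promote' | 'rollback'

function evaluateCanary(current: CanaryMetrics, gate: Gate): Decision {
  const breached =
    current.errorRate > gate.maxErrorRate ||
    current.p95LatencyMs > gate.maxP95LatencyMs ||
    current.tokensPerRequest > gate.maxTokensPerRequest
  return breached ? 'rollback' : 'promote'
}
```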
Use synthetic tests that exercise different templates and verify both correctness and cost estimates.
Edge cases and operational guidance
Common problems and fixes:
- Token counts that disagree with provider headers: keep a local tokenizer for pre-send estimates, but prioritise provider-reported usage for billing.
- Different completion semantics: normalize via system prompts and post-process rules (e.g., trim, validate JSON).
- Cold start latency for local models: keep warm pools or run tiny warmers in serverless prewarm tasks.
2026 trends to plan for
As of early 2026, expect:
- Increased hybrid deals (like Apple–Gemini): plan for contract changes and A/B-test provider outcomes.
- More specialized local models for vertical tasks — the adapter must support local driver runtimes (Triton, ONNX, GGML).
- Standardisation pressure — watch for common telemetry schemas and model capability tags; adopt them early to ease vendor switching.
"Siri is a Gemini" — a reminder that vendor relationships can shift product roadmaps; your abstraction layer should make such shifts transparent to your apps.
Full example: Minimal adapter flow (TypeScript + Fastify)
// app.ts (high-level sketch)
import Fastify from 'fastify'
import { selectProvider } from './policy'
import { estimateCost } from './cost'
import { callOpenAI } from './driver/openai'
import { callGemini } from './driver/gemini'
import { callLocal } from './driver/local'
import { tracer } from './telemetry'

const app = Fastify()

app.post('/v1/llm', async (req, reply) => {
  const canonical = req.body // validated by a schema hook earlier in the pipeline
  const span = tracer.startSpan('adapter.handle')
  try {
    const provider = selectProvider(canonical)
    span.setAttribute('provider', provider)
    // Estimate cost before spending money with the provider.
    const estimate = estimateCost(canonical, provider)
    if (estimate > canonical.budget_hint) {
      // Attempt a cheaper fallback here, or reject outright.
      return reply.code(402).send({ error: 'Budget exceeded' })
    }
    let result
    if (provider === 'openai') result = await callOpenAI(canonical)
    else if (provider === 'gemini') result = await callGemini(canonical)
    else result = await callLocal(canonical)
    span.setAttribute('tokens', result.tokens)
    reply.send({ text: result.text, tokens: result.tokens })
  } catch (err) {
    span.recordException(err)
    reply.code(500).send({ error: 'adapter_error' })
  } finally {
    span.end()
  }
})

app.listen({ port: 3000 })
Operational checklist before production
- Implement per-tenant keys and KMS integration
- Deploy provider driver mocks for offline testing
- Create template versioning & golden tests
- Hook telemetry to backend and build cost dashboards
- Define SLAs and policy thresholds for auto-failover
Advanced strategies & future-proofing
To stay ahead in 2026:
- Introduce a capability registry: tag models by strengths (reasoning, coding, summarization) to route tasks more intelligently.
- Support multi-model fusion: scatter a request to multiple models and merge outputs (useful for safety & ensemble accuracy).
- Adopt schema-guided prompts for programmatic verification (e.g., require JSON output that you can validate and safely use).
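The verification half of schema-guided prompting can be as simple as parsing the model's reply and checking required fields before anything downstream trusts it. The `SummaryOutput` shape and field names are illustrative:

```typescript
// Sketch: validate a model's JSON reply before use. Returning null lets the
// caller retry with a stricter instruction or fall back to another model.
// Note: in practice you may first need to strip markdown code fences
// that models sometimes wrap around JSON output.
interface SummaryOutput {
  summary: string
  confidence: number   // expected in [0, 1]
}

function parseModelJSON(raw: string): SummaryOutput | null {
  try {
    const obj = JSON.parse(raw.trim())
    if (typeof obj.summary !== 'string') return null
    if (typeof obj.confidence !== 'number' || obj.confidence < 0 || obj.confidence > 1) return null
    return obj as SummaryOutput
  } catch {
    return null
  }
}
```

Pairing this validator with the template's golden tests catches providers that silently change output formatting between model versions.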
Case study: Replacing a costly model with a hybrid policy
One customer ran GPT-4o for all summarization work and saw unpredictable bills. They implemented an adapter with a policy: prefer a high-quality local summarizer for docs under 2k tokens; use Gemini Pro for long or complex documents. After rollout they reduced external spend by 45% while keeping perceived quality equal (A/B controlled test) — all without changing client code.
Wrap-up: Build once, swap providers safely
An LLM adapter layer transforms vendor churn, cost surprises and inconsistent prompts into manageable engineering tasks. By centralising prompt templates, telemetry, policy and billing logic you make your organization resilient to deals like Apple–Gemini and to the rapid evolution of model capabilities in 2026.
Actionable next steps (30/60/90)
- 30 days: Add a canonical request schema, local tokenizer and a single provider driver; implement templates.
- 60 days: Add telemetry with OpenTelemetry, token-based cost estimates and simple policy for fallback routing.
- 90 days: Add full provider fleet (Gemini, OpenAI, local), contract tests, canary rollouts and budget enforcement.
Call to action
Ready to stop rewriting prompt logic for every vendor? Start by scaffolding the canonical API and one driver. If you want a reference implementation, SDK examples for TypeScript, Go and Python, or CI job templates tuned for LLM contract testing — reach out or download the starter repo linked from functions.top. Make vendor swaps routine, not risky.