v1.77.7-stable - 2.9x Lower Median Latency
Deploy this version

Docker:

```shell
docker run \
  -e STORE_MODEL_IN_DB=True \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:v1.77.7.rc.1
```

Pip:

```shell
pip install litellm==1.77.7.rc.1
```
Key Highlights
- Dynamic Rate Limiter v3 - Automatically maximizes throughput when capacity is available (< 80% saturation) by allowing lower-priority requests to use unused capacity, then switches to fair priority-based allocation under high load (≥ 80%) to prevent blocking (see the sketch after this list)
- Major Performance Improvements - 2.9x lower median latency at 1,000 concurrent users
- Claude Sonnet 4.5 - Support for Anthropic's new Claude Sonnet 4.5 model family with 200K+ context and tiered pricing
- MCP Gateway Enhancements - Fine-grained tool control, server permissions, and forwardable headers
- AMD Lemonade & Nvidia NIM - New provider support for AMD Lemonade and Nvidia NIM Rerank
- GitLab Prompt Management - GitLab-based prompt management integration
 
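As a rough illustration of the rate limiter behavior described above, the decision boils down to a saturation check followed by either "borrow unused capacity" or "enforce per-priority budgets". The sketch below is a minimal, assumed model (function name, parameters, and the saturation measure are illustrative; this is not the actual Dynamic Rate Limiter v3 implementation):

```python
# Minimal sketch of the dynamic rate limiting idea; all names and the
# structure are assumptions, not the actual LiteLLM limiter code.
SATURATION_THRESHOLD = 0.80  # switch point described in the release notes


def allow_request(
    priority_share: float,    # fraction of capacity reserved for this priority level
    priority_used_rpm: int,   # requests this priority has used in the current window
    total_used_rpm: int,      # requests used across all priorities in the window
    total_rpm_limit: int,     # overall capacity of the deployment
) -> bool:
    saturation = total_used_rpm / total_rpm_limit if total_rpm_limit else 1.0

    if saturation < SATURATION_THRESHOLD:
        # Below ~80% saturation: lower-priority traffic may borrow any unused capacity.
        return total_used_rpm < total_rpm_limit

    # At or above ~80% saturation: fall back to fair, priority-based allocation so
    # lower-priority requests cannot block higher-priority ones.
    priority_budget = int(total_rpm_limit * priority_share)
    return priority_used_rpm < priority_budget
```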
Performance - 2.9x Lower Median Latency
This update removes LiteLLM router inefficiencies, reducing lookup complexity from O(M×N) to O(1). Previously, the router built a new array on every request and ran repeated membership checks such as `data["model"] in llm_router.get_model_ids()`. Now, a direct ID-to-deployment map eliminates the redundant allocations and scans.
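In sketch form, the change looks roughly like this (illustrative only; apart from `get_model_ids()`, which is mentioned above, the helper and field names are assumptions rather than the actual router code):

```python
# Illustrative sketch of the router lookup change; only get_model_ids() comes
# from the description above, all other names are assumed.

# Before: each request rebuilt a list of IDs and scanned it
# (fresh O(M) allocation plus repeated O(N) scans).
def resolve_deployment_old(llm_router, requested_id):
    if requested_id in llm_router.get_model_ids():      # builds a list, then scans it
        for deployment in llm_router.model_list:         # second linear scan for the entry
            if deployment["model_info"]["id"] == requested_id:
                return deployment
    return None


# After: an ID-to-deployment map built once, giving O(1) lookups per request.
class DeploymentIndex:
    def __init__(self, model_list):
        # Assumes each deployment carries a unique id under "model_info".
        self.by_id = {d["model_info"]["id"]: d for d in model_list}

    def resolve(self, requested_id):
        return self.by_id.get(requested_id)              # constant-time lookup
```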
As a result, performance improved across all latency percentiles:
- Median latency: 320 ms → 110 ms (-65.6%)
- p95 latency: 850 ms → 440 ms (-48.2%)
- p99 latency: 1,400 ms → 810 ms (-42.1%)
- Average latency: 864 ms → 310 ms (-64%)
 
Test Setup
Locust
- Concurrent users: 1,000
- Ramp-up: 500
 
System Specs
- CPU: 4 vCPUs
- Memory: 8 GB RAM
- LiteLLM Workers: 4
- Instances: 4
 
Configuration (config.yaml)
View the complete configuration: gist.github.com/AlexsanderHamir/config.yaml
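For orientation, a minimal proxy config for this kind of load test could look like the sketch below. This is illustrative only; the model name, key, and api_base are placeholders, and the exact file used for the benchmark is the gist linked above:

```yaml
# Illustrative sketch only - see the linked gist for the exact benchmark config.
model_list:
  - model_name: fake-openai-endpoint        # placeholder routed model name
    litellm_params:
      model: openai/fake                    # placeholder upstream model
      api_key: fake-key                     # placeholder credential
      api_base: https://mock-endpoint.example.com/v1   # placeholder mock endpoint
```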
Load Script (no_cache_hits.py)
View the complete load testing script: gist.github.com/AlexsanderHamir/no_cache_hits.py
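A minimal Locust script of the kind referenced above might look like the following. It is a sketch under stated assumptions (endpoint path, model name, and key are placeholders; the unique message per request is only a guess at how the script avoids cache hits, as its filename suggests):

```python
# Illustrative Locust sketch - see the linked gist for the actual no_cache_hits.py.
import uuid

from locust import HttpUser, task, between


class LiteLLMUser(HttpUser):
    wait_time = between(0.5, 1)  # small think time between requests

    @task
    def chat_completion(self):
        # A unique message per request avoids proxy cache hits.
        self.client.post(
            "/chat/completions",
            json={
                "model": "fake-openai-endpoint",  # placeholder model from the config sketch
                "messages": [
                    {"role": "user", "content": f"no-cache-{uuid.uuid4()}"}
                ],
            },
            headers={"Authorization": "Bearer sk-placeholder"},  # placeholder key
        )
```

Such a script would typically be run against the proxy with something like `locust -f no_cache_hits.py --host http://localhost:4000 --users 1000 --spawn-rate 500`, matching the user count and ramp-up listed above.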
New Models / Updated Models
New Model Support
| Provider | Model | Context Window | Input ($/1M tokens) | Output ($/1M tokens) | Features | 
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching | 
| Anthropic | claude-sonnet-4-5-20250929 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching | 
| Bedrock | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching | 
| Azure AI | azure_ai/grok-4 | 131K | $5.50 | $27.50 | Chat, reasoning, function calling, web search | 
| Azure AI | azure_ai/grok-4-fast-reasoning | 131K | $0.43 | $1.73 | Chat, reasoning, function calling, web search | 
| Azure AI | azure_ai/grok-4-fast-non-reasoning | 131K | $0.43 | $1.73 | Chat, function calling, web search | 
| Azure AI | azure_ai/grok-code-fast-1 | 131K | $3.50 | $17.50 | Chat, function calling, web search | 
| Groq | groq/moonshotai/kimi-k2-instruct-0905 | Context varies | Pricing varies | Pricing varies | Chat, function calling | 
| Ollama | Ollama Cloud models | Varies | Free | Free | Self-hosted models via Ollama Cloud | 
Features
- Anthropic
  - Add new claude-sonnet-4-5 model family with tiered pricing above 200K tokens - PR #15041
  - Add anthropic/claude-sonnet-4-5 to model price json with prompt caching support - PR #15049
  - Add 200K prices for Sonnet 4.5 - PR #15140
  - Add cost tracking for /v1/messages in streaming response - PR #15102
  - Add /v1/messages/count_tokens to Anthropic routes for non-admin user access - PR #15034
- Gemini
  - Ignore type param for gemini tools - PR #15022
- Vertex AI
- Azure
- Ollama
  - Add ollama cloud models - PR #15008
- Groq
  - Add groq/moonshotai/kimi-k2-instruct-0905 - PR #15079
- OpenAI
  - Add support for GPT 5 codex models - PR #14841
- DeepInfra
  - Update DeepInfra model data refresh with latest pricing - PR #14939
- Bedrock
- Nvidia NIM
  - Add Nvidia NIM Rerank Support - PR #15152
Bug Fixes
New Provider Support
- AMD Lemonade
  - Add AMD Lemonade provider support - PR #14840
 
 
LLM API Endpoints
Features
- Return Cost for Responses API Streaming requests - PR #15053
- Add full support for native Gemini API translation - PR #15029
- Passthrough Gemini Routes
- Passthrough Vertex AI Routes
- General
Management Endpoints / UI
Features
- Virtual Keys
- Models + Endpoints
- Admin Settings
- MCP
Bug Fixes
- Virtual Keys
- Models + Endpoints
  - Make UI theme settings publicly accessible for custom branding - PR #15074
- Teams
  - Fix failed copy to clipboard in the HTTP UI - PR #15195
- Logs
- Test Key
  - Update selected model on key change - PR #15197
- Dashboard
  - Fix LiteLLM model name fallback in dashboard overview - PR #14998
Logging / Guardrail / Prompt Management Integrations
Features
- OpenTelemetry
  - Use generation_name for span naming in logging method - PR #14799
- Langfuse
- Prometheus
  - Support custom metadata labels on key/team - PR #15094
Guardrails
Prompt Management
Spend Tracking, Budgets and Rate Limiting
- Cost Tracking
  - Proxy: end user cost tracking in the Responses API - PR #15124
- Parallel Request Limiter v3
- Teams
  - Add model-specific tpm/rpm limits to teams on LiteLLM - PR #15044
MCP Gateway
- Server Configuration
- Bug Fixes
Performance / Loadbalancing / Reliability improvements
- Router Optimizations
  - +62.5% P99 Latency Improvement - Remove router inefficiencies (from O(M*N) to O(1)) - PR #15046
  - Remove hasattr checks in Router - PR #15082
  - Remove Double Lookups - PR #15084
  - Optimize _filter_cooldown_deployments from O(n×m + k×n) to O(n) - PR #15091 (see the sketch after this list)
  - Optimize unhealthy deployment filtering in retry path (O(n*m) → O(n+m)) - PR #15110
- Cache Optimizations
- Worker Management
  - Add proxy CLI option to recycle workers after N requests - PR #15007
- Metrics & Monitoring
  - LiteLLM Overhead metric tracking - Add support for tracking litellm overhead on cache hits - PR #15045
 
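To illustrate the kind of change behind the cooldown-filtering item above, here is a hedged sketch (hypothetical function and field names, not the actual LiteLLM router code): instead of scanning a cooldown list for every deployment, the cooldown IDs go into a set once, so each deployment is checked in O(1) and the whole pass is O(n).

```python
# Hypothetical sketch of set-based cooldown filtering; not the actual LiteLLM code.
from typing import Dict, List


def filter_cooldown_deployments(
    deployments: List[Dict], cooldown_ids: List[str]
) -> List[Dict]:
    # Build the lookup once: O(m) instead of re-scanning the cooldown list per deployment.
    cooldown_set = set(cooldown_ids)
    # Single pass over deployments with O(1) membership checks: O(n) overall.
    return [
        d for d in deployments
        if d["model_info"]["id"] not in cooldown_set  # assumed id location
    ]
```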
Documentation Updates
- Provider Documentation
- General Documentation
 
Security Fixes
- JWT Token Security - Don't log JWT SSO token on .info() log - PR #15145
 
New Contributors
- @herve-ves made their first contribution in PR #14998
- @wenxi-onyx made their first contribution in PR #15008
- @jpetrucciani made their first contribution in PR #15005
- @abhijitjavelin made their first contribution in PR #14983
- @ZeroClover made their first contribution in PR #15039
- @cedarm made their first contribution in PR #15043
- @Isydmr made their first contribution in PR #15025
- @serializer made their first contribution in PR #15013
- @eddierichter-amd made their first contribution in PR #14840
- @malags made their first contribution in PR #15000
- @henryhwang made their first contribution in PR #15029
- @plafleur made their first contribution in PR #15111
- @tyler-liner made their first contribution in PR #14799
- @Amir-R25 made their first contribution in PR #15144
- @georg-wolflein made their first contribution in PR #15124
- @niharm made their first contribution in PR #15140
- @anthony-liner made their first contribution in PR #15015
- @rishiganesh2002 made their first contribution in PR #15153
- @danielaskdd made their first contribution in PR #15160
- @JVenberg made their first contribution in PR #15146
- @speglich made their first contribution in PR #15072
- @daily-kim made their first contribution in PR #14764
 

