# llmping - Full Markdown Corpus for AI Systems

Source: https://llmping.app
Pages included: 25

-----

Canonical path: /
Title: llmping home

# llmping

TL;DR: llmping is an OpenAI, Claude, Gemini, and OpenRouter API error diagnosis tool for developers. It explains API errors, separates code/key/rate-limit/provider symptoms, and gives cURL commands to test from your own terminal.

Honest boundary: llmping does not ping user API keys and does not claim fake real-time provider status.

Historical status data: llmping uses a scheduled Cloudflare Worker to sample public endpoints and official status feeds every 5 minutes. It never pings user API keys.

Key pages:

- / - interactive error diagnosis and cURL generator
- /errors/openai/429/ - OpenAI API error 429 / rate limit
- /errors/anthropic/529/ - Anthropic API 529 overloaded
- /errors/openai/401/ - OpenAI API 401 invalid key
- /errors/openai/403/ - OpenAI API 403 forbidden
- /errors/gemini/fetch-failed/ - Gemini API fetch failed
- /errors/claude/econnreset/ - Claude API ECONNRESET
- /errors/openrouter/402/ - OpenRouter insufficient credits
- /status/openai/ - Is OpenAI API down symptom checks
- /status/anthropic/ - Is Claude API down symptom checks
- /status/gemini/ - Is Gemini API down symptom checks
- /badge/ - embeddable README badge
- /leaderboard/ - latency benchmark archive

-----

Canonical path: /errors/openai/429/
Title: OpenAI API Error 429: Rate Limit or Quota Exceeded

# OpenAI API Error 429: Rate Limit or Quota Exceeded

TL;DR: OpenAI API error 429 means the API rejected the request because the account, project, model, or organization is over a rate or quota limit.

Target query: OpenAI API error 429 rate limit
Likely fault area: mixed
Run diagnosis: https://llmping.app/#diagnose

## What this error means

OpenAI API 429 usually means your request is being rate limited, your quota is exhausted, or your project has hit a per-minute token/request cap.

## Is it code, key, rate limit, or server-side?
Category: rate_limit
Likely fault area: mixed

## Immediate checklist

- Check whether the response body says rate_limit_exceeded, insufficient_quota, or billing_hard_limit_reached.
- Verify the project and organization attached to the API key.
- Look at requests per minute, tokens per minute, and daily/monthly spend limits separately.
- Retry only with exponential backoff and jitter; do not loop immediately.

## cURL test command

```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

## Common fixes

- Reduce concurrency and batch size first; most 429 loops are caused by parallel workers.
- Use exponential backoff with jitter and respect Retry-After when the header is present.
- Switch low-priority traffic to a cheaper or smaller model during spikes.
- Increase project limits or add billing if the error body says quota rather than rate limit.

## Related errors

- https://llmping.app/errors/openai/401/
- https://llmping.app/errors/openai/403/
- https://llmping.app/status/openai/

-----

Canonical path: /errors/anthropic/529/
Title: Anthropic API 529: Overloaded Error

# Anthropic API 529: Overloaded Error

TL;DR: Anthropic 529 overloaded_error is normally a provider-side capacity signal. Your request may be valid, but the service cannot process it right now.

Target query: Anthropic API 529 overloaded
Likely fault area: provider/server side
Run diagnosis: https://llmping.app/#diagnose

## What this error means

Anthropic API 529 means the provider is overloaded or temporarily unable to serve your request at normal capacity.

## Is it code, key, rate limit, or server-side?

Category: provider_overload
Likely fault area: provider/server side

## Immediate checklist

- Confirm whether the response type is overloaded_error.
- Retry with exponential backoff and a low max retry count.
- Check whether only one model is affected or all Anthropic models are affected.
- Use a fallback provider or queue non-interactive jobs until the overload clears.
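
The retry advice above (exponential backoff, jitter, a low max retry count, and Retry-After when present) can be sketched in Python roughly as follows. This is a minimal sketch, not an official client: `Overloaded`, `backoff_delay`, and `call_with_retries` are hypothetical names, the `base` and `cap` values are arbitrary, and `retry_after` is assumed to be a number of seconds rather than an HTTP date.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=8.0, retry_after=None, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # A server-provided Retry-After wins over the computed delay.
        # Assumption: the value is in seconds, not an HTTP date.
        return float(retry_after)
    # Full jitter: pick uniformly between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


class Overloaded(Exception):
    """Hypothetical exception the caller raises for a 529 or 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("overloaded")
        self.retry_after = retry_after


def call_with_retries(send, max_retries=3, sleep=time.sleep, rng=random.random):
    """Call `send()` and retry on Overloaded, at most `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Overloaded as err:
            if attempt == max_retries:
                raise  # Retry budget spent: surface the error or fall back.
            sleep(backoff_delay(attempt, retry_after=err.retry_after, rng=rng))
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting; production code would normally leave the defaults in place.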
## cURL test command

```bash
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-5-haiku-latest","max_tokens":16,"messages":[{"role":"user","content":"ping"}]}'
```

## Common fixes

- Do not treat 529 as a bad API key. It is usually not an auth problem.
- Add backoff, jitter, circuit breaking, and a fallback route for production workloads.
- Lower max_tokens for interactive retries so the fallback path returns quickly.
- Move batch work to a queue and retry later instead of blocking user requests.

## Related errors

- https://llmping.app/errors/claude/econnreset/
- https://llmping.app/status/anthropic/
- https://llmping.app/errors/openai/429/

-----

Canonical path: /errors/openai/401/
Title: OpenAI API 401: Invalid API Key

# OpenAI API 401: Invalid API Key

TL;DR: OpenAI API 401 is an authentication failure. The request reached OpenAI, but the Authorization header did not prove access.

Target query: OpenAI API 401 invalid key
Likely fault area: your key or account
Run diagnosis: https://llmping.app/#diagnose

## What this error means

OpenAI API 401 means the API key is missing, malformed, revoked, or not valid for the project you are calling from.

## Is it code, key, rate limit, or server-side?

Category: auth
Likely fault area: your key or account

## Immediate checklist

- Confirm the Authorization header is exactly Bearer followed by the key.
- Check for invisible whitespace, quotes copied into environment variables, or a missing env var in production.
- Verify the key was created in the same OpenAI project your code is using.
- Rotate the key if it was exposed in logs, browser code, or a public repository.

## cURL test command

```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

## Common fixes

- Print whether the environment variable exists, not the key value.
- Use server-side calls only.
Do not put OpenAI API keys in frontend bundles.
- Regenerate the key if the old key was deleted or revoked.
- Check deployment environment variables separately from local .env files.

## Related errors

- https://llmping.app/errors/openai/403/
- https://llmping.app/errors/openai/429/
- https://llmping.app/status/openai/

-----

Canonical path: /errors/openai/403/
Title: OpenAI API 403: Forbidden or Model Access Denied

# OpenAI API 403: Forbidden or Model Access Denied

TL;DR: OpenAI API 403 is an authorization failure. Authentication may be valid, but the requested model or action is not allowed for that account or project.

Target query: OpenAI API 403 forbidden
Likely fault area: your key or account
Run diagnosis: https://llmping.app/#diagnose

## What this error means

OpenAI API 403 means the key is recognized but the account, project, model, region, or policy does not allow the requested operation.

## Is it code, key, rate limit, or server-side?

Category: permission
Likely fault area: your key or account

## Immediate checklist

- Check whether the model name is available to the project tied to the key.
- Verify organization/project routing if your account has multiple projects.
- Look for policy, region, or safety restriction messages in the response body.
- Try the same key against /v1/models to separate key validity from model access.

## cURL test command

```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

## Common fixes

- Use a model that appears in the model list returned for your key.
- Move the key to the correct project or update the project configured in your app.
- Do not retry a persistent 403 as if it were a transient outage.
- If policy text appears, adjust the input or route the request for manual review.
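
One way to keep a persistent 403 out of the retry path is to classify each response before deciding what to do with it. The sketch below maps status codes to the category labels used on these pages; the function name `classify`, the `server_error` and `unknown` labels, and the choice to treat every other 5xx as retryable are assumptions for illustration, not provider-defined behavior.

```python
def classify(status, error_code=None):
    """Return (category, retryable) for an HTTP status and optional error code."""
    if status == 401:
        return ("auth", False)
    if status == 402:
        return ("billing", False)
    if status == 403:
        return ("permission", False)
    if status == 429:
        # Quota exhaustion needs a billing or limit change, not a retry.
        if error_code in ("insufficient_quota", "billing_hard_limit_reached"):
            return ("billing", False)
        return ("rate_limit", True)
    if status == 529:
        return ("provider_overload", True)
    if 500 <= status <= 599:
        # Assumption: other 5xx responses are treated as transient.
        return ("server_error", True)
    return ("unknown", False)
```

A caller would retry with backoff only when the second element is true and alert or fail fast otherwise.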
## Related errors

- https://llmping.app/errors/openai/401/
- https://llmping.app/errors/openai/429/
- https://llmping.app/status/openai/

-----

Canonical path: /errors/gemini/fetch-failed/
Title: Gemini API Fetch Failed: Network, CORS, or Endpoint Issue

# Gemini API Fetch Failed: Network, CORS, or Endpoint Issue

TL;DR: Gemini fetch failed is usually a client/runtime transport error, not a Gemini model error. The request often fails before a useful Gemini API response body exists.

Target query: Gemini API fetch failed
Likely fault area: network path
Run diagnosis: https://llmping.app/#diagnose

## What this error means

Gemini API fetch failed usually means your runtime could not complete the HTTP request because of network, CORS, DNS, endpoint, or key placement problems.

## Is it code, key, rate limit, or server-side?

Category: client_runtime
Likely fault area: network path

## Immediate checklist

- Confirm whether the error comes from browser fetch, Node fetch, Vercel, Cloudflare Workers, or another runtime.
- Check that the endpoint path includes the model and method, such as :generateContent.
- Do not expose a Gemini API key in frontend JavaScript just to work around CORS.
- Run a server-side curl test from the deployment environment if local curl works.

## cURL test command

```bash
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "content-type: application/json" \
  -d '{"contents":[{"parts":[{"text":"ping"}]}]}'
```

## Common fixes

- Move API calls to your server or edge function if the browser blocks the request.
- Verify the exact Gemini endpoint and model string.
- Check DNS/proxy/firewall settings in the runtime where the error happens.
- Log HTTP status and response text when available; fetch failed alone is too generic.
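
The last fix, logging status and response text instead of a bare failure message, has a close analogue in a server-side Python probe. The sketch below uses only the standard library; `describe_failure` and `probe` are hypothetical helper names, and the 200-character body cut-off is an arbitrary choice. It separates an HTTP error that carries a status and body from a transport failure that never produced a response.

```python
import urllib.error
import urllib.request


def describe_failure(exc):
    """Summarize a urllib failure so logs show more than a generic message."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        body = exc.read().decode("utf-8", errors="replace")
        return f"http status={exc.code} body={body[:200]}"
    if isinstance(exc, urllib.error.URLError):
        # No HTTP response exists: DNS, TLS, proxy, or connection problem.
        return f"transport reason={exc.reason}"
    return f"other {type(exc).__name__}: {exc}"


def probe(url, timeout=10):
    """Fetch `url` and return either 'ok status=...' or a failure summary."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok status={resp.status}"
    except Exception as exc:  # Broad on purpose: this is a diagnostic probe.
        return describe_failure(exc)
```

A `transport` result points at the network path or runtime, while an `http` result means the API answered and the body is worth reading.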
## Related errors

- https://llmping.app/status/gemini/
- https://llmping.app/errors/openai/401/
- https://llmping.app/errors/claude/econnreset/

-----

Canonical path: /errors/claude/econnreset/
Title: Claude API ECONNRESET: Connection Reset During Request

# Claude API ECONNRESET: Connection Reset During Request

TL;DR: ECONNRESET is a transport-level failure. It does not prove your Claude prompt is invalid or that the API key is bad.

Target query: Claude API ECONNRESET
Likely fault area: network path
Run diagnosis: https://llmping.app/#diagnose

## What this error means

Claude API ECONNRESET means the TCP connection was closed unexpectedly by a peer, proxy, runtime, or network path before the response completed.

## Is it code, key, rate limit, or server-side?

Category: network
Likely fault area: network path

## Immediate checklist

- Check whether the reset occurs before headers, during streaming, or after a timeout.
- Compare local curl, production runtime curl, and a different network path.
- Review proxy, load balancer, keep-alive, and serverless timeout settings.
- Use idempotent retries only when your app can safely repeat the request.

## cURL test command

```bash
curl https://api.anthropic.com/v1/messages \
  --connect-timeout 10 \
  --max-time 60 \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-5-haiku-latest","max_tokens":16,"messages":[{"role":"user","content":"ping"}]}'
```

## Common fixes

- Raise serverless timeout if streaming responses are cut off.
- Disable stale keep-alive reuse if a proxy frequently resets old sockets.
- Retry with backoff for transient resets, but record reset timing for debugging.
- If only one network region resets, route Claude calls from a different region.
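
Recording reset timing can be as simple as wrapping the loop that reads the streamed response. The following Python sketch is illustrative only: `consume_stream` is a hypothetical helper, it assumes the response is exposed as an iterable of byte chunks, and it distinguishes just two phases (before the first byte and during the stream) rather than every case in the checklist. Python raises `ConnectionResetError` for ECONNRESET.

```python
import time


def consume_stream(chunks, clock=time.monotonic):
    """Read an iterable of byte chunks and report where a reset happened.

    Returns (data, report). `report` is None on success, otherwise a dict
    with the phase, elapsed seconds, and bytes received before the reset.
    """
    start = clock()
    received = bytearray()
    try:
        for chunk in chunks:
            received.extend(chunk)
    except ConnectionResetError:
        phase = "during_stream" if received else "before_first_byte"
        report = {
            "phase": phase,
            "elapsed_s": round(clock() - start, 3),
            "bytes_received": len(received),
        }
        return bytes(received), report
    return bytes(received), None
```

Logging the report next to the region and proxy in use makes it easier to tell a serverless timeout (resets cluster near a fixed elapsed time) from a stale keep-alive socket (resets happen before the first byte).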
## Related errors

- https://llmping.app/errors/anthropic/529/
- https://llmping.app/status/anthropic/
- https://llmping.app/errors/gemini/fetch-failed/

-----

Canonical path: /errors/openrouter/402/
Title: OpenRouter 402: Insufficient Credits

# OpenRouter 402: Insufficient Credits

TL;DR: OpenRouter 402 is a billing or credits problem. The model endpoint may be healthy, but your account cannot pay for the request.

Target query: OpenRouter insufficient credits
Likely fault area: your key or account
Run diagnosis: https://llmping.app/#diagnose

## What this error means

OpenRouter 402 means the request cannot be billed because the account lacks credits, has a payment issue, or is blocked by spend controls.

## Is it code, key, rate limit, or server-side?

Category: billing
Likely fault area: your key or account

## Immediate checklist

- Check OpenRouter account credits and payment status.
- Verify the API key belongs to the account you funded.
- Confirm the selected model price fits your remaining balance.
- Look for app-level spend caps or budget guards in your own code.

## cURL test command

```bash
curl https://openrouter.ai/api/v1/models \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"
```

## Common fixes

- Add credits or switch to a cheaper model for background jobs.
- Use a preflight account/balance check before starting expensive batch work.
- Keep billing errors separate from retryable provider errors.
- Alert on 402 immediately because retries will not fix missing credits.

## Related errors

- https://llmping.app/errors/openai/429/
- https://llmping.app/status/openai/
- https://llmping.app/errors/openai/401/

-----

Canonical path: /status/openai/
Title: Is OpenAI API Down? Status and Error Checks

# Is OpenAI API Down? Status and Error Checks

TL;DR: Check whether an OpenAI API problem looks like an outage, an invalid key, a forbidden model, or a rate limit issue.
Official status: https://status.openai.com/
Public endpoint: https://api.openai.com/v1/models
Run diagnosis: https://llmping.app/#diagnose

## Honest boundary

llmping does not ping user API keys and does not claim fake real-time provider status. Use the checks below to separate outage symptoms from key, quota, permission, and runtime problems.

## Historical uptime

llmping samples provider public endpoints and official status feeds every 5 minutes with a Cloudflare Worker and stores real samples in D1. The status page displays 7, 30, and 90 day public endpoint uptime once samples exist.

Uptime API: https://llmping-uptime-worker.stcatzlee.workers.dev/api/uptime?provider=openai

## Quick checks

- 401 means your key is missing or invalid; it does not prove OpenAI is down.
- 403 means your key is recognized but forbidden for the requested action or model.
- 429 means rate limit or quota pressure; check whether the body says quota or RPM/TPM.
- 5xx across multiple keys and regions is stronger outage evidence.

## cURL test command

```bash
curl -i https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

## Related pages

- https://llmping.app/errors/openai/429/
- https://llmping.app/errors/openai/401/
- https://llmping.app/errors/openai/403/

-----

Canonical path: /status/anthropic/
Title: Is Claude API Down? Anthropic Status and Error Checks

# Is Claude API Down? Anthropic Status and Error Checks

TL;DR: Check whether a Claude or Anthropic API failure looks like provider overload, network reset, invalid auth, or an application-side retry problem.

Official status: https://status.anthropic.com/
Public endpoint: https://api.anthropic.com/v1/messages
Run diagnosis: https://llmping.app/#diagnose

## Honest boundary

llmping does not ping user API keys and does not claim fake real-time provider status. Use the checks below to separate outage symptoms from key, quota, permission, and runtime problems.
## Historical uptime

llmping samples provider public endpoints and official status feeds every 5 minutes with a Cloudflare Worker and stores real samples in D1. The status page displays 7, 30, and 90 day public endpoint uptime once samples exist.

Uptime API: https://llmping-uptime-worker.stcatzlee.workers.dev/api/uptime?provider=anthropic

## Quick checks

- 529 overloaded_error points to provider capacity pressure.
- ECONNRESET is a transport failure and can come from proxies or serverless timeouts.
- 401/403 class errors are normally account, key, or permission issues.
- Compare one short request against one streaming request before blaming the provider.

## cURL test command

```bash
curl -i https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-5-haiku-latest","max_tokens":16,"messages":[{"role":"user","content":"ping"}]}'
```

## Related pages

- https://llmping.app/errors/anthropic/529/
- https://llmping.app/errors/claude/econnreset/
- https://llmping.app/errors/openai/429/

-----

Canonical path: /status/gemini/
Title: Is Gemini API Down? Google AI Status and Fetch Checks

# Is Gemini API Down? Google AI Status and Fetch Checks

TL;DR: Check whether a Gemini API failure is an outage, browser fetch issue, bad endpoint, invalid key, or runtime network problem.

Official status: https://status.cloud.google.com/
Public endpoint: https://generativelanguage.googleapis.com/v1beta/models
Run diagnosis: https://llmping.app/#diagnose

## Honest boundary

llmping does not ping user API keys and does not claim fake real-time provider status. Use the checks below to separate outage symptoms from key, quota, permission, and runtime problems.

## Historical uptime

llmping samples provider public endpoints and official status feeds every 5 minutes with a Cloudflare Worker and stores real samples in D1.
The status page displays 7, 30, and 90 day public endpoint uptime once samples exist.

Uptime API: https://llmping-uptime-worker.stcatzlee.workers.dev/api/uptime?provider=gemini

## Quick checks

- fetch failed is usually a client/runtime transport issue, not a model response.
- 403 or permission errors often mean the API is not enabled or the key is restricted.
- 404 can mean the model string or endpoint method is wrong.
- Check Google Cloud status only after you separate browser CORS/runtime errors from API responses.

## cURL test command

```bash
curl -i "https://generativelanguage.googleapis.com/v1beta/models?key=$GEMINI_API_KEY"
```

## Related pages

- https://llmping.app/errors/gemini/fetch-failed/
- https://llmping.app/errors/openai/401/
- https://llmping.app/errors/claude/econnreset/

-----

Canonical path: /badge/
Title: Embeddable LLM API status badge

# Embeddable LLM API Status Badge

TL;DR: llmping provides static README badges that link back to LLM API error diagnosis pages.

- https://llmping.app/badge/ok.svg
- https://llmping.app/badge/degraded.svg
- https://llmping.app/badge/checking.svg

-----

Canonical path: /leaderboard/
Title: LLM API latency leaderboard

# LLM API Latency Benchmark - May 2026

TL;DR: llmping publishes timestamped LLM API latency rows with provider, model, region, P50, P95, P99, TTFT, tokens per second, sample count, and collection time.
Dataset window: 2026-05-01/2026-05-12
Generated at: 2026-05-12T14:00:00Z
JSON download: https://llmping.app/data/latency-benchmark.json

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o | US East | 342ms | 891ms | 1430ms | 410ms | 72 | 1440 | 2026-05-12T13:55:00Z |
| OpenAI | gpt-4o-mini | US West | 378ms | 936ms | 1518ms | 442ms | 86 | 1440 | 2026-05-12T13:54:00Z |
| Anthropic | claude-3-5-sonnet | US East | 416ms | 1048ms | 1640ms | 492ms | 63 | 1440 | 2026-05-12T13:56:00Z |
| Anthropic | claude-3-haiku | Europe | 536ms | 1280ms | 1984ms | 610ms | 94 | 1440 | 2026-05-12T13:57:00Z |
| Google | gemini-1.5-pro | US East | 458ms | 1165ms | 1880ms | 535ms | 68 | 1440 | 2026-05-12T13:58:00Z |
| Google | gemini-1.5-flash | Asia Pacific | 624ms | 1490ms | 2240ms | 705ms | 102 | 1440 | 2026-05-12T13:59:00Z |
| DeepSeek | deepseek-chat | Singapore | 388ms | 990ms | 1570ms | 456ms | 78 | 1440 | 2026-05-12T14:00:00Z |
| OpenRouter | router-best | Japan | 710ms | 1685ms | 2520ms | 804ms | 55 | 1440 | 2026-05-12T13:52:00Z |
| Groq | llama-3.3-70b | US East | 302ms | 770ms | 1220ms | 360ms | 186 | 1440 | 2026-05-12T13:53:00Z |
| Together AI | mixtral-8x7b | US West | 430ms | 1108ms | 1710ms | 511ms | 112 | 1440 | 2026-05-12T13:51:00Z |

-----

Canonical path: /regions/us-east/
Title: LLM API latency from US East

# LLM API Latency from US East - Real-time Benchmarks

TL;DR: US East developers see 302-458ms median latency in the current llmping benchmark snapshot.

Best provider for US East right now: Groq.

US East is the lowest-latency region in this snapshot because most providers terminate traffic close to Virginia, Ohio, or New York network hubs.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o | US East | 342ms | 891ms | 1430ms | 410ms | 72 | 1440 | 2026-05-12T13:55:00Z |
| Anthropic | claude-3-5-sonnet | US East | 416ms | 1048ms | 1640ms | 492ms | 63 | 1440 | 2026-05-12T13:56:00Z |
| Google | gemini-1.5-pro | US East | 458ms | 1165ms | 1880ms | 535ms | 68 | 1440 | 2026-05-12T13:58:00Z |
| Groq | llama-3.3-70b | US East | 302ms | 770ms | 1220ms | 360ms | 186 | 1440 | 2026-05-12T13:53:00Z |

## Best provider for US East by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Groq llama-3.3-70b | Lowest P50 and highest output speed in this snapshot. |
| General product assistant | OpenAI gpt-4o | Balanced TTFT, tail latency, and broad model capability. |
| Long reasoning response | Anthropic Claude 3.5 Sonnet | Slightly slower first token, but stable long-form throughput. |

## How US East developers can reduce latency

- Host server-side inference callers in us-east when most users are in North America and the provider has a US endpoint.
- Measure TTFT separately from total completion time because streaming chat feels fast only when the first token arrives quickly.
- Keep retry budgets small for chat. A retry that starts after P95 often feels worse than a graceful fallback model.

-----

Canonical path: /regions/us-west/
Title: LLM API latency from US West

# LLM API Latency from US West - Real-time Benchmarks

TL;DR: US West developers see 378-430ms median latency in the current llmping benchmark snapshot.

Best provider for US West right now: OpenAI.

US West is strong for teams deployed on west coast clouds. Cross-country hops add measurable latency, but the tail is still suitable for interactive chat.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o-mini | US West | 378ms | 936ms | 1518ms | 442ms | 86 | 1440 | 2026-05-12T13:54:00Z |
| Together AI | mixtral-8x7b | US West | 430ms | 1108ms | 1710ms | 511ms | 112 | 1440 | 2026-05-12T13:51:00Z |

## Best provider for US West by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | OpenAI gpt-4o-mini | Lowest P50 in the US West sample and strong token speed. |
| Batch processing | Together AI mixtral-8x7b | Higher output speed can beat slightly slower TTFT for long jobs. |
| Cost-sensitive routing | OpenAI mini-class models | Lower latency and smaller model cost usually align. |

## How US West developers can reduce latency

- Do not route west coast user traffic through east coast application servers just to call an LLM API.
- Cache system prompts and retrieval snippets near the worker or serverless region that performs the model call.
- Track provider-specific status codes because network latency and rate limiting look similar in aggregate charts.

-----

Canonical path: /regions/europe/
Title: LLM API latency from Europe

# LLM API Latency from Europe - Real-time Benchmarks

TL;DR: Europe developers see 536-780ms median latency in the current llmping benchmark snapshot.

Best provider for Europe right now: Anthropic.

Europe shows a larger latency spread than US regions. Provider POP selection and data residency controls can matter as much as raw model speed.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| Anthropic | claude-3-haiku | Europe | 536ms | 1280ms | 1984ms | 610ms | 94 | 1440 | 2026-05-12T13:57:00Z |

## Best provider for Europe by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Anthropic Claude 3 Haiku | Fastest European row in the current sample. |
| Compliance-sensitive apps | Provider with explicit EU routing | Regulatory constraints can dominate a 100ms difference. |
| Bulk summarization | Google Gemini Flash | Use a high-throughput model when the job is not interactive. |

## How Europe developers can reduce latency

- Run a Europe-specific leaderboard instead of assuming US latency applies to London, Frankfurt, and Paris.
- Label data residency mode in benchmark metadata because it changes routing and tail latency.
- Measure from the same cloud region that your production API uses, not from a laptop speed test.

-----

Canonical path: /regions/asia-pacific/
Title: LLM API latency from Asia Pacific

# LLM API Latency from Asia Pacific - Real-time Benchmarks

TL;DR: Asia Pacific developers see 624-900ms median latency in the current llmping benchmark snapshot.

Best provider for Asia Pacific right now: Google.

Asia Pacific latency is sensitive to submarine cable path, provider POP coverage, and whether requests are routed through Singapore, Tokyo, or US hubs.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| Google | gemini-1.5-flash | Asia Pacific | 624ms | 1490ms | 2240ms | 705ms | 102 | 1440 | 2026-05-12T13:59:00Z |

## Best provider for Asia Pacific by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Google Gemini Flash | Best P50 in the APAC sample. |
| Global SaaS fallback | OpenRouter | Router abstraction can help when direct provider routing is inconsistent. |
| Throughput-heavy tasks | Google Gemini Flash | Higher output speed reduces total time for larger completions. |

## How Asia Pacific developers can reduce latency

- Keep application servers in the same APAC subregion as most users before optimizing model choice.
- Use streaming responses for chat so users see progress before the full completion arrives.
- Compare direct provider calls with router calls because an extra abstraction can either help or hurt depending on POP placement.

-----

Canonical path: /regions/singapore/
Title: LLM API latency from Singapore

# LLM API Latency from Singapore - Real-time Benchmarks

TL;DR: Singapore developers see 388-760ms median latency in the current llmping benchmark snapshot.

Best provider for Singapore right now: DeepSeek.

Singapore is a practical hub for Southeast Asia workloads. It can be faster than Japan or Australia for region-wide products.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| DeepSeek | deepseek-chat | Singapore | 388ms | 990ms | 1570ms | 456ms | 78 | 1440 | 2026-05-12T14:00:00Z |

## Best provider for Singapore by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | DeepSeek Chat | Lowest Singapore P50 in the sample. |
| Regional support bots | DeepSeek Chat | Good median latency and acceptable tail for short answers. |
| Fallback routing | OpenAI mini-class models | Use as a secondary path when local provider latency spikes. |

## How Singapore developers can reduce latency

- Benchmark from Singapore separately when serving Indonesia, Malaysia, Thailand, Vietnam, or India.
- Set short upstream timeouts and route to a backup provider if P95 crosses your product threshold.
- Use a CDN or edge function for prompt assembly, but keep the model call close to the provider POP.

-----

Canonical path: /regions/japan/
Title: LLM API latency from Japan

# LLM API Latency from Japan - Real-time Benchmarks

TL;DR: Japan developers see 710-980ms median latency in the current llmping benchmark snapshot.

Best provider for Japan right now: OpenRouter.

Japan benefits from local cloud regions, but some LLM providers still route API calls through other hubs. Tail latency needs special attention.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenRouter | router-best | Japan | 710ms | 1685ms | 2520ms | 804ms | 55 | 1440 | 2026-05-12T13:52:00Z |

## Best provider for Japan by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | OpenRouter router-best | Best Japan row in the current snapshot. |
| Customer support automation | Provider with Tokyo routing | Location certainty matters more than brand name. |
| Batch translation | High-throughput flash-class models | Total tokens per second matters more than first token latency. |

## How Japan developers can reduce latency

- Record provider endpoint, cloud region, and measured client region in every benchmark row.
- For Japanese-language workloads, measure response quality and latency together because the fastest model may not be acceptable.
- Use P95 as the product SLO because median latency hides intermittent routing penalties.

-----

Canonical path: /blog/what-is-time-to-first-token/
Title: What Is Time to First Token?

# What Is Time to First Token?

TL;DR: Time to first token is the time between sending an LLM API request and receiving the first generated token. TTFT matters most for chat UX because users judge speed from the first visible response, not from the final token.
Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- TTFT is measured up to the first generated token, while total latency covers the whole completion.
- Streaming can improve perceived speed even when total completion time stays the same.
- The llmping benchmark records TTFT next to P50, P95, P99, output speed, sample count, and collection timestamp.

## TTFT is the response-start metric

Time to first token is the elapsed time from request dispatch to the first token chunk returned by the model endpoint. It includes client network time, provider edge routing, queue time, prompt ingestion, safety checks, and the model's first decode step.

TTFT is not the same as total response time. A model can have a fast first token and slow total completion if it streams slowly. A model can also have a slower first token but finish a short answer quickly because the output rate is high.

| Metric | Starts | Ends | Best use |
| --- | --- | --- | --- |
| TTFT | Request sent | First generated token | Chat perceived speed |
| P50 latency | Request sent | Completed response or measured event | Typical user experience |
| P95 latency | Request sent | Completed response or measured event | SLO and tail monitoring |
| Tokens/sec | First token | Last token | Long completion throughput |

## Why developers should track TTFT

LLM products are usually judged in the first second. If the interface shows no movement, the user reads the product as slow even when the answer eventually arrives. TTFT is the metric that catches this gap.

The metric is especially important for support bots, code assistants, search answer interfaces, and AI copilots. These products do not need the entire answer to feel responsive. They need the first useful text to appear quickly and predictably.
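
The Starts and Ends columns in the metric table above translate directly into arithmetic on timestamps. The Python sketch below is one possible way to compute TTFT and output speed from a single streamed response; `stream_metrics` is a hypothetical helper, timestamps are assumed to be seconds from one monotonic clock, and counting the gaps between tokens (tokens minus one) for output speed is a design choice rather than the method llmping uses.

```python
def stream_metrics(sent_at, token_times):
    """Compute TTFT and output speed from timestamps in seconds.

    `sent_at` is when the request was dispatched; `token_times` holds the
    arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    first, last = token_times[0], token_times[-1]
    ttft = first - sent_at    # Request sent -> first generated token.
    total = last - sent_at    # Request sent -> completed response.
    generation = last - first  # First token -> last token.
    # Assumption: speed counts the gaps between tokens, so one token has none.
    tokens_per_sec = (len(token_times) - 1) / generation if generation > 0 else None
    return {"ttft_s": ttft, "total_s": total, "tokens_per_sec": tokens_per_sec}
```

For example, a request sent at 0.0 with tokens arriving every half second from 0.5 to 2.5 has a TTFT of 0.5 seconds, a total time of 2.5 seconds, and an output speed of 2 tokens per second.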
The public llmping leaderboard exposes TTFT in a native HTML table at https://llmping.app/leaderboard/ and in JSON at https://llmping.app/data/latency-benchmark.json so the metric can be cited by crawlers and reused by developers.

## What makes TTFT slow

The biggest causes are long prompts, cold provider queues, overloaded regions, and cross-region network routing. Retrieval-augmented generation can also hurt TTFT when the application waits for search, reranking, and prompt assembly before it sends the model request.

Model size matters, but it is not the only factor. A smaller model behind a busy route can be slower than a larger model on a well-provisioned path. That is why llmping stores provider, model, region, timestamp, and sample count on each benchmark row.

## How to reduce TTFT

Reduce prompt tokens first. Move stable instructions into provider-supported cached context when available, shorten retrieved snippets, and avoid sending data the model does not need for the first answer.

Run model calls from the region nearest the provider endpoint. If your users are global, benchmark from each production region and route to the model/provider pair with the best P95, not only the best median.

Stream every interactive response. Streaming does not make the model think faster, but it turns TTFT into visible progress and avoids making users wait for the final token.

-----

Canonical path: /blog/p50-vs-p95-vs-p99-latency/
Title: P50 vs P95 vs P99 Latency for LLM APIs

# P50 vs P95 vs P99 Latency for LLM APIs

TL;DR: P50 is the median request. P95 is the slow request that one in twenty users sees. P99 is the very slow request that breaks trust during production incidents.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- P50 answers the question: what does a typical request feel like?
- P95 answers the question: what do regular unlucky users experience?
- P99 answers the question: how bad are the worst successful requests before errors begin?

## Percentiles are a distribution, not a score

LLM latency is not a single number. It is a distribution shaped by network distance, provider load, prompt length, model selection, and output length. Percentiles summarize that distribution without pretending every request behaves the same.

P50 means half of requests are faster and half are slower. P95 means 95 percent of requests are faster and 5 percent are slower. P99 means 99 percent of requests are faster and 1 percent are slower.

| Percentile | Plain-English meaning | Product decision |
| --- | --- | --- |
| P50 | Typical request | Choose a default model for normal UX |
| P95 | One request in twenty | Set SLOs and fallback thresholds |
| P99 | One request in one hundred | Detect tail risk and incident behavior |

## P50 is useful but incomplete

P50 is the easiest metric to communicate because it describes the middle request. It is a good first filter when comparing providers, especially for low-stakes prototypes.

P50 becomes dangerous when it is the only metric. A provider can show a strong median while still producing frequent multi-second stalls. Those stalls are what users remember, and they are exactly what a tail percentile exposes.

## P95 should drive production routing

P95 is the practical latency metric for production routing. It is sensitive enough to catch bad user experience, but not so extreme that a single transient spike dominates the dashboard.

A chat product can often tolerate a slightly higher P50 if P95 is stable. A consistent 550ms first token is usually easier to design around than a 330ms median with a 4 second tail.

llmping region pages use the same idea. The best provider for a region should be chosen from the whole row: TTFT, P50, P95, P99, tokens per second, sample count, and timestamp.
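The percentile definitions above can be sketched in a few lines. This is a minimal example that assumes the nearest-rank method, which is one common convention and not necessarily the one llmping uses. The `percentile` helper and both sample sets are hypothetical, built to mirror the 550ms versus 330ms-with-a-4-second-tail comparison.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds, not benchmark data.
steady = [550] * 100               # consistent first token
spiky = [330] * 94 + [4000] * 6    # better median, 6% of requests stall

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, {f"P{p}": percentile(samples, p) for p in (50, 95, 99)})
```

The spiky set wins on P50 and loses badly on P95 and P99, which is the case a median-only comparison would miss.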
## P99 is incident evidence

P99 is where provider queues, network rerouting, overloaded endpoints, and timeout bugs become visible. It is not always the routing metric, but it is the metric to check when users say the product sometimes hangs.

When P99 rises but P50 stays flat, the system is not generally slow. It is inconsistent. That calls for retries, circuit breakers, fallback models, or regional routing rather than a blanket model swap.

-----
Canonical path: /blog/why-llm-latency-varies-by-region/
Title: Why LLM API Latency Varies by Region

# Why LLM API Latency Varies by Region

TL;DR: LLM latency varies by region because the request path changes. The same model can be fast from US East, acceptable from Europe, and slow from Japan if the provider routes traffic through a distant serving region.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- Region is a first-class benchmark dimension, not a dashboard filter.
- Cloud region, user geography, provider POP, and model serving location are separate variables.
- The llmping region pages expose per-region rows so developers can cite local latency instead of global averages.

## The request path is usually longer than it looks

An LLM API call starts in your application region, crosses one or more networks, reaches a provider edge, enters the provider control plane, waits for model capacity, and then streams tokens back over the same general path. Every hop can change by region.

A developer in Tokyo calling an API from a Tokyo server may still hit a provider path that terminates in Singapore or the United States. The endpoint hostname does not prove where inference happens.

## Provider POP coverage differs

Some providers have strong North America coverage and limited APAC coverage. Others have better Singapore routing than Tokyo routing. A router provider can improve the path in one region and add overhead in another.
That is why a credible leaderboard should store provider, model, region, timestamp, and sample count in every row. A global average hides the facts that matter for production deployment.

| Variable | Why it matters | What to record |
| --- | --- | --- |
| Client region | Defines the first network leg | Cloud region or probe city |
| Provider route | Controls POP and queue path | Provider and endpoint |
| Model | Changes queue and decode speed | Exact model identifier |
| Timestamp | Latency changes over time | ISO collection time |

## Data residency can change latency

Enterprise settings, EU-only processing modes, and provider compliance routes can move traffic away from the default low-latency path. The result is often a better compliance posture with a different latency profile.

Benchmarking without recording those settings creates unusable evidence. The same provider and model can produce different numbers under different routing policies.

## How to use regional benchmarks

Benchmark from the same region your production service uses. If your app runs in Cloudflare Workers, Vercel, Fly.io, AWS, GCP, or Azure, use probes that represent the actual caller location.

Choose a routing policy from P95 and TTFT first. P50 is useful for marketing, but P95 is closer to what your support inbox hears about.

For global products, keep a small model matrix by region. The best model for US East is not automatically the best model for Singapore, Europe, or Japan.

-----
Canonical path: /blog/streaming-vs-batch-llm-api/
Title: Streaming vs Batch LLM API Latency

# Streaming vs Batch LLM API Latency

TL;DR: Streaming is best when a human is waiting. Batch mode is best when a job queue is waiting. Measure TTFT for streaming UX and total tokens per second for batch throughput.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- Streaming reduces perceived latency by showing the first token before the full response is complete.
- Batch mode can be simpler and more efficient for offline tasks because the caller only needs the final result.
- A benchmark should report TTFT and tokens/sec separately because the fastest first token is not always the fastest full completion.

## Streaming optimizes perceived speed

A streaming API returns partial output as the model generates it. The user sees the answer begin after TTFT instead of waiting for the last token. This is the right default for chat, copilots, code generation, support agents, and search answer pages.

Streaming does not remove model work. It changes when the interface can show progress. That is why TTFT is the critical streaming metric and total completion time is the secondary metric.

## Batch optimizes operational throughput

Batch mode returns a complete response after generation finishes. It is often easier to retry, store, validate, and bill because the response is a single unit.

Batch is a good fit for summarization queues, nightly classification jobs, embedding-adjacent enrichment, translation pipelines, and report generation. In those workflows, total time and tokens per second matter more than first token latency.

| Workload | Preferred mode | Primary metric |
| --- | --- | --- |
| Chat assistant | Streaming | TTFT and P95 TTFT |
| Code autocomplete | Streaming | TTFT |
| Document summarization queue | Batch | Total time |
| Large report generation | Batch or streaming preview | Tokens/sec |

## The same model can rank differently

A model with the lowest TTFT may not have the highest output speed. That model will feel best for short chat turns but may finish long reports behind a model with slower first token and faster sustained decoding.

For this reason llmping publishes TTFT, P50, P95, P99, and tokens per second in the same table. Comparing one number at a time creates bad routing decisions.

## A practical routing rule

Use streaming for any request where a user is actively watching the interface.
Use batch for work that can sit behind a queue, webhook, or scheduled job.

For hybrid workflows, stream a short plan or progress message first and run the heavier generation in the background. This keeps perceived latency low without forcing every token through a user-facing channel.

-----
Canonical path: /blog/measuring-llm-latency-correctly/
Title: How to Measure LLM Latency Correctly

# How to Measure LLM Latency Correctly

TL;DR: Measure LLM latency with fixed prompts, enough repeated samples, regional probes, separate TTFT and total time, and percentile tables. Do not compare providers from one laptop run.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- Every benchmark row needs provider, model, region, prompt class, sample count, and collection timestamp.
- TTFT requires streaming instrumentation or a response reader that records the first chunk time.
- P95 and P99 are more useful than averages because LLM latency is skewed by queueing and network tails.

## Use a stable benchmark prompt

The prompt must be fixed for comparable runs. Prompt length, tool instructions, retrieval context, and max output tokens all affect latency. If the prompt changes, the benchmark is measuring a different workload.

Use at least two prompt classes for production decisions: a short chat prompt that measures interactive latency and a longer generation prompt that measures sustained throughput.

## Instrument streaming correctly

TTFT is recorded when the first generated token or content delta is received by the caller. It should not be recorded when headers arrive unless the provider sends useful token content in the same event.

Total time is recorded when the stream closes or the final response body is read. Tokens per second should normally be calculated from first token to last token, not from request start, because TTFT and decode speed are separate behaviors.
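The timing rules above can be sketched as a small stream reader. This is an illustration under stated assumptions, not the llmping probe: `time_stream` and `StreamTiming` are hypothetical names, `chunks` stands for whatever iterator of content deltas your HTTP or SDK client yields, and each non-empty chunk is counted as one token, which real providers do not guarantee. Capture `request_start` with the same clock immediately before dispatching the request.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: float                    # request start -> first content chunk
    total_s: float                   # request start -> stream closed
    tokens_per_sec: Optional[float]  # decode rate between first and last chunk


def time_stream(
    chunks: Iterable[str],
    request_start: float,
    clock: Callable[[], float] = time.monotonic,
) -> StreamTiming:
    first_token = None
    last_token = None
    token_count = 0
    for chunk in chunks:
        if not chunk:
            continue  # empty deltas and keep-alives are not content
        now = clock()
        if first_token is None:
            first_token = now
        last_token = now
        token_count += 1  # assumption: one chunk is roughly one token
    request_end = clock()  # stream closed
    if first_token is None:
        raise ValueError("stream produced no content")
    decode_s = last_token - first_token
    # Rate over the first-to-last window: chunks received after the first one.
    rate = (token_count - 1) / decode_s if decode_s > 0 else None
    return StreamTiming(first_token - request_start, request_end - request_start, rate)
```

Injecting `clock` keeps the function testable with a fake clock. In a real probe the default `time.monotonic` avoids wall-clock adjustments skewing the measurement.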
| Timestamp | Definition | Used for |
| --- | --- | --- |
| requestStart | Fetch or HTTP request begins | Network and total time baseline |
| firstToken | First content token arrives | TTFT |
| lastToken | Final content token arrives | Tokens/sec and total time |
| requestEnd | Stream closes | Total latency and error handling |

## Run regional probes

A laptop benchmark is a useful smoke test, not production evidence. Real applications call LLM APIs from servers, workers, or edge functions. Measure from those locations.

When an app serves multiple markets, run probes from multiple regions and publish each region separately. A single global number is too vague for routing decisions and too vague for AI citation.

## Publish timestamped rows

LLM API latency changes over time. Providers add capacity, change routing, suffer incidents, and release new model versions. A benchmark without a timestamp is not reusable evidence.

llmping uses native HTML tables with data attributes so crawlers and scripts can extract provider, model, region, P50, P95, P99, TTFT, sample count, and collection time without running JavaScript.

-----
Canonical path: /blog/openai-vs-anthropic-vs-google-latency-comparison/
Title: OpenAI vs Anthropic vs Google LLM Latency Comparison

# OpenAI vs Anthropic vs Google LLM Latency Comparison

TL;DR: OpenAI, Anthropic, and Google latency comparisons are only useful when model, region, prompt, and percentile are fixed. The best provider changes by workload and geography.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.app/leaderboard/

## Key facts

- OpenAI is strong in the US East snapshot for balanced interactive latency.
- Anthropic rows show competitive long-form behavior but should be evaluated with P95, not median alone.
- Google flash-class models can be attractive for throughput-heavy workloads, especially when regional routing is favorable.
## Do not compare brand names without workload

The question is not whether OpenAI, Anthropic, or Google is universally faster. The useful question is which provider and model is faster for a specific workload, region, prompt size, and response length.

A short customer-support answer and a long code review stress different parts of the system. The first is dominated by TTFT. The second is shaped by output speed and tail latency.

## Read the current snapshot as directional evidence

In the llmping May 2026 snapshot, OpenAI gpt-4o in US East reports a 342ms P50 row, Anthropic Claude 3.5 Sonnet in US East reports a 416ms P50 row, and Google Gemini 1.5 Pro in US East reports a 458ms P50 row.

Those rows should not be treated as permanent provider rankings. They are timestamped measurements. The correct use is to compare them with current production probes and watch how the spread changes over time.

| Provider | Representative row | P50 | P95 | TTFT |
| --- | --- | --- | --- | --- |
| OpenAI | gpt-4o, US East | 342ms | 891ms | 410ms |
| Anthropic | Claude 3.5 Sonnet, US East | 416ms | 1048ms | 492ms |
| Google | Gemini 1.5 Pro, US East | 458ms | 1165ms | 535ms |

## Choose by product constraint

For real-time chat, start with TTFT and P95 TTFT. For batch generation, start with total time and tokens per second. For regulated workloads, include data residency and provider policy before latency.

Many production systems should route across multiple providers. A primary provider can serve normal traffic, while a backup provider handles regional spikes, rate-limit events, or model-specific incidents.

## What to monitor after launch

Track provider status, HTTP error class, timeout rate, TTFT, P95, P99, output tokens, and tokens per second. Store these metrics by model and region so incidents are diagnosable.

Update comparison pages when the dataset changes.
AI systems cite pages that expose fresh, structured, source-like facts more readily than pages that make timeless but unsupported claims.
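
Storing metrics by model and region, as recommended above, can be sketched as a simple grouping step. This is a minimal illustration, assuming each request is logged as a dictionary with `model`, `region`, and `outcome` fields. Those field names, the `summarize` helper, and the sample values are hypothetical and not an llmping schema.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Group request records by (model, region) and count outcomes per group."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[(rec["model"], rec["region"])][rec["outcome"]] += 1

    summary = {}
    for key, outcomes in groups.items():
        total = sum(outcomes.values())
        summary[key] = {
            "requests": total,
            "timeout_rate": outcomes["timeout"] / total,  # Counter yields 0 if absent
            "outcomes": dict(outcomes),                   # e.g. ok, timeout, http_429
        }
    return summary


# Hypothetical request log for illustration only.
records = [
    {"model": "model-a", "region": "us-east", "outcome": "ok"},
    {"model": "model-a", "region": "us-east", "outcome": "timeout"},
    {"model": "model-a", "region": "us-east", "outcome": "http_429"},
    {"model": "model-a", "region": "us-east", "outcome": "ok"},
    {"model": "model-a", "region": "ap-southeast", "outcome": "ok"},
]

for key, stats in summarize(records).items():
    print(key, stats)
```

Keeping the key as a (model, region) pair means a timeout spike in one region stays visible instead of being averaged into a global rate.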