Latency optimization

Learn how to optimize text-to-speech latency.

This guide covers the core principles for improving text-to-speech latency.

While there are many individual techniques, we’ll group them into four principles.

Four principles

  1. Use Flash models
  2. Leverage streaming
  3. Consider geographic proximity
  4. Choose appropriate voices

Enterprise customers benefit from increased concurrency limits and priority access to our rendering queue. Contact sales to learn more about our enterprise plans.

Use Flash models

Flash models deliver ~75ms inference speeds, making them ideal for real-time applications. The trade-off is a slight reduction in audio quality compared to Multilingual v2.

75ms refers to model inference time only. Actual end-to-end latency will vary with factors such as your geographic location and the endpoint type used.
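As a rough mental model of why inference time and end-to-end latency differ, you can think of total latency as inference plus network and connection overhead. The overhead figures below are illustrative placeholders, not measured values:

```python
# Illustrative latency budget for a single text-to-speech request.
# All overhead numbers are hypothetical placeholders; measure your own.
def end_to_end_latency_ms(
    inference_ms: float = 75.0,    # Flash model inference time
    network_rtt_ms: float = 40.0,  # round trip to the serving region (varies)
    overhead_ms: float = 10.0,     # TLS, session, and queueing overhead (varies)
) -> float:
    """Return a rough end-to-end latency estimate in milliseconds."""
    return inference_ms + network_rtt_ms + overhead_ms


print(end_to_end_latency_ms())                       # 125.0 with placeholder values
print(end_to_end_latency_ms(network_rtt_ms=150.0))   # 235.0 for a distant region
```

The takeaway: the farther you are from the serving region, the more the network term dominates, which is why geographic proximity is one of the four principles.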

Leverage streaming

There are three types of text-to-speech endpoints available in our API Reference:

  • Regular endpoint: Returns a complete audio file in a single response.
  • Streaming endpoint: Returns audio chunks progressively using Server-sent events.
  • Websockets endpoint: Enables bidirectional streaming for real-time audio generation.

Streaming

Streaming endpoints progressively return audio as it is generated, reducing time-to-first-byte. This endpoint is recommended when the full input text is available up front.

Streaming is supported for the Text to Speech API, Voice Changer API, and Audio Isolation API.
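A minimal streaming sketch using only the standard library is shown below. The endpoint path, payload fields, and output filename are assumptions for illustration; check the API Reference for the exact request shape:

```python
import json
import os
import urllib.request


def iter_chunks(stream, size: int = 4096):
    """Yield successive byte chunks from a readable binary stream."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    # Hypothetical endpoint path and payload; consult the API Reference.
    req = urllib.request.Request(
        "https://api.www.11labs.ru/v1/text-to-speech/VOICE_ID/stream",
        data=json.dumps({"text": "Hello!", "model_id": "eleven_flash_v2_5"}).encode(),
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open("output.mp3", "wb") as f:
        for chunk in iter_chunks(resp):
            f.write(chunk)  # audio is usable before the response completes
```

Because chunks are written as they arrive, playback can begin long before the full audio file has been generated.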

Websockets

The text-to-speech websocket endpoint supports bidirectional streaming, making it well suited for applications with real-time text input (e.g. LLM outputs).

Setting auto_mode to true automatically handles generation triggers, removing the need to manually manage chunk strategies.

If auto_mode is disabled, the model will wait for enough text to match the chunk schedule before starting to generate audio.

For instance, if you set a chunk schedule of 125 characters but only 50 arrive, the model stalls until additional characters come in—potentially increasing latency.
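The buffering behavior described above can be sketched as a simple threshold check. This is a simplified model for intuition, not the actual server implementation; the function names and schedule semantics are assumptions:

```python
from typing import List


def next_threshold(schedule: List[int], chunks_sent: int) -> int:
    """Return the character threshold for the next audio chunk.

    Once the schedule is exhausted, the last entry keeps applying.
    """
    index = min(chunks_sent, len(schedule) - 1)
    return schedule[index]


def ready_to_generate(buffered_chars: int, schedule: List[int], chunks_sent: int) -> bool:
    """True once enough text has accumulated to trigger generation."""
    return buffered_chars >= next_threshold(schedule, chunks_sent)


# With a 125-character first threshold, 50 buffered characters stall:
print(ready_to_generate(50, [125, 250], chunks_sent=0))   # False: generation waits
print(ready_to_generate(130, [125, 250], chunks_sent=0))  # True: first chunk fires
```

With auto_mode enabled, you do not need to reason about this schedule at all; the service decides when to trigger generation.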

For implementation details, see the text-to-speech websocket guide.

Choose appropriate voices

We have observed that in some cases, voice selection can impact latency. Here’s the order from fastest to slowest:

  1. Default voices (formerly premade), Synthetic voices, and Instant Voice Clones (IVC)
  2. Professional Voice Clones (PVC)

Higher audio quality output formats can increase latency. Be sure to balance your latency requirements with audio fidelity needs.

We are actively working on optimizing PVC latency for Flash v2.5.

Consider geographic proximity

We serve our models from multiple regions to optimize latency based on your geographic location.

You can check which backend region is serving your request by inspecting the x-region header in the API response. Currently used regions include the USA, the Netherlands, and Singapore.
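For example, you can read the header from any API response. The endpoint path below is a placeholder, and note that header-name matching should be case-insensitive:

```python
import os
import urllib.request
from typing import Mapping, Optional


def region_of(headers: Mapping[str, str]) -> Optional[str]:
    """Return the x-region header value, matching the name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "x-region":
            return value
    return None


if __name__ == "__main__":
    # Any authenticated request works; this path is a placeholder.
    req = urllib.request.Request(
        "https://api.www.11labs.ru/v1/models",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    )
    with urllib.request.urlopen(req) as resp:
        print(region_of(dict(resp.headers.items())))
```

If the reported region is far from your users, consider the routing and data residency options below.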

Enterprise customers can use our dedicated EU and India data residency environments, which guarantee server location while maintaining low latency. Contact your sales representative to get onboarded to our data residency infrastructure.

To opt-out of the global routing and always use USA servers, use the api.us.www.11labs.ru base URL for your API requests:

```python
import os

from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    base_url="https://api.us.www.11labs.ru",
)
```

The global servers were previously opt-in via the api-global-preview.www.11labs.ru base URL. This is no longer necessary, as global routing is now the default behavior. Please update your applications to use api.www.11labs.ru instead.