<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Realtime API on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/realtime-api/</link>
        <description>Recent content in Realtime API on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 09 May 2026 10:58:47 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/realtime-api/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>OpenAI&#39;s New Realtime Voice Models: GPT-Realtime-2, Live Translation, and Streaming Transcription</title>
        <link>https://www.knightli.com/en/2026/05/09/openai-realtime-voice-models-gpt-realtime-2-translate-whisper/</link>
        <pubDate>Sat, 09 May 2026 10:58:47 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/09/openai-realtime-voice-models-gpt-realtime-2-translate-whisper/</guid>
        <description>&lt;p&gt;On May 7, 2026, OpenAI introduced a new generation of voice models for the Realtime API. The point is not only to make AI sound more natural, but to let voice agents understand, reason, call tools, translate, and transcribe during a live conversation.&lt;/p&gt;
&lt;p&gt;The update includes three models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPT-Realtime-2&lt;/code&gt;: the main model for realtime voice agents, with stronger reasoning, tool calling, and longer context.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPT-Realtime-Translate&lt;/code&gt;: a live speech translation model that supports 70+ input languages and 13 output languages.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPT-Realtime-Whisper&lt;/code&gt;: a low-latency streaming speech-to-text model for captions, meeting notes, and realtime workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If early voice assistants were mostly “ask once, answer once,” this release moves closer to a voice interface that can listen and act at the same time.&lt;/p&gt;
&lt;h2 id=&#34;gpt-realtime-2-the-main-model-for-voice-agents&#34;&gt;GPT-Realtime-2: the main model for voice agents
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;GPT-Realtime-2&lt;/code&gt; is built for live voice interactions. It does not just answer questions: it has to keep context while the user speaks, changes direction, interrupts, or adds constraints, and then call tools when needed.&lt;/p&gt;
&lt;p&gt;Officially highlighted capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Short preambles before a response, such as “let me check that,” so users know the system is working.&lt;/li&gt;
&lt;li&gt;Parallel tool calls for calendars, search, orders, support systems, and other multi-tool workflows.&lt;/li&gt;
&lt;li&gt;More natural recovery behavior when something fails.&lt;/li&gt;
&lt;li&gt;A context window increased from 32K to 128K for longer conversations and more complex task flows.&lt;/li&gt;
&lt;li&gt;Better retention of specialized terminology, proper nouns, and medical vocabulary.&lt;/li&gt;
&lt;li&gt;More controllable tone and delivery, such as calm, empathetic, confirmational, or upbeat responses.&lt;/li&gt;
&lt;li&gt;Adjustable reasoning effort: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, and &lt;code&gt;xhigh&lt;/code&gt;, with &lt;code&gt;low&lt;/code&gt; as the default.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means developers can use voice agents in more demanding products, not only simple Q&amp;amp;A. A support agent can listen while checking an order; a travel app can give next steps after a flight change; a real estate assistant can filter listings and schedule a tour from spoken requirements.&lt;/p&gt;
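&lt;p&gt;As a rough sketch, the tool-calling and reasoning-effort settings above might be configured through a &lt;code&gt;session.update&lt;/code&gt; event. The field names, the tool schema, and the model string &lt;code&gt;gpt-realtime-2&lt;/code&gt; below are assumptions for illustration, not confirmed API details:&lt;/p&gt;

```python
import json

# Hypothetical sketch of a Realtime API session configuration.
# Field names ("reasoning", "effort") and the model string are assumptions.
def build_session_update(model, effort, tools):
    """Build a session.update event payload as a JSON string."""
    event = {
        "type": "session.update",
        "session": {
            "model": model,
            "reasoning": {"effort": effort},  # minimal / low / medium / high / xhigh
            "tools": tools,
        },
    }
    return json.dumps(event)

# An example tool a support agent could call while the user keeps talking.
order_lookup_tool = {
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch order status by order id.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

payload = build_session_update("gpt-realtime-2", "low", [order_lookup_tool])
```

&lt;p&gt;In a real agent, a payload like this would be sent over the session connection before audio starts streaming.&lt;/p&gt;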
&lt;h2 id=&#34;live-translation-for-multilingual-voice-products&#34;&gt;Live translation for multilingual voice products
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;GPT-Realtime-Translate&lt;/code&gt; is designed for live speech translation. People can speak in their own language while the other side hears translated speech and sees realtime transcripts.&lt;/p&gt;
&lt;p&gt;Clear use cases include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multilingual customer support.&lt;/li&gt;
&lt;li&gt;Cross-border sales and pre-sales conversations.&lt;/li&gt;
&lt;li&gt;Online education and live events.&lt;/li&gt;
&lt;li&gt;International meetings and hosting.&lt;/li&gt;
&lt;li&gt;Creator and video platform localization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The hard part of live translation is not only accuracy. It also requires low latency, natural pauses, tone preservation, accent handling, and domain vocabulary. OpenAI is emphasizing cross-language conversations that feel closer to natural speech, instead of waiting for an entire segment before translation begins.&lt;/p&gt;
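&lt;p&gt;One plausible way to serve a mixed-language audience is one translation session per output language, so each listener hears their own language. This is an illustrative sketch only; the model string &lt;code&gt;gpt-realtime-translate&lt;/code&gt; and the &lt;code&gt;output_language&lt;/code&gt; field are assumptions, not documented parameters:&lt;/p&gt;

```python
import json

# Hypothetical sketch: one session configuration per target language,
# e.g. for a live event with Spanish, Japanese, and German listeners.
def translation_sessions(output_languages):
    """Return one session.update payload per target language."""
    payloads = []
    for lang in output_languages:
        event = {
            "type": "session.update",
            "session": {
                "model": "gpt-realtime-translate",
                "output_language": lang,  # assumed field name
            },
        }
        payloads.append(json.dumps(event))
    return payloads

sessions = translation_sessions(["es", "ja", "de"])
```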
&lt;h2 id=&#34;streaming-transcription-voice-content-enters-workflows-immediately&#34;&gt;Streaming transcription: voice content enters workflows immediately
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;GPT-Realtime-Whisper&lt;/code&gt; is the new streaming speech-to-text model. Its value is turning speech into usable text while it is happening, instead of waiting for a recording to finish.&lt;/p&gt;
&lt;p&gt;Common applications include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Live meeting captions.&lt;/li&gt;
&lt;li&gt;Classroom and broadcast captions.&lt;/li&gt;
&lt;li&gt;Realtime meeting notes.&lt;/li&gt;
&lt;li&gt;Continuous dictation input for voice agents.&lt;/li&gt;
&lt;li&gt;Follow-up workflows in support, healthcare, recruiting, sales, and other high-volume voice scenarios.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For products, streaming transcription shortens the time from spoken words to actionable text. Captions appear faster, notes can be generated during the conversation, and downstream workflows such as summaries, task extraction, and CRM updates can start earlier.&lt;/p&gt;
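&lt;p&gt;The streaming flow above can be sketched as a small event fold: partial transcript deltas build the current caption line, and a done event finalizes it so downstream steps such as notes and summaries can start on finished lines. The event names &lt;code&gt;transcript.delta&lt;/code&gt; and &lt;code&gt;transcript.done&lt;/code&gt; are assumptions modeled on the Realtime API&#39;s delta/done event style:&lt;/p&gt;

```python
# Hypothetical sketch: folding streaming transcription events into captions.
# Event names are assumptions modeled on the delta/done pattern.
def apply_event(captions, event):
    """Fold one server event into a running list of caption lines."""
    if event["type"] == "transcript.delta":
        # Start a new line if there is none yet or the last one is final.
        if not captions or captions[-1]["final"]:
            captions.append({"text": "", "final": False})
        captions[-1]["text"] += event["delta"]
    elif event["type"] == "transcript.done":
        captions[-1]["final"] = True
    return captions

lines = []
for ev in [
    {"type": "transcript.delta", "delta": "Good morning, "},
    {"type": "transcript.delta", "delta": "everyone."},
    {"type": "transcript.done"},
    {"type": "transcript.delta", "delta": "First item:"},
]:
    lines = apply_event(lines, ev)
# lines now holds one finalized caption and one still-streaming caption
```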
&lt;h2 id=&#34;pricing-and-availability&#34;&gt;Pricing and availability
&lt;/h2&gt;&lt;p&gt;All three models are available in the Realtime API. Official pricing is:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Price&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;GPT-Realtime-2&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Audio input $32 / 1M tokens, cached input $0.40 / 1M tokens, audio output $64 / 1M tokens&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;GPT-Realtime-Translate&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;$0.034 / minute&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;GPT-Realtime-Whisper&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;$0.017 / minute&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
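&lt;p&gt;For budgeting, the token-based and per-minute prices combine into a quick estimate. Note that only the translation and transcription rates are flat; &lt;code&gt;GPT-Realtime-2&lt;/code&gt; is billed per audio token, and the 600-tokens-per-minute figure below is only an illustrative guess, since actual audio token rates vary:&lt;/p&gt;

```python
# Cost math using the listed prices (no caching assumed).
AUDIO_IN_PER_M = 32.00    # USD per 1M audio input tokens
AUDIO_OUT_PER_M = 64.00   # USD per 1M audio output tokens

def realtime2_cost(input_tokens, output_tokens):
    """Dollar cost of a GPT-Realtime-2 session from raw token counts."""
    return (input_tokens * AUDIO_IN_PER_M + output_tokens * AUDIO_OUT_PER_M) / 1_000_000

def whisper_cost(minutes):
    """Streaming transcription at the flat per-minute rate."""
    return minutes * 0.017

# A 10-minute call, assuming roughly 600 audio tokens per minute each way:
session_cost = realtime2_cost(6_000, 6_000)  # 0.192 + 0.384 = 0.576 USD
caption_cost = whisper_cost(10)              # about 0.17 USD
```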
&lt;p&gt;OpenAI also says the Realtime API supports EU Data Residency and is covered by its enterprise privacy commitments. For European businesses or products with data residency requirements, that is worth evaluating separately.&lt;/p&gt;
&lt;h2 id=&#34;what-this-means-for-developers&#34;&gt;What this means for developers
&lt;/h2&gt;&lt;p&gt;The key shift is that voice is becoming part of the product interaction layer, not just an input/output layer.&lt;/p&gt;
&lt;p&gt;In many earlier voice features, speech was converted to text, a text model produced a response, and that response was converted back into speech. The hard part was the middle layer: intent understanding, interruption handling, context tracking, tool calls, tool transparency, and graceful recovery.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;GPT-Realtime-2&lt;/code&gt; tries to move more of that capability directly into the realtime voice model. For developers, the question is not only answer quality, but whether the model can support sustained conversations and multi-step tasks.&lt;/p&gt;
&lt;p&gt;Products that are especially worth testing include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Customer support voice agents.&lt;/li&gt;
&lt;li&gt;In-car and mobile voice assistants.&lt;/li&gt;
&lt;li&gt;Travel, booking, real estate, finance, and other services that need conversation plus lookup.&lt;/li&gt;
&lt;li&gt;Multilingual meetings and cross-border communication tools.&lt;/li&gt;
&lt;li&gt;Live captions, meeting notes, and call quality systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;safety-and-disclosure-still-matter&#34;&gt;Safety and disclosure still matter
&lt;/h2&gt;&lt;p&gt;OpenAI says the Realtime API includes multiple safety layers, such as active classifiers over sessions and the ability to stop policy-violating conversations. Developers can also add their own guardrails through the Agents SDK.&lt;/p&gt;
&lt;p&gt;One easily missed requirement is disclosure: developers should make it clear when end users are interacting with AI, unless that is already obvious from the context.&lt;/p&gt;
&lt;p&gt;This matters in support, sales, education, healthcare, and similar scenarios. The more natural voice becomes, the more important product boundaries become: users should know they are talking to AI, and understand when speech may be recorded, transcribed, or used to trigger tools.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;OpenAI&amp;rsquo;s Realtime API update moves live voice from “can listen and speak” toward “can listen while working through tasks.”&lt;/p&gt;
&lt;p&gt;&lt;code&gt;GPT-Realtime-2&lt;/code&gt; handles more complex voice agents, &lt;code&gt;GPT-Realtime-Translate&lt;/code&gt; handles live cross-language communication, and &lt;code&gt;GPT-Realtime-Whisper&lt;/code&gt; handles low-latency transcription. Together, they cover the three basic capabilities most voice products need: conversation, translation, and transcription.&lt;/p&gt;
&lt;p&gt;If you are building support, in-car, meeting, education, cross-border communication, or mobile voice assistant products, this release is worth testing. The important question is not only whether the model sounds natural, but how it performs in long conversations, interruptions, tool calls, failure recovery, and cost control.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAI: Advancing voice intelligence with new models in the API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
