---
title: "Amazon Nova Sonic — one bidirectional stream that listens and speaks at the same time"
date: 2026-06-14
service: "Amazon Nova"
component: "Nova Sonic"
tags: [amazon-nova, nova-sonic, speech-to-speech, bidirectional-streaming, invoke-model-with-bidirectional-stream, lpcm, barge-in, voice-ai, asr, tool-use, rag, agentic, bedrock-runtime, http2, sigv4, real-time-voice]
source: https://docs.aws.amazon.com/nova/latest/userguide/speech.html
verified_on: 2026-06-14
url: https://vanemmerik.ai/aws-ai/2026-06-14.html
---

# AWS AI Daily · 2026-06-14

## Amazon Nova Sonic — speech in, speech out, over one open connection

Most voice stacks chain three boxes: speech-to-text, then an LLM, then text-to-speech. Every hop adds latency and throws away prosody — the model never hears *how* you said it. **Amazon Nova Sonic** collapses the chain into a single speech-to-speech foundation model on Amazon Bedrock. You open one persistent bidirectional stream, push raw microphone audio in as it's captured, and the model streams spoken audio back while you're still talking — understanding and generation in one model, one connection, no orchestration glue.

    # The whole conversation runs over one call
    stream = await client.invoke_model_with_bidirectional_stream(
        InvokeModelWithBidirectionalStreamOperationInput(
            model_id="amazon.nova-sonic-v1:0"))

≈ 7 min read · Amazon Nova · Nova Sonic

## 01 · What it is, in one breath

Nova Sonic is a model that "provides real-time, conversational interactions through bidirectional audio streaming," processing and responding to "real-time speech as it occurs." The docs frame it as a **unified speech understanding and generation architecture** — the same model that hears the audio also produces the reply, rather than a transcribe-then-synthesize pipeline.

That unification buys things a pipeline can't. Because the model hears the audio directly, it offers **adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech**: for instance, matching the energy of an excited caller or slowing down for a hesitant one. It also gives you **graceful handling of user interruptions without dropping conversational context**, so the caller can talk over the model and the thread survives.

## 02 · The bidirectional contract

Nova Sonic uses the **`InvokeModelWithBidirectionalStream`** API. Unlike a request-response call, it "maintains an open channel for continuous audio streaming in both directions." The architecture is explicitly event-driven: client and model exchange **structured JSON events** over a persistent connection, with three things happening at once on the wire:

- continuous audio streaming from the user to the model,
- concurrent speech processing and generation, and
- real-time model responses without waiting for complete utterances.

The transport is HTTP/2 — the SDK examples default to it — authenticated with SigV4 against the `bedrock-runtime` endpoint. The bidirectional API is supported across the AWS SDKs for .NET, C++, Java, JavaScript, Kotlin, Ruby, Rust, and Swift; Python developers get a dedicated experimental SDK for the streaming calls.
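To make the "both directions at once" idea concrete, here is a minimal sketch that uses plain `asyncio` queues in place of the real stream. Nothing in it touches Bedrock: `fake_model`, `send_frames`, `receive`, and `converse` are hypothetical names invented for illustration, and the stand-in model simply answers each frame as it arrives. The point is only the shape: sending and receiving run as concurrent tasks rather than as one request followed by one response.

```python
import asyncio


async def send_frames(outbound: asyncio.Queue, frames: list) -> None:
    """Push frames as they are 'captured', yielding between them."""
    for frame in frames:
        await outbound.put(frame)
        await asyncio.sleep(0)  # give the other tasks a turn, as a mic loop would
    await outbound.put(None)    # sentinel: no more input


async def fake_model(outbound: asyncio.Queue, inbound: asyncio.Queue) -> None:
    """Hypothetical stand-in for the service: reply to each frame immediately."""
    while (frame := await outbound.get()) is not None:
        await inbound.put(f"reply-to-{frame}")
    await inbound.put(None)     # sentinel: no more output


async def receive(inbound: asyncio.Queue, log: list) -> None:
    """Collect replies while the sender is still running."""
    while (message := await inbound.get()) is not None:
        log.append(message)


async def converse(frames: list) -> list:
    outbound, inbound, log = asyncio.Queue(), asyncio.Queue(), []
    # All three coroutines run concurrently on one event loop.
    await asyncio.gather(
        send_frames(outbound, frames),
        fake_model(outbound, inbound),
        receive(inbound, log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(converse(["frame-1", "frame-2", "frame-3"])))
```

In a real client the two queues would be replaced by the SDK's input and output streams, but the send task and the receive task would still be separate and concurrent.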

## 03 · The event lifecycle, in order

Every conversation walks the same scripted sequence. Get the order wrong and the model loses state, so it's worth memorising:

1. **`sessionStart`** — carries the `inferenceConfiguration` (`maxTokens`, `topP`, `temperature`).
2. **`promptStart`** — defines the audio output format and tool config, and assigns a unique **`promptName`** that must appear on every subsequent event.
3. **`contentStart` → content → `contentEnd`** — a three-part pattern repeated for each interaction type, where `contentStart` declares the content type and a **role** of `SYSTEM`, `USER`, `ASSISTANT`, or `TOOL`, and carries its own unique `contentName`.
4. **`promptEnd`**, then **`sessionEnd`** to close.

The two identifiers form a hierarchy: `promptName` ties the whole conversation together, while each `contentName` marks the boundaries of one content block. Conversation history, if you supply it, goes in **exactly once** — after the system prompt and before audio streaming begins — using the same `contentStart`/`textInput`/`contentEnd` pattern with `USER` and `ASSISTANT` roles per message.
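The ordering above can be sketched as a pair of helpers that build the events as plain dictionaries. This is a skeleton under stated assumptions, not the full schema: the event names, `promptName`, `contentName`, the roles, and the `inferenceConfiguration` keys come from the text, while the `type` and `role` key names, the `content` key on `textInput`, the numeric values, and the helper names (`text_block`, `opening_events`, `closing_events`, `event_names`) are illustrative. The real events carry additional fields, so check the input-events reference before sending anything.

```python
import uuid


def text_block(prompt_name: str, role: str, text: str) -> list:
    """One contentStart -> textInput -> contentEnd triple with its own contentName."""
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        {"event": {"contentStart": {**ids, "type": "TEXT", "role": role}}},  # key names assumed
        {"event": {"textInput": {**ids, "content": text}}},
        {"event": {"contentEnd": dict(ids)}},
    ]


def opening_events(system_prompt: str, history=()):
    """sessionStart, promptStart, system prompt, then history exactly once."""
    prompt_name = str(uuid.uuid4())  # must appear on every subsequent event
    events = [
        {"event": {"sessionStart": {"inferenceConfiguration": {
            "maxTokens": 1024, "topP": 0.9, "temperature": 0.7}}}},  # illustrative values
        {"event": {"promptStart": {
            "promptName": prompt_name,
            "audioOutputConfiguration": {"voiceId": "matthew", "sampleRateHertz": 24000},
        }}},
    ]
    events += text_block(prompt_name, "SYSTEM", system_prompt)
    for role, text in history:
        if role not in ("USER", "ASSISTANT"):
            raise ValueError(f"history role must be USER or ASSISTANT, got {role!r}")
        events += text_block(prompt_name, role, text)
    return prompt_name, events


def closing_events(prompt_name: str) -> list:
    """promptEnd, then sessionEnd."""
    return [
        {"event": {"promptEnd": {"promptName": prompt_name}}},
        {"event": {"sessionEnd": {}}},
    ]


def event_names(events: list) -> list:
    return [next(iter(e["event"])) for e in events]
```

The audio content block would sit between the two helpers' outputs: open it after `opening_events`, stream frames, close it, then send `closing_events`.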

## 04 · Streaming audio frames

Once the audio `contentStart` is open, you stream the microphone in. The docs are specific: **audio frames are approximately 32 ms each**, captured directly from the mic and sent immediately as `audioInput` events that reuse the same `contentName`. They should be streamed "in real-time as they're captured, maintaining the natural microphone sampling cadence" — don't batch them. **All audio frames share a single content container** until the conversation ends and it's explicitly closed.

    {
      "event": {
        "audioInput": {
          "promptName": "<uuid>",
          "contentName": "<uuid>",
          "content": "<base64EncodedAudioData>"
        }
      }
    }

The audio itself is **LPCM, 16-bit, mono** (`channelCount: 1`), base64-encoded, with `sampleRateHertz` of `8000`, `16000`, or `24000`. AWS's own console example captures the mic at **16 kHz in** and renders the model's voice at **24 kHz out** — input and output sample rates are configured independently.
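The frame arithmetic is worth doing once. At 16 kHz, a 32 ms frame is 512 samples, and at 2 bytes per 16-bit mono sample that is 1,024 bytes of raw PCM before base64 encoding. The sketch below computes that size and wraps a frame in the `audioInput` envelope shown above; the helper names `frame_bytes` and `audio_input_event` are hypothetical, and treating 32 ms as an exact figure is a simplification of the docs' "approximately".

```python
import base64
import json

BYTES_PER_SAMPLE = 2                    # 16-bit LPCM
ALLOWED_RATES = (8000, 16000, 24000)    # sample rates listed in the article


def frame_bytes(sample_rate_hz: int, frame_ms: int = 32, channels: int = 1) -> int:
    """Size in bytes of one raw PCM frame of the given duration."""
    if sample_rate_hz not in ALLOWED_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * channels


def audio_input_event(pcm: bytes, prompt_name: str, content_name: str) -> str:
    """Serialise one frame as an audioInput event (base64 payload)."""
    return json.dumps({
        "event": {
            "audioInput": {
                "promptName": prompt_name,
                "contentName": content_name,   # same contentName for every frame
                "content": base64.b64encode(pcm).decode("ascii"),
            }
        }
    })


if __name__ == "__main__":
    size = frame_bytes(16000)
    print(size)  # 1024
    silence = bytes(size)  # one frame of digital silence
    print(len(audio_input_event(silence, "prompt-1", "audio-1")))
```

Base64 inflates the payload by roughly a third, which matters if you are estimating upstream bandwidth.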

## 05 · What the model streams back

Output is events too, arriving while the user is still mid-sentence. The ones you handle in the receive loop:

- **`textOutput`** — text transcription of the user's speech (ASR, role `USER`) *and* the model's text reply (role `ASSISTANT`), so you get a live transcript for free.
- **`audioOutput`** — base64 audio chunks for the spoken reply; decode and play them as they land.
- **`toolUse`** — a function-call request naming the tool and carrying a `toolUseId`.

Two subtleties live in the output stream. First, a **barge-in** (the user talking over the model) is signalled in the documented event schema by **`stopReason: "INTERRUPTED"`** on a text `contentEnd`; treat it as your cue to stop playback and yield the floor. (AWS's Python sample also watches for an `{ "interrupted" : true }` marker in the streamed text to flip its own barge-in flag.) Second, `contentStart` can carry an `additionalModelFields` flag with **`generationStage: SPECULATIVE`**, marking text the model is generating ahead of confirmation; the sample code uses it to decide whether to display assistant text yet.
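A receive loop mostly comes down to dispatching on the event name. The following is a rough sketch of that dispatch over already-received JSON strings, not the SDK's actual response handling. The event names, `toolUseId`, and the `stopReason` value come from the text; the `content` and `role` keys on output events, the shape of `state`, and the function names are assumptions made for illustration.

```python
import base64
import json


def new_state() -> dict:
    """Hypothetical per-conversation state for the sketch."""
    return {"transcript": [], "playback": [], "pending_tools": [], "interrupted": False}


def handle_output_event(raw: str, state: dict) -> dict:
    """Update state from one output event delivered as a JSON string."""
    event = json.loads(raw).get("event", {})

    if "textOutput" in event:
        body = event["textOutput"]
        # role distinguishes the user's transcribed speech from the model's reply
        state["transcript"].append((body.get("role"), body.get("content", "")))
    elif "audioOutput" in event:
        # decode the base64 chunk and queue it for the speaker
        state["playback"].append(base64.b64decode(event["audioOutput"]["content"]))
    elif "toolUse" in event:
        state["pending_tools"].append(event["toolUse"]["toolUseId"])
    elif "contentEnd" in event:
        if event["contentEnd"].get("stopReason") == "INTERRUPTED":
            # barge-in: drop queued audio so the model stops talking
            state["playback"].clear()
            state["interrupted"] = True
    return state
```

Clearing the playback queue on interruption is a design choice for this sketch; a real player would also need to stop whatever chunk is already sounding.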

## 06 · Tools, RAG, and agentic flows

Nova Sonic isn't limited to its pretrained knowledge. It supports **tool use (function calling)** for "integration with external functions, APIs, and data sources," declared in the `toolConfiguration` on `promptStart` with a name, description, and JSON input schema. You can steer which tool fires with the `toolChoice` parameter. On the same machinery, the docs describe **Retrieval-Augmented Generation (RAG)** for knowledge grounding with enterprise data and **agentic flows** that compose multiple tool calls — all driven through the `TOOL` role and `toolResult` events without ever leaving the voice stream.

The loop is the natural one: the model emits a `toolUse` event, your code runs the function, and you feed the answer back via a `contentStart` (role `TOOL`) → `toolResult` → `contentEnd` triple bearing the original `toolUseId`. The model folds the result into its spoken reply.
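As a sketch of that return path, the helper below builds the three events for one tool result. The event names, the `TOOL` role, and the requirement to echo the original `toolUseId` come from the text. Placing `toolUseId` inside a `toolResultInputConfiguration` object, the `type` key, and sending the result as a JSON string under `content` are assumptions to verify against the input-events reference; `tool_result_events` is a hypothetical name.

```python
import json
import uuid


def tool_result_events(prompt_name: str, tool_use_id: str, result: dict) -> list:
    """Build contentStart(TOOL) -> toolResult -> contentEnd for one tool call."""
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        {"event": {"contentStart": {
            **ids,
            "type": "TOOL",   # assumed key and value
            "role": "TOOL",
            # assumed location for the id that links this result to the toolUse event
            "toolResultInputConfiguration": {"toolUseId": tool_use_id},
        }}},
        {"event": {"toolResult": {**ids, "content": json.dumps(result)}}},
        {"event": {"contentEnd": dict(ids)}},
    ]


if __name__ == "__main__":
    for event in tool_result_events("prompt-1", "tool-use-123", {"temperature_c": 21}):
        print(json.dumps(event))
```

Your own function runs between receiving `toolUse` and calling a helper like this, so the model keeps listening on the same stream while the tool executes.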

## Limits worth knowing

- **Fixed audio shape.** Input and output are LPCM, 16-bit, single channel only; sample rate must be one of 8000 / 16000 / 24000 Hz. There's no MP3/Opus path in the stream — encode to raw PCM first.
- **Event order is load-bearing.** `sessionStart` → `promptStart` → content triples → `promptEnd` → `sessionEnd`. The docs warn that skipping any closing event "can result in incomplete conversations or orphaned resources."
- **History goes in once.** Conversation history is allowed only after the system prompt and before audio begins — you can't splice it in mid-stream.
- **One audio container.** All audio frames live in a single content block keyed by one `contentName`; you open it once and close it once at the end.
- **Voices are tied to language.** Each locale has a fixed, named voice set (see below) — there's no arbitrary voice cloning, and English (GB) ships a single listed voice.
- **A newer generation exists.** The V1 model documented here is `amazon.nova-sonic-v1:0`; AWS now also publishes a separate **Amazon Nova 2 Sonic** guide, so confirm which generation you're targeting before wiring model IDs.

## Voices and languages

Nova Sonic ships **expressive voices across five languages** (with two English locales), each voice tied to a language:

| Language | Feminine-sounding | Masculine-sounding |
| --- | --- | --- |
| English (US) | `tiffany` | `matthew` |
| English (GB) | `amy` | — |
| French | `ambre` | `florian` |
| Italian | `beatrice` | `lorenzo` |
| German | `greta` | `lennart` |
| Spanish | `lupe` | `carlos` |

You pick one with `voiceId` in the `audioOutputConfiguration` on `promptStart`.
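If you let users choose a language, a small lookup keeps the voice and language in step. This sketch encodes the table above as a dictionary keyed by the table's own language labels and returns a fragment with the `voiceId` and `sampleRateHertz` fields named in this article. The `VOICES` constant, the `"feminine"`/`"masculine"` keys, and the function name are hypothetical, and the real `audioOutputConfiguration` has more fields than the two shown.

```python
VOICES = {
    "English (US)": {"feminine": "tiffany", "masculine": "matthew"},
    "English (GB)": {"feminine": "amy"},   # single listed voice
    "French": {"feminine": "ambre", "masculine": "florian"},
    "Italian": {"feminine": "beatrice", "masculine": "lorenzo"},
    "German": {"feminine": "greta", "masculine": "lennart"},
    "Spanish": {"feminine": "lupe", "masculine": "carlos"},
}


def audio_output_fragment(language: str, style: str = "feminine",
                          sample_rate_hz: int = 24000) -> dict:
    """Return the voice-related part of audioOutputConfiguration."""
    if language not in VOICES:
        raise ValueError(f"no voices listed for {language!r}")
    voices = VOICES[language]
    if style not in voices:
        raise ValueError(f"{language} has no {style} voice listed")
    return {"voiceId": voices[style], "sampleRateHertz": sample_rate_hz}


if __name__ == "__main__":
    print(audio_output_fragment("French", "masculine"))
```

Failing loudly on a missing combination, such as a masculine voice for English (GB), is friendlier than letting the service reject the `promptStart` event.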

## Try it in five minutes

Run AWS's console-based Python sample to hold a spoken conversation end to end. The key constants and model ID come straight from the docs:

    # pip install pyaudio and the experimental Bedrock streaming SDK
    import uuid

    # Import path as used in AWS's sample for the experimental Python SDK;
    # confirm it against the version you install.
    from aws_sdk_bedrock_runtime.client import (
        InvokeModelWithBidirectionalStreamOperationInput,
    )

    INPUT_SAMPLE_RATE  = 16000   # mic capture
    OUTPUT_SAMPLE_RATE = 24000   # model voice
    CHANNELS = 1                 # mono LPCM, 16-bit

    class SimpleNovaSonic:
        def __init__(self, model_id="amazon.nova-sonic-v1:0", region="us-east-1"):
            self.model_id = model_id
            self.region   = region
            self.client   = None   # Bedrock runtime client; construction omitted, see the full sample
            self.prompt_name        = str(uuid.uuid4())   # ties every event together
            self.audio_content_name = str(uuid.uuid4())   # one container for all frames

        async def start_session(self):
            # Assumes self.client is already initialised (endpoint, region, SigV4 credentials).
            self.stream = await self.client.invoke_model_with_bidirectional_stream(
                InvokeModelWithBidirectionalStreamOperationInput(model_id=self.model_id))
            # then: sessionStart -> promptStart -> contentStart(SYSTEM) ...
            #       -> contentStart(AUDIO) -> stream 32ms audioInput frames
            #       -> contentEnd -> promptEnd -> sessionEnd

Send `sessionStart` and `promptStart` (with `voiceId: "matthew"` and `sampleRateHertz: 24000` for output), open an audio `contentStart`, then pump 32 ms mic frames as `audioInput` events. Decode the `audioOutput` chunks and play them through your speakers, and watch for the documented `stopReason: "INTERRUPTED"` notification (AWS's sample also checks for an `{ "interrupted" : true }` marker) so you yield the floor when the user talks over the model. The full runnable file is in the `amazon-nova-samples` GitHub repo under `speech-to-speech/`.

---

**Verified against the official AWS docs on 2026-06-14.**

Sources:

- Using the Amazon Nova Sonic Speech-to-Speech model — https://docs.aws.amazon.com/nova/latest/userguide/speech.html
- Using the Bidirectional Streaming API — https://docs.aws.amazon.com/nova/latest/userguide/speech-bidirection.html
- Handling input events with the bidirectional API — https://docs.aws.amazon.com/nova/latest/userguide/input-events.html
- Speech-to-speech Example — https://docs.aws.amazon.com/nova/latest/userguide/s2s-example.html
- Voices available for Amazon Nova Sonic — https://docs.aws.amazon.com/nova/latest/userguide/available-voices.html
- Tool Use, RAG, and Agentic Flows with Amazon Nova Sonic — https://docs.aws.amazon.com/nova/latest/userguide/speech-tools.html

If the docs change, this tip is a snapshot of that day — check the sources for current behaviour.

---

*This page — research, writing, verification, and deployment — was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. A daily experiment by Monty van Emmerik · vanemmerik.ai*
