Google DeepMind has released Gemini 3.1 Flash TTS, a text-to-speech model designed to give developers and enterprises tighter control over AI-generated speech. The model is available in preview through the Gemini API and Google AI Studio, on Vertex AI for enterprise users, and within Google Vids for Workspace customers.
Audio Tags and Director-Style Controls
The most operationally significant addition is a system of audio tags, which let developers embed natural language commands directly into text input to adjust vocal style, pacing, tone, and accent at a fine-grained level. Google AI Studio exposes this through a layered configuration interface described in the announcement:
- Scene direction: Developers define an environment and dialogue context to keep characters consistent across multiple conversational turns.
- Speaker-level specificity: Unique audio profiles can be assigned per speaker, with director’s notes controlling pace, tone, and accent. Inline tags allow mid-sentence expression changes.
- Seamless export: Finalized configurations export directly as Gemini API code, enabling voice consistency across projects and platforms.
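The layered setup above (scene direction, per-speaker director's notes, inline tags) can be pictured as assembling one text prompt. The sketch below is illustrative only: the square-bracket tag syntax, the prompt layout, and the names `Speaker`, `Line`, and `build_prompt` are all assumptions, not the format documented for the Gemini API. It makes no API calls.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    name: str
    notes: str  # director's notes: pace, tone, accent (free-form text)


@dataclass
class Line:
    speaker: str
    text: str  # may carry inline tags such as "[whispering]" (assumed syntax)


def build_prompt(scene: str, speakers: list[Speaker], lines: list[Line]) -> str:
    """Combine scene direction, speaker profiles, and dialogue into one prompt string."""
    known = {s.name for s in speakers}
    for line in lines:
        # Every line of dialogue must belong to a speaker that has a profile.
        if line.speaker not in known:
            raise ValueError(f"no profile for speaker {line.speaker!r}")

    parts = [f"Scene: {scene}"]
    parts += [f"Director's notes for {s.name}: {s.notes}" for s in speakers]
    parts.append("")  # blank line between direction and dialogue
    parts += [f"{l.speaker}: {l.text}" for l in lines]
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        scene="A quiet library at night",
        speakers=[
            Speaker("Ana", "slow pace, warm tone"),
            Speaker("Ben", "fast pace, nervous tone"),
        ],
        lines=[
            Line("Ana", "Did you hear that? [whispering] Someone is here."),
            Line("Ben", "We should go. Now."),
        ],
    )
    print(prompt)
```

Keeping the profiles separate from the dialogue mirrors the stated goal of character consistency across turns: the same `Speaker` objects can be reused for every request in a conversation.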
Quality and Benchmark Performance
On the Artificial Analysis TTS leaderboard, which aggregates blind human preference evaluations, Gemini 3.1 Flash TTS recorded an Elo score of 1,211. Artificial Analysis placed the model in what it describes as its most attractive quadrant, citing a balance of high speech quality and low cost. The model also supports native multi-speaker dialogue and covers more than 70 languages, with style, pacing, and accent controls available across that language set.
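To give a rough sense of what an Elo gap means in a preference test, the sketch below applies the conventional Elo expected-score formula, E = 1 / (1 + 10^((R_B - R_A) / 400)). Two assumptions apply: the article does not say that the leaderboard uses this exact scale, and the competitor rating of 1,111 is a made-up value for illustration, not a leaderboard entry.

```python
def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the conventional Elo model.

    The 400-point scale is the chess convention; whether the leaderboard
    uses the same scale is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    gemini_tts = 1211.0    # score reported in the article
    hypothetical = 1111.0  # illustrative competitor, not a real entry
    p = expected_preference(gemini_tts, hypothetical)
    print(f"Expected preference rate: {p:.2f}")
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of blind pairwise comparisons, and equal ratings correspond to 50%.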
SynthID Watermarking
Every audio output from Gemini 3.1 Flash TTS is watermarked using Google’s SynthID technology. The watermark is described as imperceptible and interwoven directly into the audio signal, enabling detection of AI-generated content without audible degradation. For organizations concerned about synthetic media abuse, this means any audio produced by the model carries a detectable identifier that can be used to flag AI-generated content and help counter misinformation.
For security professionals evaluating AI-generated media risks, the SynthID integration represents a built-in provenance mechanism, though its effectiveness depends on downstream detection tooling and whether organizations in the audio pipeline actively check for watermarks. A model card covering safety and responsibility considerations is available from Google DeepMind alongside the release.
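The dependence on downstream checking can be made concrete with a small pipeline sketch. The article does not name a detection API, so the detector here is a hypothetical callable supplied by the caller; `AudioItem`, `label_provenance`, and the stub detector are illustrative names, not part of any Google library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical detector interface: takes raw audio bytes and returns True
# if a watermark is found. No real detection API is assumed here.
Detector = Callable[[bytes], bool]


@dataclass
class AudioItem:
    item_id: str
    audio: bytes
    watermark_found: Optional[bool] = None  # None means "never checked"


def label_provenance(items: Iterable[AudioItem], detector: Optional[Detector]) -> List[AudioItem]:
    """Run the detector on each item, or leave items unchecked if none is configured."""
    labeled = []
    for item in items:
        if detector is not None:
            item.watermark_found = detector(item.audio)
        labeled.append(item)
    return labeled


if __name__ == "__main__":
    # Stub detector for demonstration only: treats a fixed byte prefix as the mark.
    def stub(audio: bytes) -> bool:
        return audio.startswith(b"WM")

    clips = [AudioItem("a", b"WM...."), AudioItem("b", b"raw...")]
    for clip in label_provenance(clips, stub):
        print(clip.item_id, clip.watermark_found)
```

The `None` state is the point of the sketch: a pipeline that never runs a detector learns nothing from the watermark, however reliably it was embedded.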
