Google DeepMind has released Gemma 4 12B, a mid-sized open model that processes vision and audio inputs without dedicated encoders, instead routing those signals directly through the language model backbone. The release sits between the existing edge-focused E4B and the larger 26B Mixture of Experts (MoE) model in the Gemma 4 family.

Encoder-Free Architecture

Most multimodal models depend on separate encoder modules to translate image and audio data into representations the language model can consume, which adds latency and memory overhead. Gemma 4 12B replaces the vision encoder with a lightweight embedding module consisting of a single matrix multiplication, a positional embedding, and normalization layers, leaving visual processing to the LLM backbone itself. Audio handling is simplified further: the audio encoder is removed entirely, and raw audio signals are projected into the same dimensional space as text tokens.
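To make the idea concrete, here is a minimal NumPy sketch of an encoder-free front end of the kind described above. It is an illustration only, not DeepMind's implementation: the patch size, frame length, model width, choice of RMS normalization, and the order of operations are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each row to unit root-mean-square (no learned gain, for brevity)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def embed_image(image, w_proj, pos_emb, patch):
    """Assumed order: normalize, one matmul, add positions, normalize."""
    patches = rms_norm(patchify(image, patch))
    tokens = patches @ w_proj        # the single matrix multiplication
    tokens = tokens + pos_emb        # one positional vector per patch
    return rms_norm(tokens)

def embed_audio(waveform, w_audio, frame):
    """Project fixed-length frames of raw samples into the token space."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    return rms_norm(frames @ w_audio)

# Hypothetical sizes, chosen only to keep the example small.
D_MODEL, PATCH, FRAME = 64, 16, 160
rng = np.random.default_rng(0)

image = rng.random((64, 64, 3))               # 4 x 4 = 16 patches
waveform = rng.standard_normal(1600)          # 10 frames of 160 samples
text_tokens = rng.standard_normal((5, D_MODEL))

w_proj = rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.02
pos_emb = rng.standard_normal((16, D_MODEL)) * 0.02
w_audio = rng.standard_normal((FRAME, D_MODEL)) * 0.02

# All three modalities end up as rows of the same width, so a single
# backbone can consume them as one sequence.
sequence = np.concatenate([
    embed_image(image, w_proj, pos_emb, PATCH),
    embed_audio(waveform, w_audio, FRAME),
    text_tokens,
])
print(sequence.shape)
```

The point of the sketch is that nothing modality-specific runs beyond a projection and normalization; any further processing of the image and audio rows would happen inside the shared backbone.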

The practical result is a unified processing pipeline with a reduced memory footprint that DeepMind says fits within 16 GB of VRAM or unified memory, making local deployment on consumer laptops viable.
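For a rough sense of scale, the weights-only footprint of a model can be estimated from its parameter count and numeric precision. The sketch below assumes exactly 12 billion parameters (the article gives only the "12B" label) and ignores activations, the KV cache, and runtime overhead; it does not indicate which precision DeepMind's memory figure is based on.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Weights-only memory in GiB for a given parameter count and precision."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 12e9  # assumption: "12B" read as exactly 12 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 16-bit weights alone come to roughly 22 GiB, while 8-bit and 4-bit weights fall below 16 GiB before any runtime overhead is counted.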

Performance and Capabilities

According to DeepMind, benchmark performance approaches that of the 26B MoE model while consuming less than half its total memory. The model supports multi-step reasoning and agentic workflows, and is the first mid-sized Gemma model to accept native audio input alongside text and images.

Gemma 4 12B also ships with Multi-Token Prediction (MTP) drafters designed to reduce inference latency, which DeepMind describes as "drafter-ready" support.
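The article does not describe how the drafters are used at inference time. One common pattern for draft models is draft-and-verify (speculative) decoding, sketched below with toy stand-in functions. Everything here is hypothetical: `target_next` and `draft_next` are arithmetic placeholders rather than real models, and the greedy exact-match acceptance rule is a simplification.

```python
def target_next(tokens):
    """Toy 'large model': the next token is a fixed function of the last one."""
    return (tokens[-1] * 3 + 1) % 7

def draft_next(tokens):
    """Toy 'drafter': cheap guess that is wrong whenever the last token is 4."""
    if tokens[-1] == 4:
        return 0
    return target_next(tokens)

def speculative_step(tokens, k):
    """Draft k tokens, keep the prefix the target agrees with, add one target token."""
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    accepted, ctx = [], list(tokens)
    for t in draft:
        # A real system would score all k positions in one batched forward pass.
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)   # replace the first wrong guess and stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))   # every guess accepted: one extra token
    return accepted

def generate(prompt, n_new, k=4):
    """Generate n_new tokens; returns the tokens and the number of verify rounds."""
    tokens, steps = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(tokens, k))
        steps += 1
    return tokens[: len(prompt) + n_new], steps

def generate_baseline(prompt, n_new):
    """Plain one-token-at-a-time decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

out, steps = generate([1], 12)
print(out == generate_baseline([1], 12), steps)
```

In this scheme the output matches what the target model would have produced on its own, and any latency gain comes from needing fewer verification rounds than generated tokens.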

Licensing and Deployment Options

The model is released under the Apache 2.0 license, with pre-trained and instruction-tuned checkpoints available on Hugging Face and Kaggle. Supported inference toolchains include Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, and fine-tuning is supported through Unsloth. Cloud deployment options include Google Cloud’s Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE.

DeepMind also announced a Gemma Skills Repository, described as a library of agent-oriented capabilities built specifically for use with Gemma models.

Ecosystem Context

DeepMind noted that Gemma 4 models have collectively crossed 150 million downloads. The company cited community-built applications ranging from assistive robotics to enterprise AI security tooling as examples of how the model family is being used in production contexts.