Microsoft Releases VibeVoice-ASR: A Speech Recognition Model That Processes 60-Minute Audio in a Single Pass
On January 21, 2026, Microsoft officially released VibeVoice-ASR, a unified speech recognition model with 9B parameters capable of processing up to 60 minutes of audio in a single pass. Unlike traditional ASR models, VibeVoice-ASR does not segment audio into small chunks, which avoids the loss of global context and the speaker-tracking confusion that chunking introduces.
Core Innovation
60-Minute Single-Pass Inference Capability
VibeVoice-ASR removes traditional ASR's reliance on short-segment processing and handles up to 60 minutes of continuous audio in a single pass. Within a 64K-token context window, the model jointly performs recognition, speaker diarization, and timestamping in one inference.
Traditional ASR systems typically require:
- Segmenting audio into short clips
- Performing speech recognition separately
- Running speaker diarization separately
- Post-processing timestamp alignment
This approach leads to loss of global semantics and cross-segment speaker-tracking failures. VibeVoice-ASR solves these problems through an end-to-end unified architecture.
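The contrast can be sketched in a few lines of Python. The functions below are purely illustrative stand-ins, not the VibeVoice-ASR API (actual usage is documented in the GitHub repository):

```python
# Illustrative stand-ins only; these are NOT the VibeVoice-ASR API.

def chunked_pipeline(audio: list[float], chunk_len: int) -> list[str]:
    """Traditional multi-stage approach: each chunk is processed in isolation,
    so global context and cross-chunk speaker identities are easily lost."""
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return [f"chunk {i}: local ASR + local diarization + alignment" for i in range(len(chunks))]

def single_pass(audio: list[float]) -> str:
    """VibeVoice-ASR style: one inference over the whole recording, emitting
    recognition, speaker labels, and timestamps jointly."""
    return f"joint who/when/what transcript for {len(audio)} samples"

audio = [0.0] * (16_000 * 60)  # one minute of 16 kHz audio as a placeholder
print(chunked_pipeline(audio, chunk_len=16_000 * 30))  # two isolated chunks
print(single_pass(audio))                              # one unified result
```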
Structured Transcription Output
The model can output structured transcription text containing “Who, When, What”:
- Who: Accurately identifies different speakers
- When: Precise timestamp annotation
- What: High-quality text transcription
This structured output is particularly suitable for meeting minutes, interview transcription, podcast transcription, and other scenarios.
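As a sketch of what such "who/when/what" output could look like programmatically (the field names below are hypothetical, not the model's actual output schema, which is described on the model card):

```python
from dataclasses import dataclass

# Hypothetical container for one structured transcript segment; the actual
# output format of VibeVoice-ASR may differ.
@dataclass
class TranscriptSegment:
    speaker: str   # who
    start: float   # when (seconds)
    end: float
    text: str      # what

segments = [
    TranscriptSegment("Speaker 1", 0.0, 12.4, "Welcome to today's meeting."),
    TranscriptSegment("Speaker 2", 12.4, 30.1, "Thanks, let's start with the roadmap."),
]

for seg in segments:
    print(f"[{seg.start:07.2f}-{seg.end:07.2f}] {seg.speaker}: {seg.text}")
```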
Custom Hotwords Support
VibeVoice-ASR supports Customized Hotwords, allowing users to inject specific vocabulary such as:
- Proper nouns
- Technical terminology
- Background vocabulary
This significantly improves recognition accuracy for domain-specific or low-frequency words, making the model particularly suitable for professional settings such as medical, legal, and technical conferences.
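A minimal sketch of how a hotword list might be supplied, assuming a prompt-style injection; the actual parameter or prompt format used by VibeVoice-ASR may differ and should be taken from the model card:

```python
# Hypothetical hotword injection via a text prompt; the real VibeVoice-ASR
# interface may use a dedicated parameter instead.
hotwords = ["cpWER", "Qwen2", "myocardial infarction", "VibeVoice"]
hotword_prompt = (
    "The audio may contain the following terms; prefer them when plausible: "
    + ", ".join(hotwords)
)
print(hotword_prompt)
```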
Technical Architecture
Qwen2-Based Decoder
VibeVoice-ASR's architecture is built on a Qwen2-based decoder, featuring:
- 28 Transformer layers
- A hidden dimension of 3584
- Dual acoustic and semantic encoders
- A diffusion head design
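For orientation, the decoder scale described above can be expressed with a transformers Qwen2Config. Only the layer count, hidden size, and context length come from this article; everything else falls back to library defaults rather than the model's real config.json:

```python
from transformers import Qwen2Config

# Only num_hidden_layers, hidden_size, and max_position_embeddings reflect the
# figures cited above; other fields are library defaults, not VibeVoice-ASR's
# actual configuration (check the Hugging Face repo's config.json instead).
decoder_config = Qwen2Config(
    num_hidden_layers=28,           # 28 Transformer layers
    hidden_size=3584,               # hidden dimension of 3584
    max_position_embeddings=65536,  # 64K-token context window
)
print(decoder_config)
```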
64K Token-Level Long Context
Leveraging the ultra-long context window, the model produces:
- ASR (automatic speech recognition)
- Speaker diarization
- Timestamps
as a joint, end-to-end output, forming a complete speech-understanding loop.
Flash-Attention Optimization
Core computation relies on Flash-Attention, which optimizes inference efficiency over ultra-long sequences and keeps performance high when processing 60-minute audio.
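In the Hugging Face ecosystem, Flash-Attention is typically enabled at load time. Whether VibeVoice-ASR can be loaded through this generic Auto class is an assumption here; the repository's own loading code should be preferred:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: the checkpoint can be loaded via a transformers Auto class with
# trust_remote_code; the official loading path in the GitHub repo may differ.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-ASR",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    trust_remote_code=True,
)
```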
Performance
Comprehensive Performance Optimization
Through joint training, VibeVoice-ASR achieves competitive results on the following metrics:
- DER (Diarization Error Rate): significantly reduced
- cpWER (concatenated minimum-permutation Word Error Rate): superior to traditional multi-stage methods
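For reference, DER is usually evaluated with third-party tooling such as pyannote.metrics (not part of VibeVoice-ASR); a toy example with two reference speakers:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and hypothesis diarizations (times in seconds).
reference = Annotation()
reference[Segment(0.0, 10.0)] = "speaker_A"
reference[Segment(10.0, 20.0)] = "speaker_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk_1"   # 1 s of speaker confusion
hypothesis[Segment(11.0, 20.0)] = "spk_2"

der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER: {der:.3f}")  # fraction of reference speech time in error
```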
Standardized Deployment Environment
Deployment is supported on NVIDIA PyTorch containers (versions 24.07 through 25.12 verified), ensuring stable performance across different hardware environments.
Application Scenarios
VibeVoice-ASR is particularly suitable for:
Meeting Minutes
- Automatically generate complete meeting minutes
- Accurately label each speaker
- Precise timestamps for easy review
Interview Transcription
- Complete transcription of long interviews
- Multi-person conversation speaker separation
- Accurate recognition of professional terminology
Podcast Transcription
- Single-pass processing of long audio content
- Maintains global semantic coherence
- Automatically generates a timeline
Professional Domains
- Medical: Case discussions, surgical records
- Legal: Court records, testimony transcription
- Technical: Technical conferences, training courses
Open Source and Availability
VibeVoice-ASR is open-sourced on Hugging Face with test demos and is released under the MIT license, supporting:
- Free commercial use
- Local deployment
- Secondary development
Access
- HuggingFace: https://huggingface.co/microsoft/VibeVoice-ASR
- GitHub: https://github.com/microsoft/VibeVoice
- Technical Report: https://www.arxiv.org/pdf/2601.18184
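For local deployment, the checkpoint can be fetched with the huggingface_hub client; the repo id below comes from the links above, and the download directory is just an example:

```python
from huggingface_hub import snapshot_download

# Download the full model snapshot for offline/local use.
local_dir = snapshot_download(
    repo_id="microsoft/VibeVoice-ASR",
    local_dir="./vibevoice-asr",  # example target directory
)
print(f"Model files downloaded to: {local_dir}")
```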
VibeVoice Series
VibeVoice-ASR is part of the VibeVoice family, which also includes:
- VibeVoice-TTS: Text-to-speech model
- VibeVoice-Realtime-0.5B: Real-time speech synthesis model (only 0.5B parameters, 300ms response time)
All models use a unified technical framework:
- A continuous speech tokenizer (7.5 Hz)
- A next-token diffusion framework
- LLM reasoning over text and dialogue
- A diffusion head that generates acoustic details
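A back-of-envelope check using the numbers above: at 7.5 speech tokens per second, a full 60-minute recording stays comfortably within the 64K-token context window (ignoring text and prompt tokens):

```python
# Rough token-budget estimate from the figures cited in this article.
tokens_per_second = 7.5          # continuous speech tokenizer rate
audio_seconds = 60 * 60          # 60-minute recording
speech_tokens = tokens_per_second * audio_seconds
context_window = 64 * 1024       # 64K-token context

print(f"speech tokens: {speech_tokens:.0f}")                   # 27000
print(f"context used:  {speech_tokens / context_window:.1%}")  # ~41%
```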
Technical Significance
The release of VibeVoice-ASR marks important progress in speech recognition technology:
- Unified Architecture: Integrates multiple independent tasks into a single model
- Long Context Processing: Breaks through traditional ASR length limitations
- End-to-End Optimization: Avoids information loss from multi-stage processing
- Professional Support: Adapts to various vertical domains through hotword mechanism
This provides a more powerful and flexible solution for speech recognition in professional scenarios.