ComfyUI Wiki

Microsoft Releases VibeVoice-ASR - Speech Recognition Model Supporting 60-Minute Long Audio Single-Pass Processing

On January 21, 2026, Microsoft officially released VibeVoice-ASR, a unified speech recognition model with 9B parameters that can process up to 60 minutes of audio in a single pass. Unlike conventional ASR pipelines, VibeVoice-ASR does not split audio into short chunks, so it avoids the loss of global context and the speaker-tracking errors that chunking introduces.

Core Innovation

60-Minute Single-Pass Inference Capability

VibeVoice-ASR removes the usual reliance on short-segment processing and handles up to 60 minutes of continuous audio in a single pass. Using a 64K-token context window, the model performs recognition, speaker diarization, and timestamping jointly in one inference run.

Traditional ASR systems typically require:

  1. Segmenting audio into short clips
  2. Performing speech recognition separately
  3. Running speaker diarization separately
  4. Post-processing timestamp alignment

This multi-stage approach loses global semantic context and can fail to track speakers across segment boundaries. VibeVoice-ASR addresses both problems with an end-to-end unified architecture.

Structured Transcription Output

The model can output structured transcription text containing “Who, When, What”:

  • Who: Accurately identifies different speakers
  • When: Precise timestamp annotation
  • What: High-quality text transcription

This structured output is particularly suitable for meeting minutes, interview transcription, podcast transcription, and other scenarios.
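To make the "Who, When, What" structure concrete, here is a minimal Python sketch of how such output could be represented and rendered as meeting minutes. The `Segment` class and its field names are hypothetical and chosen only for illustration. They are not the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry. Field names are hypothetical, not the model's schema."""
    speaker: str   # Who
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_minutes(segments: list[Segment]) -> str:
    """Turn segments into readable minutes, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}] {s.speaker}: {s.text}" for s in ordered
    )
```

Keeping the speaker, time span, and text together in one record is what makes downstream uses such as minutes or searchable interview archives straightforward.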

Custom Hotwords Support

VibeVoice-ASR supports customized hotwords, letting users supply:

  • Proper nouns
  • Technical terminology
  • Background vocabulary

This can significantly improve recognition accuracy for domain-specific or low-frequency words, which is particularly useful in professional settings such as medicine, law, and technical conferences.

Technical Architecture

Qwen2-Based Decoder

VibeVoice-ASR’s architecture is built on a Qwen2 decoder and includes:

  • 28 Transformer layers
  • A hidden size of 3584
  • Dual acoustic and semantic encoders
  • A diffusion head

64K Token-Level Long Context

Using its long context window, the model performs three tasks:

  • ASR (automatic speech recognition)
  • Speaker diarization
  • Timestamping

All three are produced jointly, end to end, in a single output, forming a complete speech understanding loop.
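A rough back-of-the-envelope check shows why 60 minutes of audio can fit in this window. The sketch below assumes that the ASR model encodes its input at the 7.5 Hz tokenizer rate listed for the VibeVoice family in this article, and that "64K" means 65,536 tokens. Both are assumptions made for illustration, not confirmed details.

```python
FRAME_RATE_HZ = 7.5          # tokenizer rate listed for the VibeVoice family
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens


def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech tokens produced for audio of the given length."""
    return int(minutes * 60 * frame_rate_hz)


def remaining_budget(minutes: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Tokens left over for the prompt and the generated transcript."""
    return context_tokens - audio_tokens(minutes)


print(audio_tokens(60))       # 27000
print(remaining_budget(60))   # 38536
```

Under these assumptions, an hour of audio uses 27,000 tokens, leaving room in the same window for the transcript text, speaker labels, and timestamps.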

Flash-Attention Optimization

The core attention computation relies on Flash-Attention, which keeps inference efficient on ultra-long sequences such as 60-minute recordings.
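To see why this matters, consider the memory a naive attention implementation would need at full context length. Flash-Attention computes attention in tiles and never materializes the full score matrix, so its memory use grows linearly with sequence length rather than quadratically. The sketch below is illustrative arithmetic only. It assumes 16-bit scores and a 65,536-token sequence, and says nothing about the model's actual head count or memory footprint.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Memory for one full seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_element


seq_len = 64 * 1024  # assumption: "64K" means 65,536 tokens
gib = attention_matrix_bytes(seq_len) / 2**30
print(f"{gib:.1f} GiB per head")  # 8.0 GiB per head
```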

Performance

Comprehensive Performance Optimization

Through joint training, VibeVoice-ASR is competitive on the following metrics:

  • DER (Diarization Error Rate): Significantly reduced
  • cpWER (concatenated minimum-permutation Word Error Rate): Lower than traditional methods
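For readers unfamiliar with these metrics, the following Python sketch implements their standard definitions. It is a simplified illustration, not the evaluation code used for VibeVoice-ASR. The function names are hypothetical, and the brute-force permutation search is practical only for a small number of speakers.

```python
from itertools import permutations


def der(missed: float, false_alarm: float, confusion: float, total_speech: float) -> float:
    """Diarization error rate: erroneous time divided by total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cpwer(ref: dict[str, str], hyp: dict[str, str]) -> float:
    """cpWER over dicts mapping speaker label -> that speaker's concatenated text.

    Tries every assignment of hypothesis speakers to reference speakers and
    keeps the one with the fewest word errors. Assumes the reference is non-empty.
    """
    ref_words = [text.split() for text in ref.values()]
    hyp_words = [text.split() for text in hyp.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[]] * (n - len(ref_words))
    hyp_words += [[]] * (n - len(hyp_words))
    best = min(
        sum(word_errors(r, hyp_words[k]) for r, k in zip(ref_words, perm))
        for perm in permutations(range(n))
    )
    return best / sum(len(r) for r in ref_words)
```

Because cpWER scores each speaker's words against the best-matching hypothesis speaker, it penalizes both transcription mistakes and words attributed to the wrong speaker, which is why it suits a model that transcribes and diarizes jointly.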

Standardized Deployment Environment

The model supports the NVIDIA PyTorch Container (verified with versions 24.07 through 25.12), which provides a consistent runtime across different hardware environments.

Application Scenarios

VibeVoice-ASR is particularly suitable for:

Meeting Minutes

  • Automatically generate complete meeting minutes
  • Accurately label each speaker
  • Precise timestamps for easy review

Interview Transcription

  • Complete transcription of long interviews
  • Multi-person conversation speaker separation
  • Accurate recognition of professional terminology

Podcast Transcription

  • Single-pass processing of long audio content
  • Maintains global semantic coherence
  • Automatically generates timeline

Professional Domains

  • Medical: Case discussions, surgical records
  • Legal: Court records, testimony transcription
  • Technical: Technical conferences, training courses

Open Source and Availability

VibeVoice-ASR is open-sourced on Hugging Face, along with test demos, under the MIT license, which permits:

  • Free commercial use
  • Local deployment
  • Secondary development

VibeVoice Series

VibeVoice-ASR is part of the VibeVoice family, which also includes:

  • VibeVoice-TTS: Text-to-speech model
  • VibeVoice-Realtime-0.5B: Real-time speech synthesis model (only 0.5B parameters, 300 ms response time)

All models share a unified technical framework:

  • A continuous speech tokenizer (7.5 Hz)
  • A next-token diffusion framework
  • An LLM that handles text and dialogue reasoning
  • A diffusion head that generates acoustic details

Technical Significance

The release of VibeVoice-ASR marks important progress in speech recognition technology:

  1. Unified Architecture: Integrates multiple independent tasks into a single model
  2. Long Context Processing: Breaks through traditional ASR length limitations
  3. End-to-End Optimization: Avoids information loss from multi-stage processing
  4. Professional Support: Adapts to various vertical domains through the hotword mechanism

This provides a more powerful and flexible solution for speech recognition in professional scenarios.