ComfyUI Wiki

NVIDIA Releases PersonaPlex-7B-v1 - Full-Duplex Voice Dialogue Model

On January 20, 2026, NVIDIA Research officially launched PersonaPlex-7B-v1, a 7-billion-parameter full-duplex speech-to-speech dialogue model based on the Moshi architecture. The model abandons the traditional ASR → LLM → TTS cascade pipeline in favor of a unified Transformer architecture that handles speech understanding and generation synchronously within a single network, supporting natural interruptions, overlapping speech, rapid turn-taking, and context-aware backchannels.

Core Innovation

Full-Duplex Real-Time Interaction

The biggest breakthrough of PersonaPlex-7B-v1 lies in achieving true Full Duplex dialogue capability:

  • Listen While Speaking: The model can simultaneously listen to user input and generate responses
  • Natural Interruptions: Supports users interrupting AI speech at any time
  • Instant Feedback: Can produce backchannels like “uh-huh” and “right”
  • Authentic Rhythm: Simulates natural pauses and intonation changes in human conversation

Traditional voice AI uses a rigid three-stage process (speech recognition → large language model processing → speech synthesis). This “listen-think-speak” relay mode, while functional, lacks the fluidity of natural interaction, turning dialogue into a mechanical turn-based exchange.

PersonaPlex-7B-v1 processes continuous audio tokens through a dual-stream Transformer architecture, achieving parallel generation of text and speech without task handoffs or forced pauses.
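The difference from a cascade pipeline can be illustrated with a toy loop in which every incoming frame is ingested and an outgoing frame is emitted in the same step, so listening never blocks speaking. This is a minimal sketch only; the token values and the `toy_step` policy are invented for illustration and are not PersonaPlex internals.

```python
# Toy full-duplex loop: consume an input token AND emit an output token
# on every step, instead of a "listen" phase followed by a "speak" phase.

def toy_step(incoming, history):
    """Pick this frame's outgoing token while ingesting the incoming one."""
    history.append(incoming)
    # Backchannel while the user is talking; respond when they pause.
    return "uh-huh" if incoming != "silence" else "response-token"

def full_duplex_loop(user_stream):
    history, outgoing = [], []
    for frame in user_stream:                    # one iteration per audio frame
        outgoing.append(toy_step(frame, history))  # listen and speak together
    return outgoing

print(full_duplex_loop(["speech", "speech", "silence"]))
# One output frame per input frame -- there is no turn handoff.
```

A cascade system, by contrast, would drain the entire input stream before producing its first output frame.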

Ultra-Low Latency Response

In performance testing, PersonaPlex-7B-v1 excels:

  • Turn-taking Rate: 90.8%
  • Interruption Response Latency: As low as 240 milliseconds
  • Time to First Token (TTFT): Approximately 170 milliseconds
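Time to first token is conventionally measured as the wall-clock gap between submitting a request and receiving the first streamed token. The sketch below shows that measurement pattern; `toy_stream` is an invented stand-in for a real model's streaming interface, not part of the PersonaPlex release.

```python
import time

def measure_ttft(stream_fn, prompt):
    """Return seconds from request to the FIRST streamed token."""
    start = time.perf_counter()
    for _token in stream_fn(prompt):
        return time.perf_counter() - start   # stop at the first token
    return None                              # stream produced nothing

def toy_stream(prompt):
    """Hypothetical stand-in for a model's token stream."""
    time.sleep(0.01)                         # simulated model latency
    yield "first-token"
    yield "second-token"

ttft = measure_ttft(toy_stream, "hello")
print(f"TTFT: {ttft * 1000:.1f} ms")
```

With a real model the same harness would report figures in the range the article cites (~170 ms for PersonaPlex).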

These latency figures place PersonaPlex among the fastest open-source and commercial systems, providing users with a smooth experience close to real human conversation.

Hybrid Prompting Mechanism

PersonaPlex achieves precise role control through an innovative hybrid prompting mechanism:

Voice Prompt

  • Defines timbre and prosody
  • Controls speaking rate and emotional expression
  • Achieves high-fidelity voice cloning from just a few seconds of sample audio

Text Prompt

  • Sets role identity and business scenarios
  • Defines knowledge background and behavioral style
  • Can include structured information like names and organizations

System Prompt

  • Provides contextual information
  • Sets dialogue rules
  • Defines task objectives

This multi-dimensional prompting system enables PersonaPlex to flexibly adapt to various application scenarios, from professional tutors to customer service representatives, from creative virtual characters to technical support.
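The three channels described above might be assembled into a single request as sketched below. The field names and `build_hybrid_prompt` helper are assumptions for illustration; the release's actual API may structure this differently.

```python
# Hedged sketch: combining the voice, text, and system prompt channels
# into one request payload. Field names are hypothetical.

def build_hybrid_prompt(voice_sample_path, role_text, system_text):
    return {
        "voice_prompt": voice_sample_path,  # few-second clip: timbre, prosody, rate
        "text_prompt": role_text,           # role identity, knowledge, behavioral style
        "system_prompt": system_text,       # context, dialogue rules, task objectives
    }

request = build_hybrid_prompt(
    "tutor_voice.wav",
    "You are a patient math tutor named Lee at Hillside Academy.",
    "Keep each answer under 30 seconds of speech; ask a follow-up question.",
)
```

Separating identity (text prompt) from delivery (voice prompt) is what lets the same role be voiced differently, or the same voice play different roles.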

Technical Architecture

Unified Architecture Based on Moshi

PersonaPlex-7B-v1 is built on Moshi architecture, using end-to-end modeling:

  • Mimi Speech Encoder (ConvNet + Transformer): Maps raw audio to discrete audio tokens
  • Temporal Transformer: Models conversational rhythm in the temporal dimension (when to interrupt, when to wait)
  • Depth Transformer: Deep parsing of semantic intent and behavioral strategies
  • Mimi Speech Decoder (Transformer + ConvNet): Restores token sequences to high-fidelity speech

The audio sampling rate is 24 kHz, ensuring high-quality speech output.
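Structurally, the four stages listed above form an encode → model → decode chain. The sketch below shows only the data flow with placeholder functions; the 1,920-samples-per-frame figure assumes Mimi's published 12.5 Hz frame rate at 24 kHz, and everything else (shapes, vocabularies, internals) is omitted.

```python
# Structural sketch of the PersonaPlex processing chain. Each stage is a
# placeholder; only the shape of the data flow is meaningful here.

FRAME = 1920  # samples per token frame: 24_000 Hz / 12.5 Hz (assumed Mimi rate)

def mimi_encode(audio_24khz):          # ConvNet + Transformer
    return [f"tok{i}" for i in range(len(audio_24khz) // FRAME)]

def temporal_transformer(tokens):      # conversational timing: interrupt vs. wait
    return tokens

def depth_transformer(tokens):         # semantic intent and behavioral strategy
    return tokens

def mimi_decode(tokens):               # Transformer + ConvNet, back to waveform
    return [0.0] * (len(tokens) * FRAME)

audio_in = [0.0] * 24_000              # one second of 24 kHz audio
audio_out = mimi_decode(
    depth_transformer(temporal_transformer(mimi_encode(audio_in)))
)
print(len(mimi_encode(audio_in)), "token frames ->", len(audio_out), "samples")
```

One second of audio thus becomes only a dozen token frames, which is what keeps the Transformer stack fast enough for real-time dialogue.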

Underlying Language Model: Helium

PersonaPlex uses Helium as the underlying language model, providing:

  • Semantic understanding capability
  • Ability to generalize to out-of-distribution scenarios
  • Powerful context modeling

Training Data

PersonaPlex’s training data combines real conversations with high-quality synthetic corpora:

Real Conversation Data

  • Source: Fisher English corpus
  • Scale: 7,303 conversations, totaling 1,217 hours
  • Processing: Back-annotated with prompts using GPT-OSS-120B

Synthetic Conversation Data

Teaching Assistant Scenarios

  • Scale: 39,322 conversations, 410 hours
  • Generation: Qwen3-32B and GPT-OSS-120B generate text, Chatterbox TTS synthesizes speech

Customer Service Scenarios

  • Scale: 105,410 conversations, 1,840 hours
  • Domains: Covers multiple vertical domains including education, healthcare, and finance

This hybrid training strategy ensures the model has both authenticity and generalization capability.
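One simple way to realize such a mix is to sample training conversations in proportion to each corpus's hours. The hour counts below come from the figures above; the proportional-sampling policy itself is an assumption for illustration, not a documented detail of the training recipe.

```python
import random

# Hours per corpus, taken from the figures cited in this section.
corpora = {
    "fisher_real": 1217,      # real conversations (Fisher English)
    "teaching_synth": 410,    # synthetic teaching-assistant dialogues
    "customer_synth": 1840,   # synthetic customer-service dialogues
}

total = sum(corpora.values())                       # 3467 hours
weights = {name: hours / total for name, hours in corpora.items()}

def sample_corpus(rng):
    """Draw one corpus, weighted by its share of total hours."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
draws = [sample_corpus(rng) for _ in range(1000)]
print({name: draws.count(name) for name in corpora})
```

Under this policy synthetic data dominates by volume while the Fisher corpus still contributes roughly a third of samples, matching the authenticity-plus-generalization goal the section describes.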

Performance

On standard benchmarks, PersonaPlex-7B-v1 performs competitively:

Conversational Dynamics (FullDuplexBench)

  • PersonaPlex: 90.8
  • Moshi: 95.06
  • Freeze Omni: 60.68
  • Qwen 2.5 Omni: 86.53

Response Latency

  • PersonaPlex: 0.170 seconds
  • Moshi: 0.240 seconds
  • Freeze Omni: 0.205 seconds
  • Qwen 2.5 Omni: 0.953 seconds

Task Adherence

  • PersonaPlex: 4.29
  • Moshi: 4.40
  • Freeze Omni: 4.34
  • Qwen 2.5 Omni: 3.62

Application Scenarios

PersonaPlex-7B-v1 is suitable for various scenarios:

Intelligent Educational Assistance

Acts as a personalized teacher, explaining knowledge points with clear logic and vivid expression, stimulating learning interest and adapting to students of different cognitive levels.

Intelligent Customer Service

Competent in frontline roles in banking, telecommunications, insurance, and other industries, providing professional consultation based on customer needs while maintaining a patient and professional service attitude.

Role-Playing and Gaming

Plays various roles in games or simulation scenarios, providing immersive interactive experiences.

Virtual Companions

Provides daily conversational companionship, able to understand emotions and provide appropriate emotional feedback.

Professional Scenarios

Specialized settings such as space emergency management, where the model can provide professional guidance with appropriate emotional tones.

Open Source and Availability

PersonaPlex-7B-v1 is fully open source under permissive licensing:

  • Code: MIT License
  • Model Weights: NVIDIA Open Model License
  • Base Moshi Model: CC-BY-4.0

Developers can:

  • Download and use for free
  • Deploy and run locally
  • Perform secondary development and customization
  • Integrate into commercial applications

Technical Significance

The release of PersonaPlex-7B-v1 marks an important breakthrough in voice AI interaction:

  1. Architectural Innovation: From cascade pipeline to end-to-end unified processing
  2. Natural Interaction: Truly mastering the “breathing rhythm of human conversation”
  3. Low Barrier Deployment: Open-source model lowers technical and cost barriers for building natural conversational agents
  4. Wide Applications: Suitable for real-time translation, immersive game NPCs, advanced in-car assistants, and other domains

By open-sourcing PersonaPlex, NVIDIA provides a locally deployable, commercially viable solution for the voice AI field, advancing the development of next-generation human-computer interaction interfaces.