NVIDIA Releases PersonaPlex-7B-v1 - Full-Duplex Voice Dialogue Model
On January 20, 2026, NVIDIA Research launched PersonaPlex-7B-v1, a 7-billion-parameter full-duplex speech-to-speech dialogue model based on the Moshi architecture. The model abandons the traditional ASR → LLM → TTS cascade pipeline in favor of a unified Transformer that handles speech understanding and generation within a single network, supporting natural interruptions, overlapping speech, rapid turn-taking, and context-aware backchannels.
Core Innovation
Full-Duplex Real-Time Interaction
The biggest breakthrough of PersonaPlex-7B-v1 is true full-duplex dialogue:
- Listen While Speaking: The model listens to user input while generating its own response
- Natural Interruptions: Users can interrupt the AI's speech at any time
- Instant Feedback: Produces backchannels like “uh-huh” and “right”
- Authentic Rhythm: Reproduces the natural pauses and intonation changes of human conversation
Traditional voice AI uses a rigid three-stage process (speech recognition → large language model processing → speech synthesis). This “listen-think-speak” relay mode, while functional, never feels like natural interaction: it turns dialogue into a mechanical, walkie-talkie-style exchange of turns.
PersonaPlex-7B-v1 processes continuous audio tokens through dual-stream Transformer architecture, achieving parallel generation of text and speech without task handoffs or forced pauses.
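The dual-stream idea can be illustrated with a toy frame loop. Every name below is hypothetical, and the real model fuses both streams inside a single Transformer rather than branching in Python; the sketch only shows the framing: at each audio frame the system both ingests a user token and emits an agent token, so yielding the floor, backchanneling, and speaking are all per-frame decisions rather than turn handoffs.

```python
# Toy illustration of full-duplex framing (all names hypothetical).
# At every frame the model consumes a user token AND emits an agent token;
# there is no explicit turn handoff, only per-frame output choices.

def toy_duplex_step(user_token: str, agent_state: dict) -> str:
    """Return the agent's output token for one audio frame."""
    if user_token != "<silence>":
        agent_state["user_speaking"] = True
        return "<pad>"              # yield the floor while the user talks
    if agent_state.pop("user_speaking", False):
        return "uh-huh"             # backchannel right after user speech ends
    return "word"                   # otherwise keep generating speech

state: dict = {}
frames = ["<silence>", "hello", "there", "<silence>", "<silence>"]
out = [toy_duplex_step(f, state) for f in frames]
print(out)  # ['word', '<pad>', '<pad>', 'uh-huh', 'word']
```

The user interrupting mid-utterance simply shows up as non-silence tokens, and the agent's stream reacts on the very next frame, which is what makes sub-second interruption latency possible.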
Ultra-Low Latency Response
In performance testing, PersonaPlex-7B-v1 excels:
- Turn-taking Rate: 90.8%
- Interruption Response Latency: As low as 240 milliseconds
- Time to First Token (TTFT): Approximately 170 milliseconds
On latency, these figures beat existing open-source and commercial systems, while conversational-dynamics and task-adherence scores remain competitive with the strongest baselines (full benchmarks below), giving users a smooth experience close to real human conversation.
Hybrid Prompting Mechanism
PersonaPlex achieves precise role control through an innovative hybrid prompting mechanism:
Voice Prompt
- Defines timbre and prosody
- Controls speaking rate and emotional expression
- Achieves high-fidelity voice cloning from just a few seconds of reference audio
Text Prompt
- Sets role identity and business scenarios
- Defines knowledge background and behavioral style
- Can include structured information like names and organizations
System Prompt
- Provides contextual information
- Sets dialogue rules
- Defines task objectives
This multi-dimensional prompting system enables PersonaPlex to flexibly adapt to various application scenarios, from professional tutors to customer service representatives, from creative virtual characters to technical support.
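To make the three prompt types concrete, here is a sketch of how they might be bundled into a single inference request. The field names, the reference audio file, and the persona are all illustrative assumptions, not the actual PersonaPlex API:

```python
# Illustrative only: how the three prompt types described above might be
# combined into one request. Field names are hypothetical, not the real API.
import json

request = {
    # Voice prompt: a few seconds of reference audio defining timbre/prosody
    "voice_prompt": "reference_speaker.wav",
    # Text prompt: role identity, knowledge background, behavioral style
    "text_prompt": (
        "You are Dana, a support agent at Acme Telecom. "
        "You are patient, concise, and professional."
    ),
    # System prompt: dialogue rules and task objectives
    "system_prompt": (
        "Answer billing questions only; escalate technical faults to a human."
    ),
}
payload = json.dumps(request, indent=2)
print(payload)
```

Keeping the three channels separate means the same cloned voice can be reused across roles, and the same role can be re-voiced, without retraining anything.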
Technical Architecture
Unified Architecture Based on Moshi
PersonaPlex-7B-v1 is built on Moshi architecture, using end-to-end modeling:
- Mimi Speech Encoder (ConvNet + Transformer): Maps raw audio to discrete audio tokens
- Temporal Transformer: Models conversational rhythm in the temporal dimension (when to interrupt, when to wait)
- Depth Transformer: Deep parsing of semantic intent and behavioral strategies
- Mimi Speech Decoder (Transformer + ConvNet): Restores token sequences to high-fidelity speech
Audio sampling rate reaches 24kHz, ensuring high-quality speech output.
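As a rough sketch of what 24 kHz audio means at the token level, the streaming budget can be worked out assuming Mimi's published 12.5 Hz frame rate and 8 codebooks per frame. Those two figures come from the Moshi paper, not from this article, so treat the numbers as an estimate:

```python
# Back-of-envelope token budget for streaming audio.
# SAMPLE_RATE_HZ is stated in the article; FRAME_RATE_HZ and CODEBOOKS are
# assumptions taken from Mimi's published configuration in the Moshi paper.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5          # assumed Mimi frame rate
CODEBOOKS = 8                 # assumed codebooks per frame

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS
frame_duration_ms = 1000 / FRAME_RATE_HZ

print(samples_per_frame)    # 1920.0 raw samples compressed into one frame
print(tokens_per_second)    # 100.0 discrete tokens per second of audio
print(frame_duration_ms)    # 80.0 ms per frame
```

Under these assumptions each 80 ms frame is one Temporal Transformer step, which is consistent with the reported ~170 ms time to first token: roughly two frames of work before audible output.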
Underlying Language Model: Helium
PersonaPlex uses Helium as the underlying language model, providing:
- Semantic understanding capability
- Ability to generalize to out-of-distribution scenarios
- Powerful context modeling
Training Data
PersonaPlex’s training data combines real conversations with high-quality synthetic corpora:
Real Conversation Data
- Source: Fisher English corpus
- Scale: 7,303 conversations, totaling 1,217 hours
- Processing: Back-annotated with prompts using GPT-OSS-120B
Synthetic Conversation Data
Teaching Assistant Scenarios
- Scale: 39,322 conversations, 410 hours
- Generation: Qwen3-32B and GPT-OSS-120B generate text, Chatterbox TTS synthesizes speech
Customer Service Scenarios
- Scale: 105,410 conversations, 1,840 hours
- Domains: Covers multiple vertical domains including education, healthcare, and finance
This hybrid training strategy ensures the model has both authenticity and generalization capability.
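Using the hour counts stated above, the overall mix works out as follows. This is a quick sanity check on the article's own numbers, not an official breakdown:

```python
# Composition of the training mix, from the hour counts stated above.
real_hours = 1_217             # Fisher English conversations
synthetic_hours = 410 + 1_840  # teaching-assistant + customer-service corpora

total_hours = real_hours + synthetic_hours
synthetic_share = synthetic_hours / total_hours

print(total_hours)                # 3467
print(round(synthetic_share, 3))  # 0.649
```

So roughly two-thirds of the audio is synthetic, with the Fisher corpus anchoring the model in real conversational dynamics.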
Performance
In benchmark tests, PersonaPlex-7B-v1 leads on latency and stays competitive with the strongest baselines on the other metrics:

| System | Conversational Dynamics (FullDuplexBench, higher is better) | Response Latency (seconds, lower is better) | Task Adherence (higher is better) |
| --- | --- | --- | --- |
| PersonaPlex | 90.8 | 0.170 | 4.29 |
| Moshi | 95.06 | 0.240 | 4.40 |
| Freeze Omni | 60.68 | 0.205 | 4.34 |
| Qwen 2.5 Omni | 86.53 | 0.953 | 3.62 |
Application Scenarios
PersonaPlex-7B-v1 is suitable for various scenarios:
Intelligent Educational Assistance
Acts as a personalized tutor, explaining concepts clearly and vividly, sparking interest in learning, and adapting to students at different cognitive levels.
Intelligent Customer Service
Handles frontline roles in banking, telecommunications, insurance, and other industries, providing professional consultation tailored to customer needs while maintaining a patient, professional service attitude.
Role-Playing and Gaming
Plays various roles in games or simulation scenarios, providing immersive interactive experiences.
Virtual Companions
Provides daily conversational companionship, able to understand emotions and provide appropriate emotional feedback.
Professional Scenarios
Specialized settings such as space emergency management, where the model can deliver professional guidance with an appropriate emotional tone.
Open Source and Availability
PersonaPlex-7B-v1 is fully open source under permissive licensing:
- Code: MIT License
- Model Weights: NVIDIA Open Model License
- Base Moshi Model: CC-BY-4.0
Developers can:
- Download and use for free
- Deploy and run locally
- Perform secondary development and customization
- Integrate into commercial applications
Access
- HuggingFace: https://huggingface.co/nvidia/personaplex-7b-v1
- GitHub: https://github.com/nvidia/personaplex
- Research Page: https://research.nvidia.com/labs/adlr/personaplex/
Technical Significance
The release of PersonaPlex-7B-v1 marks an important breakthrough in voice AI interaction:
- Architectural Innovation: From cascade pipeline to end-to-end unified processing
- Natural Interaction: Truly mastering the “breathing rhythm of human conversation”
- Low Barrier Deployment: Open-source model lowers technical and cost barriers for building natural conversational agents
- Wide Applications: Suitable for real-time translation, immersive game NPCs, advanced in-car assistants, and other domains
By open-sourcing PersonaPlex, NVIDIA provides a locally deployable, commercially viable solution for the voice AI field, advancing the development of next-generation human-computer interaction interfaces.