ComfyUI Wiki

NVIDIA Releases PersonaPlex-7B-v1 - Full-Duplex Voice Dialogue Model

On January 20, 2026, NVIDIA Research officially launched PersonaPlex-7B-v1, a 7-billion-parameter full-duplex speech-to-speech dialogue model based on the Moshi architecture. The model abandons the traditional ASR → LLM → TTS cascade pipeline in favor of a unified Transformer architecture that handles speech understanding and generation synchronously within a single network, supporting natural interruptions, overlapping speech, rapid turn-taking, and context-aware backchannels.

Core Innovation

Full-Duplex Real-Time Interaction

The biggest breakthrough of PersonaPlex-7B-v1 lies in achieving true Full Duplex dialogue capability:

  • Listen While Speaking: The model can simultaneously listen to user input and generate responses
  • Natural Interruptions: Supports users interrupting AI speech at any time
  • Instant Feedback: Can produce backchannels like “uh-huh” and “right”
  • Authentic Rhythm: Simulates natural pauses and intonation changes in human conversation

Traditional voice AI uses a rigid three-stage process (speech recognition → large language model processing → speech synthesis). This “listen-think-speak” relay mode, while functional, lacks the fluidity of natural interaction, turning dialogue into a mechanical turn-based exchange.

PersonaPlex-7B-v1 processes continuous audio tokens through a dual-stream Transformer architecture, achieving parallel generation of text and speech without task handoffs or forced pauses.
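The difference from a cascade pipeline can be illustrated with a toy loop in which every incoming frame is ingested and an outgoing frame is emitted in the same step, so listening never blocks speaking. This is a minimal sketch only; the token values and the `toy_step` policy are invented for illustration and are not PersonaPlex internals.

```python
# Toy full-duplex loop: consume an input token AND emit an output token
# on every step, instead of a "listen" phase followed by a "speak" phase.

def toy_step(incoming, history):
    """Pick this frame's outgoing token while ingesting the incoming one."""
    history.append(incoming)
    # Backchannel while the user is talking; respond when they pause.
    return "uh-huh" if incoming != "silence" else "response-token"

def full_duplex_loop(user_stream):
    history, outgoing = [], []
    for frame in user_stream:                    # one iteration per audio frame
        outgoing.append(toy_step(frame, history))  # listen and speak together
    return outgoing

print(full_duplex_loop(["speech", "speech", "silence"]))
# One output frame per input frame -- there is no turn handoff.
```

A cascade system, by contrast, would drain the entire input stream before producing its first output frame.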

Ultra-Low Latency Response

In performance testing, PersonaPlex-7B-v1 excels:

  • Turn-taking Rate: 90.8%
  • Interruption Response Latency: As low as 240 milliseconds
  • Time to First Token (TTFT): Approximately 170 milliseconds
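Time to first token is conventionally measured as the wall-clock gap between submitting a request and receiving the first streamed token. The sketch below shows that measurement pattern; `toy_stream` is an invented stand-in for a real model's streaming interface, not part of the PersonaPlex release.

```python
import time

def measure_ttft(stream_fn, prompt):
    """Return seconds from request to the FIRST streamed token."""
    start = time.perf_counter()
    for _token in stream_fn(prompt):
        return time.perf_counter() - start   # stop at the first token
    return None                              # stream produced nothing

def toy_stream(prompt):
    """Hypothetical stand-in for a model's token stream."""
    time.sleep(0.01)                         # simulated model latency
    yield "first-token"
    yield "second-token"

ttft = measure_ttft(toy_stream, "hello")
print(f"TTFT: {ttft * 1000:.1f} ms")
```

With a real model the same harness would report figures in the range the article cites (~170 ms for PersonaPlex).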

These latency figures place PersonaPlex among the fastest open-source and commercial systems, providing users with a smooth experience close to real human conversation.

Hybrid Prompting Mechanism

PersonaPlex achieves precise role control through an innovative hybrid prompting mechanism:

Voice Prompt

  • Defines timbre and prosody
  • Controls speaking rate and emotional expression
  • Achieves high-fidelity voice cloning from just a few seconds of sample audio

Text Prompt

  • Sets role identity and business scenarios
  • Defines knowledge background and behavioral style
  • Can include structured information like names and organizations

System Prompt

  • Provides contextual information
  • Sets dialogue rules
  • Defines task objectives

This multi-dimensional prompting system enables PersonaPlex to flexibly adapt to various application scenarios, from professional tutors to customer service representatives, from creative virtual characters to technical support.
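The three channels described above might be assembled into a single request as sketched below. The field names and `build_hybrid_prompt` helper are assumptions for illustration; the release's actual API may structure this differently.

```python
# Hedged sketch: combining the voice, text, and system prompt channels
# into one request payload. Field names are hypothetical.

def build_hybrid_prompt(voice_sample_path, role_text, system_text):
    return {
        "voice_prompt": voice_sample_path,  # few-second clip: timbre, prosody, rate
        "text_prompt": role_text,           # role identity, knowledge, behavioral style
        "system_prompt": system_text,       # context, dialogue rules, task objectives
    }

request = build_hybrid_prompt(
    "tutor_voice.wav",
    "You are a patient math tutor named Lee at Hillside Academy.",
    "Keep each answer under 30 seconds of speech; ask a follow-up question.",
)
```

Separating identity (text prompt) from delivery (voice prompt) is what lets the same role be voiced differently, or the same voice play different roles.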

Technical Architecture

Unified Architecture Based on Moshi

PersonaPlex-7B-v1 is built on Moshi architecture, using end-to-end modeling:

  • Mimi Speech Encoder (ConvNet + Transformer): Maps raw audio to discrete audio tokens
  • Temporal Transformer: Models conversational rhythm in the temporal dimension (when to interrupt, when to wait)
  • Depth Transformer: Deep parsing of semantic intent and behavioral strategies
  • Mimi Speech Decoder (Transformer + ConvNet): Restores token sequences to high-fidelity speech

The audio sampling rate is 24 kHz, ensuring high-quality speech output.
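Structurally, the four stages listed above form an encode → model → decode chain. The sketch below shows only the data flow with placeholder functions; the 1,920-samples-per-frame figure assumes Mimi's published 12.5 Hz frame rate at 24 kHz, and everything else (shapes, vocabularies, internals) is omitted.

```python
# Structural sketch of the PersonaPlex processing chain. Each stage is a
# placeholder; only the shape of the data flow is meaningful here.

FRAME = 1920  # samples per token frame: 24_000 Hz / 12.5 Hz (assumed Mimi rate)

def mimi_encode(audio_24khz):          # ConvNet + Transformer
    return [f"tok{i}" for i in range(len(audio_24khz) // FRAME)]

def temporal_transformer(tokens):      # conversational timing: interrupt vs. wait
    return tokens

def depth_transformer(tokens):         # semantic intent and behavioral strategy
    return tokens

def mimi_decode(tokens):               # Transformer + ConvNet, back to waveform
    return [0.0] * (len(tokens) * FRAME)

audio_in = [0.0] * 24_000              # one second of 24 kHz audio
audio_out = mimi_decode(
    depth_transformer(temporal_transformer(mimi_encode(audio_in)))
)
print(len(mimi_encode(audio_in)), "token frames ->", len(audio_out), "samples")
```

One second of audio thus becomes only a dozen token frames, which is what keeps the Transformer stack fast enough for real-time dialogue.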

Underlying Language Model: Helium

PersonaPlex uses Helium as the underlying language model, providing:

  • Semantic understanding capability
  • Ability to generalize to out-of-distribution scenarios
  • Powerful context modeling

Training Data

PersonaPlex’s training data combines real conversations with high-quality synthetic corpora:

Real Conversation Data

  • Source: Fisher English corpus
  • Scale: 7,303 conversations, totaling 1,217 hours
  • Processing: Back-annotated with prompts using GPT-OSS-120B

Synthetic Conversation Data

Teaching Assistant Scenarios

  • Scale: 39,322 conversations, 410 hours
  • Generation: Qwen3-32B and GPT-OSS-120B generate text, Chatterbox TTS synthesizes speech

Customer Service Scenarios

  • Scale: 105,410 conversations, 1,840 hours
  • Domains: Covers multiple vertical domains including education, healthcare, and finance

This hybrid training strategy ensures the model has both authenticity and generalization capability.
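One simple way to realize such a mix is to sample training conversations in proportion to each corpus's hours. The hour counts below come from the figures above; the proportional-sampling policy itself is an assumption for illustration, not a documented detail of the training recipe.

```python
import random

# Hours per corpus, taken from the figures cited in this section.
corpora = {
    "fisher_real": 1217,      # real conversations (Fisher English)
    "teaching_synth": 410,    # synthetic teaching-assistant dialogues
    "customer_synth": 1840,   # synthetic customer-service dialogues
}

total = sum(corpora.values())                       # 3467 hours
weights = {name: hours / total for name, hours in corpora.items()}

def sample_corpus(rng):
    """Draw one corpus, weighted by its share of total hours."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
draws = [sample_corpus(rng) for _ in range(1000)]
print({name: draws.count(name) for name in corpora})
```

Under this policy synthetic data dominates by volume while the Fisher corpus still contributes roughly a third of samples, matching the authenticity-plus-generalization goal the section describes.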

Performance

On standard benchmarks, PersonaPlex-7B-v1 performs competitively:

Conversational Dynamics (FullDuplexBench)

  • PersonaPlex: 90.8
  • Moshi: 95.06
  • Freeze Omni: 60.68
  • Qwen 2.5 Omni: 86.53

Response Latency

  • PersonaPlex: 0.170 seconds
  • Moshi: 0.240 seconds
  • Freeze Omni: 0.205 seconds
  • Qwen 2.5 Omni: 0.953 seconds

Task Adherence

  • PersonaPlex: 4.29
  • Moshi: 4.40
  • Freeze Omni: 4.34
  • Qwen 2.5 Omni: 3.62

Application Scenarios

PersonaPlex-7B-v1 is suitable for various scenarios:

Intelligent Educational Assistance

Acts as a personalized teacher, explaining knowledge points with clear logic and vivid expression, stimulating learning interest and adapting to students of different cognitive levels.

Intelligent Customer Service

Competent in frontline roles in banking, telecommunications, insurance, and other industries, providing professional consultation based on customer needs while maintaining a patient and professional service attitude.

Role-Playing and Gaming

Plays various roles in games or simulation scenarios, providing immersive interactive experiences.

Virtual Companions

Provides daily conversational companionship, able to understand emotions and provide appropriate emotional feedback.

Professional Scenarios

Specialized settings such as space emergency management, where the model can provide professional guidance with appropriate emotional tones.

Open Source and Availability

PersonaPlex-7B-v1 is fully open source under permissive licensing:

  • Code: MIT License
  • Model Weights: NVIDIA Open Model License
  • Base Moshi Model: CC-BY-4.0

Developers can:

  • Download and use for free
  • Deploy and run locally
  • Perform secondary development and customization
  • Integrate into commercial applications

Technical Significance

The release of PersonaPlex-7B-v1 marks an important breakthrough in voice AI interaction:

  1. Architectural Innovation: From cascade pipeline to end-to-end unified processing
  2. Natural Interaction: Truly mastering the “breathing rhythm of human conversation”
  3. Low Barrier Deployment: Open-source model lowers technical and cost barriers for building natural conversational agents
  4. Wide Applications: Suitable for real-time translation, immersive game NPCs, advanced in-car assistants, and other domains

By open-sourcing PersonaPlex, NVIDIA provides a locally deployable, commercially viable solution for the voice AI field, advancing the development of next-generation human-computer interaction interfaces.