PersonaPlex Quick Guide: Full-Duplex Conversational Speech with Persona and Voice Control

A concise guide to PersonaPlex capabilities, setup, and prompting, including server launch, offline evaluation, and role/voice control.

PersonaPlex is a real-time full-duplex speech-to-speech conversational model. It provides two key control dimensions:

  • text prompts for role/persona control
  • audio conditioning for voice style control

It is built on the Moshi architecture and weights, and aims for low latency and more natural spoken interaction with consistent persona behavior.

What It Is Good For

Common use cases include:

  • real-time voice assistants
  • customer-service style role interactions
  • low-latency conversational demos
  • persona + voice control experiments

Prerequisites

Install the Opus audio codec development library:

# Ubuntu/Debian
sudo apt install libopus-dev

# Fedora/RHEL
sudo dnf install opus-devel
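After installing, you can confirm the development files are discoverable by the build toolchain. This check assumes pkg-config is available on your system:

```shell
# Check whether the Opus development files are visible to the toolchain.
# Prints the detected version when found, "opus not found" otherwise.
pkg-config --exists opus && echo "opus $(pkg-config --modversion opus)" || echo "opus not found"
```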

Installation and Environment

Install from the repository checkout:

pip install moshi/.

For Blackwell GPUs, first install PyTorch wheels built for CUDA 13.0:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

After accepting the PersonaPlex model license on Hugging Face, set your token:

export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

Launch Live Server

Standard launch (temporary SSL):

SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR"

If GPU memory is limited, enable CPU offload (accelerate required):

pip install accelerate
SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR" --cpu-offload

For local runs, open localhost:8998 in a browser; for remote setups, use the access link the server prints on startup.

Offline Evaluation

The offline script consumes an input WAV and produces an output WAV of the same duration. The first example uses only a voice prompt:

HF_TOKEN=<TOKEN> \
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

The second example adds a text prompt for a customer-service role:

HF_TOKEN=<TOKEN> \
python -m moshi.offline \
  --voice-prompt "NATM1.pt" \
  --text-prompt "$(cat assets/test/prompt_service.txt)" \
  --input-wav "assets/test/input_service.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

Built-in Voice Labels

  • Natural (female): NATF0, NATF1, NATF2, NATF3
  • Natural (male): NATM0, NATM1, NATM2, NATM3
  • Variety (female): VARF0, VARF1, VARF2, VARF3, VARF4
  • Variety (male): VARM0, VARM1, VARM2, VARM3, VARM4
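To compare voices under otherwise identical conditions, the offline command above can be looped over several labels. This sketch only prints the commands rather than running them; the input path and flags follow the earlier examples:

```shell
# Dry-run sketch: emit one offline-evaluation command per voice label.
# Labels come from the built-in list above; paths are illustrative.
for voice in NATF2 NATM1 VARF0 VARM0; do
  echo "python -m moshi.offline --voice-prompt ${voice}.pt --input-wav assets/test/input_assistant.wav --seed 42424242 --output-wav out_${voice}.wav --output-text out_${voice}.json"
done
```

Fixing the seed and input WAV while varying only the voice prompt keeps the comparison controlled.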

Prompting Tips

Training coverage mainly includes:

  • Assistant Role
  • Customer Service Roles
  • Casual Conversations

Practical tips:

  • define role identity first, then add task context
  • keep prompts concise to reduce persona drift
  • reuse the same voice prompt for stable comparisons
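As an illustration of the tips above, a prompt can state the role identity first and then the task context. The wording and filename here are hypothetical, not taken from the model's assets:

```shell
# Write a short role prompt: identity first, then task context.
cat > my_prompt.txt <<'EOF'
You are a customer service agent for an airline.
Help the caller rebook a delayed flight. Keep replies short and polite.
EOF
cat my_prompt.txt
```

The resulting file can be passed via `--text-prompt "$(cat my_prompt.txt)"` as in the offline example above.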

Summary

PersonaPlex stands out not because it produces smarter individual answers, but because it keeps persona and voice behavior consistent during real-time speech interaction.

For full-duplex voice agents, this is a practical option worth testing and benchmarking.
