In recent years, text-to-speech (TTS) models have advanced enormously. Month after month, new releases improved pronunciation, prosody, and overall audio quality. Unfortunately, most of those models were designed for English, or at best supported only a handful of languages.

The tools available for generating Basque voices used to be quite limited: paid proprietary systems (such as Elhuyar’s neural TTS, developed by Orai), or older robotic voices that had clearly fallen behind.

Fortunately, this landscape has started to change over the last few months.

Maider and Antton, models released by the HiTZ center

The HiTZ center has released its own TTS models. As part of the ILENIA project, it published two voices: Antton (male) and Maider (female). These models can be run locally on your own computer or, if you prefer, used directly through Aholab’s website without installing anything.

Basque voices in Piper TTS

Urtzi Odriozola adapted the two voices released by HiTZ for Piper TTS, a lightweight and fast text-to-speech engine. Thanks to this, we can generate synthetic voices quickly even on modest devices.
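As a rough illustration of what running one of these voices locally can look like, here is a minimal sketch that calls the Piper command-line tool from Python. It assumes the `piper` executable is installed and on your PATH, and that it accepts text on standard input together with the `--model` and `--output_file` options, as in Piper's documented command-line usage. The model file name `maider.onnx` is a hypothetical placeholder for wherever you saved the downloaded voice.

```python
import subprocess


def build_piper_command(model_path, output_path):
    """Build the argument list for a Piper CLI call (flags assumed from Piper's docs)."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def synthesize(text, model_path, output_path):
    """Send `text` to Piper on stdin and write the result to a WAV file.

    Requires the `piper` executable to be installed; raises CalledProcessError
    if Piper exits with an error.
    """
    subprocess.run(
        build_piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,
    )


# Example call (hypothetical model file name):
# synthesize("Kaixo, mundua!", "maider.onnx", "kaixo.wav")
```

Keeping the command construction in its own function makes it easy to check or adjust the flags if your Piper version expects different ones.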

If you would rather not install anything, I also built a website that synthesizes these voices locally in the browser using WebAssembly.

OmniVoice (Xiaomi)

The best-known TTS models that became popular in recent years (Coqui, Kokoro, StyleTTS2, VibeVoice, …) share several common features:

  • Voice cloning: in addition to text, you provide a short reference recording and the output is generated in that voice
  • Emotion control: the ability to generate expressions like laughter, surprise, sighs, etc., via tags

And of course, more “realistic” voices that are increasingly hard to distinguish from real recordings.

Well, surprise: Xiaomi Corp. recently published OmniVoice, a model that checks all those boxes and can synthesize more than 600 languages.

In the tests I ran, it performs very well in Basque, with highly natural output. It also supports emotion control, for example by adding the [laughter] tag to inject laughter. In practice, this emotion control is not yet very reliable in Basque, but it still feels like an important step forward.

"[surprise-wa] this is an audio generated from a Basque text!"

On the other hand, although it can run on a regular computer, inference is relatively heavy and slow. Generating audio for even a short, simple sentence can take around 5 minutes. So in practice, it will likely be more common to run this kind of model on machines or servers with a GPU.

As for voice cloning, based on my tests I recommend using a short reference audio clip (under 20 seconds), preferably without long stretches of silence. Generation also takes longer in this mode (more than twice as long).
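To check a reference clip against that recommendation before uploading it, a small script can measure its duration and its longest silent stretch. The sketch below uses only the Python standard library and handles 16-bit PCM WAV files only. The 20-second limit comes from the recommendation above; the silence threshold (`SILENCE_AMPLITUDE`) and the maximum allowed pause (`MAX_SILENCE_SECONDS`) are my own illustrative assumptions, not values from OmniVoice.

```python
import array
import sys
import wave

MAX_SECONDS = 20.0          # recommended upper bound for the reference clip
SILENCE_AMPLITUDE = 500     # assumption: samples below this count as silence
MAX_SILENCE_SECONDS = 1.0   # assumption: longest pause we tolerate


def inspect_reference_clip(path):
    """Return (duration_s, longest_silence_s) for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
        samples = array.array("h", wav.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    longest = current = 0
    # Walk frame by frame; a frame is silent if every channel is quiet.
    for i in range(0, len(samples), channels):
        if all(abs(s) < SILENCE_AMPLITUDE for s in samples[i:i + channels]):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return n_frames / rate, longest / rate


def is_good_reference(path):
    """True if the clip is short enough and has no overly long silence."""
    duration, silence = inspect_reference_clip(path)
    return duration < MAX_SECONDS and silence <= MAX_SILENCE_SECONDS
```

Calling `is_good_reference("my_voice.wav")` (a hypothetical file name) returns `True` or `False`; `inspect_reference_clip` gives the underlying numbers if you want to see why a clip was rejected.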

For now, the OmniVoice demo site can be used to run tests.

Conclusions

When we need instant synthesis, and especially when it has to run locally on almost any device, the Basque Piper TTS models are now a practical option. They can be a great companion to the Basque Parakeet speech recognition model.

On the other hand, when we need high-quality Basque voice generation with greater control over the output, OmniVoice looks like a very interesting option. That said, for this use case you will likely need a computer or server with a GPU.