How to use the command 'piper' (with examples)
Piper is a fast, local neural text-to-speech (TTS) system that converts written text into speech using trained voice models. It produces natural-sounding speech in many languages and voices, and exposes options for choosing a speaker, adjusting speaking rate, and controlling pauses. Typical uses range from simple text narration to interactive voice applications. This article walks through several common ways to use the piper command-line tool.
Use case 1: Output a WAV file using a text-to-speech model
Code:
echo Thing to say | piper -m path/to/model.onnx -f outputfile.wav
Motivation:
This use case demonstrates a basic application where you want to convert text directly into a WAV audio file using a specific text-to-speech model. Generating an audio file can be particularly useful for applications like creating downloadable content, podcast intros, or integrating voice responses into apps.
Explanation:
echo Thing to say: The echo command outputs “Thing to say,” which serves as the input text that Piper will convert into speech.
| piper: The pipe (|) passes the output from echo as input text to Piper.
-m path/to/model.onnx: Specifies the model file (in ONNX format) that contains the neural network architecture and learned weights for generating speech.
-f outputfile.wav: Specifies the filename for the output WAV file where the synthesized speech will be stored.
Example Output:
An audio file, outputfile.wav, is generated containing the spoken equivalent of “Thing to say” using the specified model.
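The same pattern extends to several pieces of text. The following is a minimal sketch, not an official Piper workflow: the MODEL path and the line_N.wav naming scheme are placeholders chosen for illustration, and synthesis is skipped when piper or the model file is not found.

```shell
# Sketch: turn each non-empty line of input into its own WAV file.
# MODEL and the line_N.wav naming scheme are illustrative placeholders.
MODEL="path/to/model.onnx"

n=0
outputs=""
while IFS= read -r line; do
  [ -n "$line" ] || continue                  # skip empty lines
  n=$((n + 1))
  out="line_${n}.wav"
  outputs="${outputs}${outputs:+ }${out}"     # space-separated list of planned files
  if command -v piper >/dev/null 2>&1 && [ -f "$MODEL" ]; then
    # piper reads its text from the pipe, so it does not consume the loop's input
    printf '%s\n' "$line" | piper -m "$MODEL" -f "$out"
  else
    echo "skipping synthesis (piper or model not found): $out"
  fi
done <<'EOF'
Thing to say
Another thing to say
EOF

echo "planned $n file(s): $outputs"
```

Replacing the here-document with a redirect from a text file (done < sentences.txt, a hypothetical file name) processes one file per line in the same way.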
Use case 2: Output a WAV file using a model and specifying its JSON config file
Code:
echo 'Thing to say' | piper -m path/to/model.onnx -c path/to/model.onnx.json -f outputfile.wav
Motivation:
Each Piper voice comes with a JSON configuration file that describes how the model should be used, including details such as the audio sample rate and, for multi-speaker voices, the speaker mapping. By default Piper looks for this file next to the model, under the model's filename with .json appended. Passing it explicitly with -c is therefore most useful when the configuration is stored in a different location or under a different name.
Explanation:
echo 'Thing to say': Outputs “Thing to say” as input text for Piper.
| piper: Directs the echo output to Piper.
-m path/to/model.onnx: Points to the ONNX model file used for speech synthesis.
-c path/to/model.onnx.json: Specifies the JSON configuration file that accompanies the model.
-f outputfile.wav: Defines the output file for the resultant WAV audio.
Example Output:
The result is an audio file, outputfile.wav, with speech synthesized according to the model and its JSON configuration.
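Because the command depends on two files, it can help to check that both exist before calling Piper. This is a rough sketch under the assumption that the config follows the model-name-plus-.json convention used in the example above; the paths are placeholders.

```shell
# Sketch: verify that the model and its config are present before synthesizing.
# Both paths are placeholders; CONFIG assumes the "<model>.json" naming convention.
MODEL="path/to/model.onnx"
CONFIG="${MODEL}.json"

if [ -f "$MODEL" ] && [ -f "$CONFIG" ] && command -v piper >/dev/null 2>&1; then
  echo 'Thing to say' | piper -m "$MODEL" -c "$CONFIG" -f outputfile.wav
else
  echo "not running piper; expected model at $MODEL and config at $CONFIG"
fi
```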
Use case 3: Select a particular speaker in a voice with multiple speakers by specifying the speaker’s ID number
Code:
echo 'Warum?' | piper -m de_DE-thorsten_emotional-medium.onnx --speaker 1 -f angry.wav
Motivation:
Some Piper voices bundle multiple speakers in a single model, which suits multi-character applications such as audiobooks or interactive voice systems. Selecting a specific speaker ID ensures that the synthesized speech uses the intended voice, which matters when a character or emotional tone must stay consistent.
Explanation:
echo 'Warum?': Provides “Warum?” as the input text to Piper.
| piper: Passes the input from echo into Piper for processing.
-m de_DE-thorsten_emotional-medium.onnx: Designates the specific model file designed for German speech with emotional tones.
--speaker 1: Chooses speaker ID 1 within a multi-speaker model to create a voice with distinct characteristics.
-f angry.wav: Defines the output WAV file where the audio will be saved.
Example Output:
The resulting angry.wav file contains “Warum?” spoken by the speaker with ID 1 in the specified model, potentially reflecting an emotional tone. Speaker IDs are numbered from 0, so ID 1 is not the model's first speaker.
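To compare the speakers in a multi-speaker voice, the same text can be rendered once per ID. The sketch below is illustrative only: the ID list 0 1 2 is a hypothetical subset, and the valid IDs for a given voice should be taken from its JSON configuration. Synthesis is skipped when piper or the model file is missing.

```shell
# Sketch: render the same text with several speaker IDs for comparison.
# SPEAKER_IDS is a hypothetical subset; consult the model's JSON config for real IDs.
MODEL="de_DE-thorsten_emotional-medium.onnx"
SPEAKER_IDS="0 1 2"

count=0
for id in $SPEAKER_IDS; do
  out="warum_speaker_${id}.wav"
  count=$((count + 1))
  if command -v piper >/dev/null 2>&1 && [ -f "$MODEL" ]; then
    echo 'Warum?' | piper -m "$MODEL" --speaker "$id" -f "$out"
  else
    echo "would write $out with --speaker $id"
  fi
done

echo "processed $count speaker ID(s)"
```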
Use case 4: Stream the output to the MPV media player
Code:
echo 'Hello world' | piper -m en_GB-northern_english_male-medium.onnx --output-raw -f - | mpv -
Motivation:
Streaming audio output directly to a media player like MPV is advantageous for real-time text-to-speech applications where immediate playback is desired. This use case could be instrumental for live demonstrations, voice alerts, or interactive presentations, eliminating the need to first store the output as a file.
Explanation:
echo 'Hello world': Sets “Hello world” as the input text for Piper.
| piper: Routes the echo output into Piper.
-m en_GB-northern_english_male-medium.onnx: Selects a British English voice model (Northern English, male).
--output-raw: Requests raw audio output (PCM samples without a WAV header), which suits streaming.
-f -: Indicates that the output should be sent to standard output (-) rather than a file.
| mpv -: Streams the audio directly to MPV, which reads it from standard input for immediate playback.
Example Output:
The voice output of “Hello world” can be heard immediately through the MPV media player, providing live-streamed audio based on the defined model.
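On Linux systems with ALSA, raw output can also be piped to aplay, which needs to be told the sample format because raw audio carries no header. This is a sketch under assumptions: the 22050 Hz sample rate is assumed for this voice and should be confirmed against the model's JSON configuration, and nothing is played if piper, aplay, or the model file is missing.

```shell
# Sketch: play Piper's raw output through ALSA's aplay.
# SAMPLE_RATE is an assumption; confirm it in the model's JSON config.
MODEL="en_GB-northern_english_male-medium.onnx"
SAMPLE_RATE=22050

if command -v piper >/dev/null 2>&1 && command -v aplay >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  # -f S16_LE: signed 16-bit little-endian samples; -t raw: headerless input
  echo 'Hello world' | piper -m "$MODEL" --output-raw | aplay -r "$SAMPLE_RATE" -f S16_LE -t raw -
else
  echo "piper, aplay, or the model is missing; nothing played"
fi
```

A mismatched sample rate does not produce an error; the speech simply plays too fast or too slow.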
Use case 5: Speak twice as fast, with huge gaps between sentences
Code:
echo 'Speaking twice the speed. With added drama!' | piper -m foo.onnx --length_scale 0.5 --sentence_silence 2 -f drama.wav
Motivation:
Modifying the speaking rate and introducing dramatic pauses between sentences can be valuable in content that benefits from maximum impact, such as dramatic readings, storytelling, or public speaking applications. This allows for emphasis and suspense, enhancing the auditory engagement of the audience.
Explanation:
echo 'Speaking twice the speed. With added drama!': Places the phrase “Speaking twice the speed. With added drama!” as input text.
| piper: Transfers the content for processing by Piper.
-m foo.onnx: Points to the model file used for speech synthesis.
--length_scale 0.5: Scales the duration of the speech to half its usual length, which doubles the speaking speed.
--sentence_silence 2: Inserts a two-second pause between sentences, applying dramatic breaks in speech delivery.
-f drama.wav: Denotes the filename for the resultant audio file.
Example Output:
The audio file named drama.wav is produced, featuring a rapid rendition of text with pronounced pauses between sentences, creating an effect suitable for dramatic content.
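Since the length scale is the inverse of the desired speed-up (0.5 for twice as fast), it can be derived from a speed factor instead of being worked out by hand. A minimal sketch, where SPEED and PAUSE_SECONDS are illustrative variable names and foo.onnx is the placeholder model from the example above:

```shell
# Sketch: derive --length_scale from a desired speed factor.
# SPEED and PAUSE_SECONDS are illustrative names; foo.onnx is a placeholder model.
MODEL="foo.onnx"
SPEED=2            # 2 means twice as fast
PAUSE_SECONDS=2    # silence inserted between sentences

# length_scale = 1 / speed; LC_ALL=C keeps the decimal point a dot
LENGTH_SCALE=$(LC_ALL=C awk -v s="$SPEED" 'BEGIN { printf "%.2f", 1 / s }')
echo "length_scale=$LENGTH_SCALE sentence_silence=$PAUSE_SECONDS"

if command -v piper >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  echo 'Speaking twice the speed. With added drama!' |
    piper -m "$MODEL" --length_scale "$LENGTH_SCALE" --sentence_silence "$PAUSE_SECONDS" -f drama.wav
else
  echo "piper or model not found; skipping synthesis"
fi
```

Values of SPEED below 1 give a length scale above 1, which slows the speech down.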
Conclusion:
The Piper command-line tool provides a flexible solution for text-to-speech conversion. The examples above show how to select a model, supply its configuration, pick a speaker in a multi-speaker voice, stream audio to a player, and adjust speaking rate and pauses, which together cover most everyday uses of Piper.