
# Audio Generation

Generate audio with AI models using `generateAudio()`.

Use `generateAudio()` to create AI-generated speech from text using text-to-speech models.

## Basic Usage

```ts
import { compose, generateAudio, audioModel } from "@synthome/sdk";

const execution = await compose(
  generateAudio({
    model: audioModel("elevenlabs/turbo-v2.5", "elevenlabs"),
    text: "Welcome to Synthome, the composable AI media toolkit.",
    voiceId: "EXAVITQu4vr4xnSDxMaL",
  }),
).execute();

console.log(execution.result?.url);
```

## Options

Options vary by model. Here are common options:

### ElevenLabs

| Option | Type | Description |
| --- | --- | --- |
| `model` | `AudioModel` | The audio model to use |
| `text` | `string` | Text to convert to speech |
| `voiceId` | `string` | ElevenLabs voice ID |

### Hume

| Option | Type | Description |
| --- | --- | --- |
| `model` | `AudioModel` | The audio model to use |
| `text` | `string` | Text to convert to speech |
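The option shapes in the tables above can be sketched as TypeScript interfaces. These type names are illustrative assumptions for explanation, not the SDK's actual exported types (the real `model` field takes an `AudioModel` handle from `audioModel()`, simplified to a string here):

```ts
// Illustrative option shapes for the tables above (names are assumptions,
// not the SDK's actual exports).
interface ElevenLabsAudioOptions {
  model: string;   // stands in for the SDK's AudioModel handle
  text: string;    // text to convert to speech
  voiceId: string; // ElevenLabs voice ID
}

interface HumeAudioOptions {
  model: string; // stands in for the SDK's AudioModel handle
  text: string;  // text to convert to speech
}

const elevenLabsExample: ElevenLabsAudioOptions = {
  model: "elevenlabs/turbo-v2.5",
  text: "Hello!",
  voiceId: "EXAVITQu4vr4xnSDxMaL",
};

const humeExample: HumeAudioOptions = {
  model: "hume/tts",
  text: "Hello!",
};
```

Note that `voiceId` is required for ElevenLabs but absent for Hume, which infers voice characteristics from the text itself.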

## ElevenLabs TTS

Generate speech with ElevenLabs voices:

```ts
const execution = await compose(
  generateAudio({
    model: audioModel("elevenlabs/turbo-v2.5", "elevenlabs"),
    text: "Hello! This is a test of the ElevenLabs text-to-speech system.",
    voiceId: "EXAVITQu4vr4xnSDxMaL", // Sarah voice
  }),
).execute();
```
| Voice | ID | Description |
| --- | --- | --- |
| Sarah | `EXAVITQu4vr4xnSDxMaL` | Soft, friendly female |
| Rachel | `21m00Tcm4TlvDq8ikWAM` | Calm, professional female |
| Adam | `pNInz6obpgDQGcFmaJgB` | Deep, authoritative male |
| Josh | `TxGEqnHWrfWFTfGW9XjX` | Conversational male |

Find more voices in the ElevenLabs Voice Library.
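If you reference these voices often, a small lookup map keeps the opaque IDs out of call sites. This is a convenience sketch, not part of the SDK; the IDs come from the table above:

```ts
// Friendly voice names mapped to the ElevenLabs voice IDs listed above.
const ELEVENLABS_VOICES = {
  sarah: "EXAVITQu4vr4xnSDxMaL",  // soft, friendly female
  rachel: "21m00Tcm4TlvDq8ikWAM", // calm, professional female
  adam: "pNInz6obpgDQGcFmaJgB",   // deep, authoritative male
  josh: "TxGEqnHWrfWFTfGW9XjX",   // conversational male
} as const;

type VoiceName = keyof typeof ELEVENLABS_VOICES;

// Resolve a friendly name to the ID expected by the `voiceId` option.
function voiceId(name: VoiceName): string {
  return ELEVENLABS_VOICES[name];
}
```

You could then pass `voiceId("sarah")` instead of the raw ID, and the `VoiceName` type catches typos at compile time.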

## Hume TTS

Generate emotionally expressive speech with Hume:

```ts
const execution = await compose(
  generateAudio({
    model: audioModel("hume/tts", "hume"),
    text: "I'm so excited to share this news with you!",
  }),
).execute();
```

Hume automatically detects emotion from the text and adjusts the voice accordingly.

## Using with Video Generation

Generate audio and use it for lip-sync video:

```ts
const execution = await compose(
  generateVideo({
    model: videoModel("veed/fabric-1.0", "fal"),
    prompt: "A professional presenter",
    image: "https://example.com/portrait.jpg",
    audio: generateAudio({
      model: audioModel("elevenlabs/turbo-v2.5", "elevenlabs"),
      text: "Welcome to our product demonstration. Today I'll show you...",
      voiceId: "EXAVITQu4vr4xnSDxMaL",
    }),
  }),
).execute();
```

The audio is generated first, then passed to the video model for lip-sync.
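Conceptually, nesting one operation inside another creates a dependency: inner results must exist before the outer step runs. A minimal synchronous sketch of that idea (not the SDK's real internals, which are asynchronous; the `Step` shape and `resolveStep` function are illustrative):

```ts
// A step is either a ready asset URL or a nested generation whose
// inputs must be resolved before it runs.
type Step =
  | { kind: "url"; url: string }
  | { kind: "generate"; inputs?: Step[]; produce: () => string };

// Resolve inputs depth-first, then run the step itself: a nested
// audio step finishes before the video step that consumes it.
function resolveStep(step: Step): string {
  if (step.kind === "url") return step.url;
  (step.inputs ?? []).forEach(resolveStep);
  return step.produce();
}

// Track execution order to show audio resolving before video.
const order: string[] = [];
const audioStep: Step = {
  kind: "generate",
  produce: () => { order.push("audio"); return "audio.mp3"; },
};
const videoStep: Step = {
  kind: "generate",
  inputs: [audioStep],
  produce: () => { order.push("video"); return "video.mp4"; },
};
```

Resolving `videoStep` pushes `"audio"` to `order` before `"video"`, mirroring the audio-then-video sequencing described above.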

## Using with Merge

Add generated audio as a voiceover:

```ts
const execution = await compose(
  merge([
    "https://example.com/video.mp4",
    {
      url: generateAudio({
        model: audioModel("elevenlabs/turbo-v2.5", "elevenlabs"),
        text: "This is the voiceover for the video.",
        voiceId: "EXAVITQu4vr4xnSDxMaL",
      }),
      offset: 2, // Start 2 seconds into the video
      volume: 0.8,
    },
  ]),
).execute();
```
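The `offset` and `volume` fields can be pictured as simple timeline math: the overlay's samples start `offset` seconds into the base track, scaled by a linear `volume` gain. A toy sketch of that mixing model (an assumption for illustration, not the SDK's actual mixer):

```ts
// Mix an overlay track into a base track starting at a sample offset,
// scaling each overlay sample by a linear volume gain.
// Tracks are arrays of PCM-style sample values.
function mix(
  base: number[],
  overlay: number[],
  offsetSamples: number,
  volume: number,
): number[] {
  const out = base.slice();
  overlay.forEach((sample, i) => {
    const pos = offsetSamples + i;
    if (pos < out.length) out[pos] += sample * volume;
  });
  return out;
}
```

So with `offset: 2` and `volume: 0.8`, the voiceover begins two seconds in, at 80% of its original loudness, while the base video's audio is untouched before that point.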

## Available Models

| Model | Provider | Features |
| --- | --- | --- |
| `elevenlabs/turbo-v2.5` | elevenlabs, replicate | Fast TTS, voice selection |
| `hume/tts` | hume | Emotionally expressive TTS |

## Transcription Models

For speech-to-text (transcription), use these models with the `captions()` operation:

| Model | Provider | Features |
| --- | --- | --- |
| `openai/whisper` | replicate | Sentence-level timestamps |
| `vaibhavs10/incredibly-fast-whisper` | replicate | Word-level timestamps, fast |
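The two granularities differ in what each timestamp covers: sentence-level gives one time range per sentence, while word-level gives one per word, which enables word-by-word caption highlighting. Illustrative result shapes (assumed for explanation; the actual fields returned by `captions()` may differ):

```ts
// Sentence-level: one time range per sentence (e.g. openai/whisper).
interface SentenceSegment {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Word-level: one time range per word
// (e.g. vaibhavs10/incredibly-fast-whisper).
interface WordSegment {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

const sentence: SentenceSegment = {
  text: "Welcome to Synthome.",
  start: 0,
  end: 1.6,
};
const words: WordSegment[] = [
  { word: "Welcome", start: 0, end: 0.5 },
  { word: "to", start: 0.5, end: 0.7 },
  { word: "Synthome.", start: 0.7, end: 1.6 },
];
```

The word segments partition the same span the sentence segment covers, just at a finer resolution.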

## Next Steps
