Captions

Add auto-generated or custom subtitles to videos

captions()

Add captions to videos with automatic transcription or custom timing.

import { compose, captions, audioModel } from "@synthome/sdk";

const execution = await compose(
  captions({
    video: "https://example.com/video.mp4",
    model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  }),
).execute();

Auto-Generated Captions

Use a transcription model to automatically generate captions:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
});

Available Transcription Models

| Model | Provider | Speed | Notes |
| --- | --- | --- | --- |
| vaibhavs10/incredibly-fast-whisper | replicate | Very fast | Recommended for most |
| openai/whisper | replicate | Standard | Original Whisper model |
// Standard Whisper
captions({
  video: "https://example.com/video.mp4",
  model: audioModel("openai/whisper", "replicate"),
});

Transcription Correction

When using TTS-generated audio, transcription models like Whisper often misrecognize brand names, technical terms, or uncommon words. Use originalText to automatically correct these errors:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  originalText: "Synthome makes video editing easy with AI-powered tools.",
});

How It Works

  1. Whisper transcribes the audio and returns word-level timestamps
  2. The transcription is compared against your original text using AI
  3. Misrecognized words are corrected while preserving the original timestamps
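The steps above can be sketched as a positional alignment that swaps in words from the original text while keeping the transcribed timestamps. This is a simplified illustration of the idea, not the SDK's actual implementation (which uses an AI model to align the two texts robustly):

```typescript
interface CaptionWord {
  word: string;
  start: number;
  end: number;
}

// Simplified, position-based correction: when the transcription and the
// original text have the same word count, replace each transcribed word
// with the original word while preserving the transcribed timestamps.
function correctTranscript(
  transcribed: CaptionWord[],
  originalText: string,
): CaptionWord[] {
  const originalWords = originalText.split(/\s+/).filter(Boolean);
  if (originalWords.length !== transcribed.length) {
    // Word counts diverge: fall back to the raw transcription.
    return transcribed;
  }
  return transcribed.map((w, i) => ({
    word: originalWords[i],
    start: w.start, // timestamps are kept as-is
    end: w.end,
  }));
}
```

For example, a transcription of `[{ word: "Sintom", start: 0, end: 0.4 }]` corrected against the original text `"Synthome"` keeps the 0–0.4s timing but emits the correctly spelled word.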

Common Use Cases

Brand name correction:

// Without originalText: "Sintom" or "Sin Thome"
// With originalText: "Synthome" (correctly spelled)
captions({
  video: ttsGeneratedVideo,
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  originalText: "Synthome is the best video platform.",
});

Technical terms:

captions({
  video: productDemo,
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  originalText: "Configure your Kubernetes cluster with kubectl apply.",
});

Multi-language Support

The correction works with any language since it uses AI to match the transcription against the original text:

captions({
  video: frenchVideo,
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  originalText: "Bienvenue sur notre plateforme Synthome.",
});

The originalText parameter requires an OpenAI API key configured in your integrations. The correction uses GPT-4o-mini for fast, accurate results.

Custom Captions

Provide your own word-level timing:

captions({
  video: "https://example.com/video.mp4",
  captions: [
    { word: "Hello", start: 0.0, end: 0.5 },
    { word: "world", start: 0.5, end: 1.0 },
    { word: "this", start: 1.2, end: 1.4 },
    { word: "is", start: 1.4, end: 1.6 },
    { word: "a", start: 1.6, end: 1.7 },
    { word: "video", start: 1.7, end: 2.2 },
  ],
});
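Writing word-level timing by hand is tedious for longer scripts. If you already know the audio's total duration, you can approximate timing by spreading the words evenly; `evenCaptions` below is a hypothetical helper of our own, not part of the SDK:

```typescript
interface CaptionWord {
  word: string;
  start: number;
  end: number;
}

// Hypothetical helper (not an SDK function): spread the words of a sentence
// evenly across a known duration to get rough word-level timing.
function evenCaptions(text: string, durationSec: number): CaptionWord[] {
  const words = text.split(/\s+/).filter(Boolean);
  const slot = durationSec / words.length;
  return words.map((word, i) => ({
    word,
    start: Number((i * slot).toFixed(3)),
    end: Number(((i + 1) * slot).toFixed(3)),
  }));
}
```

The result can be passed directly as the `captions` array. For accurate timing, prefer a transcription model over even spacing.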

Caption Styles

Style Presets

Use built-in presets for popular platforms:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: { preset: "tiktok" },
});

| Preset | Description |
| --- | --- |
| tiktok | Bold, centered, mobile-optimized |
| youtube | Clean, bottom-positioned |
| story | Vertical video friendly |
| minimal | Subtle, unobtrusive |
| cinematic | Film-style subtitles |

Custom Font Styling

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: {
    fontFamily: "Arial",
    fontSize: 48,
    fontWeight: "bold",
    letterSpacing: 2,
    color: "#FFFFFF",
    outlineColor: "#000000",
    outlineWidth: 2,
  },
});

Background Styling

Add a background box behind your captions:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: {
    fontFamily: "Arial",
    fontSize: 48,
    fontWeight: "normal",
    color: "#FFFFFF",
    backgroundColor: "#000000",
    padding: 20,
  },
});

When backgroundColor is set, an opaque box is automatically added behind the text. Use padding to control the space between the text and the box edges.

Positioning

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: {
    alignment: "center",
    marginV: 50, // Vertical margin from bottom
    marginL: 20, // Left margin
    marginR: 20, // Right margin
  },
});

Word Highlighting

Highlight the currently spoken word:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: {
    highlightActiveWord: true,
    activeWordColor: "#FFFF00", // Yellow highlight
    inactiveWordColor: "#FFFFFF", // White for other words
  },
});

Animation Styles

style: {
  highlightActiveWord: true,
  animationStyle: "color",  // Options: "none", "color", "scale", "glow"
  activeWordScale: 1.2,     // Scale up active word
}

Caption Behavior

Control how captions are grouped and displayed:

captions({
  video: "https://example.com/video.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  style: {
    wordsPerCaption: 5, // Show 5 words at a time
    maxCaptionDuration: 3, // Max 3 seconds per caption
    maxCaptionChars: 40, // Max 40 characters per line
  },
});
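To see how these three limits might interact, here is a rough sketch (our own illustration, not the SDK's code): a word starts a new caption whenever adding it would exceed the word count, the character budget, or the maximum on-screen duration.

```typescript
interface CaptionWord { word: string; start: number; end: number; }
interface CaptionGroup { text: string; start: number; end: number; }

// Illustrative grouping logic: flush the current caption when any limit
// would be exceeded by the next word.
function groupCaptions(
  words: CaptionWord[],
  opts: { wordsPerCaption: number; maxCaptionDuration: number; maxCaptionChars: number },
): CaptionGroup[] {
  const groups: CaptionGroup[] = [];
  let current: CaptionWord[] = [];
  const flush = () => {
    if (current.length === 0) return;
    groups.push({
      text: current.map((w) => w.word).join(" "),
      start: current[0].start,
      end: current[current.length - 1].end,
    });
    current = [];
  };
  for (const w of words) {
    const chars = current.map((c) => c.word).join(" ").length + 1 + w.word.length;
    const tooMany = current.length >= opts.wordsPerCaption;
    const tooLong = current.length > 0 && w.end - current[0].start > opts.maxCaptionDuration;
    const tooWide = current.length > 0 && chars > opts.maxCaptionChars;
    if (tooMany || tooLong || tooWide) flush();
    current.push(w);
  }
  flush();
  return groups;
}
```

With `wordsPerCaption: 2`, three words become two captions: the first two words together, then the third on its own.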

With Generated Videos

Caption a Generated Video

import {
  compose,
  captions,
  generateVideo,
  videoModel,
  audioModel,
} from "@synthome/sdk";

const execution = await compose(
  captions({
    video: generateVideo({
      model: videoModel("bytedance/seedance-1-pro", "replicate"),
      prompt: "Person giving a presentation",
    }),
    model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
    style: { preset: "youtube" },
  }),
).execute();

Caption After Merge

import {
  compose,
  captions,
  merge,
  generateVideo,
  videoModel,
  audioModel,
} from "@synthome/sdk";

const execution = await compose(
  captions({
    video: merge([
      generateVideo({
        model: videoModel("bytedance/seedance-1-pro", "replicate"),
        prompt: "Scene 1",
      }),
      generateVideo({
        model: videoModel("bytedance/seedance-1-pro", "replicate"),
        prompt: "Scene 2",
      }),
    ]),
    model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  }),
).execute();

This pipeline:

  1. Generates two videos in parallel
  2. Merges them into one
  3. Transcribes and adds captions

Reusing Transcripts

When you need the same transcript for multiple operations (e.g., captions and position keyframes), use the transcribe() function to create a reusable transcript:

import {
  compose,
  transcribe,
  captions,
  generatePositionKeyframes,
  layers,
  audioModel,
} from "@synthome/sdk";

// Create a reusable transcript
const transcript = transcribe({
  video: "https://example.com/speaking-head.mp4",
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
});

// Use the same transcript for captions
const captionedVideo = captions({
  video: "https://example.com/speaking-head.mp4",
  transcript: transcript, // Reuse the transcript
  style: { preset: "tiktok" },
});

// And for position keyframes
const positions = generatePositionKeyframes({
  timestamps: transcript, // Same transcript
  positions: ["w-2/3 bottom-left", "w-2/3 bottom", "w-2/3 bottom-right"],
});

When to Use Transcript

  • Multiple uses: When you need timestamps for both captions and position keyframes
  • Audio-first workflow: When transcribing generated audio before video creation
  • Pipeline optimization: Avoids duplicate transcription jobs

Transcribing Audio Directly

You can transcribe audio files directly without a video:

import { transcribe, generateAudio, audioModel } from "@synthome/sdk";

// Transcribe generated TTS audio
const transcript = transcribe({
  audio: generateAudio({
    model: audioModel("elevenlabs/turbo-v2.5", "elevenlabs"),
    text: "Welcome to our video!",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  }),
  model: audioModel("vaibhavs10/incredibly-fast-whisper", "replicate"),
  originalText: "Welcome to our video!", // Correct transcription
});

This is faster than transcribing from video since no audio extraction is needed.

Full Style Reference

CaptionStyle

| Property | Type | Description |
| --- | --- | --- |
| preset | string | Style preset (tiktok, youtube, etc.) |
| fontFamily | string | Font name |
| fontSize | number | Font size in pixels |
| fontWeight | string \| number | Font weight ("normal", "bold", 400, 700) |
| letterSpacing | number | Letter spacing in pixels |
| color | string | Text color (hex) |
| outlineColor | string | Outline color (hex) |
| backgroundColor | string | Background box color (hex) |
| padding | number | Padding around text in background box |
| borderStyle | number | 1 = outline only, 3 = opaque background box |
| outlineWidth | number | Outline width in pixels (when borderStyle is 1) |
| shadowDistance | number | Shadow offset |
| alignment | string | Text alignment |
| marginV | number | Vertical margin |
| marginL | number | Left margin |
| marginR | number | Right margin |
| wordsPerCaption | number | Words shown at once |
| maxCaptionDuration | number | Max seconds per caption |
| maxCaptionChars | number | Max characters per caption |
| highlightActiveWord | boolean | Enable word highlighting |
| activeWordColor | string | Color for active word |
| inactiveWordColor | string | Color for inactive words |
| activeWordScale | number | Scale multiplier for active word |
| animationStyle | string | Animation: none, color, scale, glow |

API Reference

captions(options)

| Parameter | Type | Description |
| --- | --- | --- |
| options | CaptionsOptions | Caption configuration |

CaptionsOptions

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| video | string \| VideoOperation | Yes | Video URL or generated video |
| model | AudioModel | * | Transcription model |
| captions | CaptionWord[] | * | Custom word-level captions |
| transcript | TranscribeOperation \| string | * | Pre-created transcript (reusable) |
| originalText | string | No | Original text for transcription correction |
| style | CaptionStyle | No | Styling options |

* One of model, captions, or transcript is required.

CaptionWord

| Property | Type | Description |
| --- | --- | --- |
| word | string | The word text |
| start | number | Start time in seconds |
| end | number | End time in seconds |
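When supplying custom captions, each word should have a positive duration and words should appear in chronological, non-overlapping order. A small validator for that contract might look like this (our own helper, not an SDK function):

```typescript
interface CaptionWord { word: string; start: number; end: number; }

// Checks that every word has a positive duration and does not overlap
// the previous word. Gaps between words are allowed.
function validateCaptionWords(words: CaptionWord[]): boolean {
  return words.every((w, i) => {
    if (w.start >= w.end) return false;                     // zero or negative duration
    if (i > 0 && w.start < words[i - 1].end) return false;  // overlaps previous word
    return true;
  });
}
```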
