HeyGen API Guide: 6 APIs for AI Video and Voice Creation

Summary

Explore the six core HeyGen APIs and learn when to use Video Generation, Video Agent, Template, Translation, Proofread, and Text to Speech for scalable AI video workflows.

The HeyGen API is a suite of REST endpoints for generating, translating, and voicing AI video without cameras, studios, or editing teams. It is organized into six core APIs: Video Generation, Video Agent, Template, Video Translation, Proofread, and Text to Speech. This guide explains what each API does, which endpoint it calls, which parameters matter, and the jobs it is built for, so you can pick the right one for your use case.

If you are evaluating HeyGen for programmatic video, the short version is this: use Video Generation or Video Agent to create a video; the Template API to scale variations of it; Video Translation and Proofread to localize it; and Text to Speech when you only need the audio.

The 6 HeyGen APIs at a glance

| API | What it does | Endpoint | Best for |
| --- | --- | --- | --- |
| Video Generation | Creates an avatar-led video from a script or audio file | POST /v3/videos | Onboarding, L&D, product how-tos |
| Video Agent | Turns a text prompt into a finished video end to end | POST /v3/video-agents | Wiki or knowledge base to video, fast drafts |
| Template | Generates on-brand video variations from a reusable template | POST /v2/template/{template_id}/generate | Personalized video at scale |
| Video Translation | Translates and dubs a video into 175+ languages with lip-sync | POST /v3/video-translations | Localizing launches and training |
| Proofread | Extracts an editable transcript to review before translating | POST /v3/video-translations/proofreads | Accuracy control before localization |
| Text to Speech | Synthesizes natural speech audio from text | POST /v3/voices/speech | Voiceovers, narration, audio tracks |

All six use the same authentication and async conventions, covered in the common conventions section below.

What is the HeyGen Video Generation API?

The HeyGen Video Generation API creates an avatar-led video from a text script or a pre-recorded audio file, with no camera or studio required. It is the foundational way to produce a single avatar delivering a message, and it is built to automate onboarding and L&D videos.

What it does:

Drives a HeyGen avatar (a studio avatar, digital twin, or photo avatar) with either a text script paired with a voice_id, or an uploaded audio track for lip-sync, supplied as audio_url or audio_asset_id. Script and audio are mutually exclusive.
Supports the Avatar IV and Avatar V engines. Avatar IV is the default, and you set the engine field to select Avatar V for eligible avatars. Avatar III generation uses the legacy v1 or v2 API.
Outputs at 4k, 1080p, or 720p, in aspect ratios including 16:9, 9:16, 4:5, 1:1, and auto, as either an MP4 or a WebM with a transparent background.
Adds backgrounds and background removal, burned-in or sidecar captions, and a custom watermark for select Enterprise customers.
For photo avatars on Avatar IV, it accepts a motion_prompt and an expressiveness level to control body motion.

When to use it: You have a script or audio track and want one avatar to deliver it programmatically, at volume, for onboarding, training, or product walkthroughs.

Endpoint: POST /v3/videos
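
To make the script-or-audio rule concrete, here is a minimal sketch of a helper that assembles a request body for POST /v3/videos and rejects invalid combinations before anything is sent. The names voice_id, audio_url, audio_asset_id, and engine come from this guide; avatar_id and script as body field names, and the flat body shape, are assumptions for illustration, so check the official schema before relying on them.

```python
from typing import Any, Dict, Optional

VIDEOS_ENDPOINT = "/v3/videos"  # endpoint named in this guide


def build_video_payload(
    avatar_id: str,
    script: Optional[str] = None,
    voice_id: Optional[str] = None,
    audio_url: Optional[str] = None,
    audio_asset_id: Optional[str] = None,
    engine: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a request body for POST /v3/videos.

    'avatar_id' and 'script' are assumed field names, and the flat layout
    is hypothetical; only the validation rules come from the guide.
    """
    has_audio = audio_url is not None or audio_asset_id is not None
    if script is not None and has_audio:
        raise ValueError("script and audio are mutually exclusive")
    if script is None and not has_audio:
        raise ValueError("provide either a script or an audio source")
    if script is not None and voice_id is None:
        raise ValueError("a script must be paired with a voice_id")

    payload: Dict[str, Any] = {"avatar_id": avatar_id}
    if script is not None:
        payload["script"] = script
        payload["voice_id"] = voice_id
    if audio_url is not None:
        payload["audio_url"] = audio_url
    if audio_asset_id is not None:
        payload["audio_asset_id"] = audio_asset_id
    if engine is not None:
        # Omitting engine leaves the default (Avatar IV, per the guide).
        payload["engine"] = engine
    return payload


# Placeholder IDs, not real HeyGen identifiers.
body = build_video_payload(
    "avatar_123", script="Welcome aboard.", voice_id="voice_456"
)
```

Validating on the client side keeps obviously invalid jobs from ever reaching the API, which matters when requests are generated in bulk.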

What is the HeyGen Video Agent API?

The HeyGen Video Agent API turns a single text prompt into a finished video, handling scripting, avatar selection, scene composition, and automatic rendering. It is the fastest path from an idea or a document to a watchable first draft.

What it does:

Runs in two modes. The generate mode is one-shot and fire-and-forget: it proceeds through the storyboard automatically and produces a video. The chat mode is multi-turn: it pauses for real decisions, such as picking a voice, and allows revisions and follow-up videos.
Accepts up to 20 file attachments, so you can ground a video in an internal wiki, a product doc, or a knowledge base article.
Takes optional avatar_id, voice_id, style_id, and brand_kit_id to apply specific avatars, voices, curated visual styles, and brand colors, fonts, and logos.
Auto-detects orientation from the content when orientation is not provided.

When to use it: You want a video from a prompt or a piece of internal documentation, you want a fast first draft, or you want non-technical teammates to create a video from a brief.

Endpoint: POST /v3/video-agents
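
The options above can be sketched as a small payload builder for POST /v3/video-agents. The two mode names, the 20-attachment limit, and the optional avatar_id, voice_id, style_id, brand_kit_id, and orientation come from this guide. The field names prompt, mode, and files, and the idea that attachments are passed as a list of strings, are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

MAX_ATTACHMENTS = 20          # limit stated in this guide
MODES = ("generate", "chat")  # the two modes described in this guide


def build_agent_payload(
    prompt: str,
    mode: str = "generate",
    files: Optional[List[str]] = None,
    avatar_id: Optional[str] = None,
    voice_id: Optional[str] = None,
    style_id: Optional[str] = None,
    brand_kit_id: Optional[str] = None,
    orientation: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a body for POST /v3/video-agents.

    'prompt', 'mode' and 'files' are assumed field names.
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    files = files or []
    if len(files) > MAX_ATTACHMENTS:
        raise ValueError(f"at most {MAX_ATTACHMENTS} attachments are allowed")

    payload: Dict[str, Any] = {"prompt": prompt, "mode": mode}
    if files:
        payload["files"] = list(files)
    optional = {
        "avatar_id": avatar_id,
        "voice_id": voice_id,
        "style_id": style_id,
        "brand_kit_id": brand_kit_id,
        # Leaving orientation out lets the API auto-detect it.
        "orientation": orientation,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Placeholder values, not real asset or brand kit IDs.
draft = build_agent_payload(
    "Explain our refund policy in 60 seconds",
    files=["refund-policy.pdf"],
    brand_kit_id="brandkit_1",
)
```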

What is the HeyGen Template API?

The HeyGen Template API generates video variations from a reusable template by swapping placeholder variables. You define the avatar, voice, layout, and branding once, then produce many on-brand versions at scale.

What it does:

Replaces template placeholders through a variables map. Each variable is typed as text, image, video, audio, voice, or character, and carries a type-specific properties payload, for example, replacement copy, a media URL or asset ID, a voice_id, or a character_id for an avatar or talking photo.
Restricts a render to a subset of scenes with scene_ids, overrides output dimension and fps, and adds burned-in subtitles.
Applies a brand glossary for translation and pronunciation rules, organizes output into a folder, and can render in test mode at lower quality without deducting quota.

When to use it: You need personalized video at scale, such as account-based sales videos or localized variants of one layout, while keeping brand consistency across every render.

Endpoint: POST /v2/template/{template_id}/generate. Note this is a v2 endpoint.
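
As an illustration of the typed variables map, the sketch below builds one request per recipient for a personalized sales video. The six variable types and the endpoint path come from this guide. The variable envelope (name, type, properties), the content property key, and the test flag name are assumptions, and the template ID is a placeholder.

```python
from typing import Any, Dict, List, Mapping, Tuple

# Variable types listed in this guide.
VARIABLE_TYPES = {"text", "image", "video", "audio", "voice", "character"}


def make_variable(
    name: str, var_type: str, properties: Mapping[str, Any]
) -> Dict[str, Any]:
    """Build one typed template variable (the envelope shape is assumed)."""
    if var_type not in VARIABLE_TYPES:
        raise ValueError(f"unknown variable type: {var_type}")
    return {"name": name, "type": var_type, "properties": dict(properties)}


def build_template_request(
    template_id: str, variables: List[Dict[str, Any]], test: bool = False
) -> Tuple[str, Dict[str, Any]]:
    """Return the (path, body) pair for one template render."""
    path = f"/v2/template/{template_id}/generate"
    body = {
        "variables": {v["name"]: v for v in variables},
        "test": test,  # assumed flag name for the guide's test mode
    }
    return path, body


# One render per recipient, all from the same (placeholder) template.
recipients = ["Ada", "Grace", "Linus"]
requests_to_send = [
    build_template_request(
        "tpl_123",
        [make_variable("first_name", "text", {"content": name})],
        test=True,
    )
    for name in recipients
]
```

Rendering in test mode first is a cheap way to check that every variable lands in the right scene before spending quota on the full batch.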

What is the HeyGen Video Translation API?

The HeyGen Video Translation API translates and dubs an existing video into one or more target languages, with voice cloning and lip-sync. It localizes training and product launches in 175+ languages and dialects with 99% lip-sync accuracy.

What it does:

Returns one video_translation_id per language. Pass a single language for one translation, or several for a batch.
Offers two quality modes: speed, the default, prioritizes fast turnaround, while precision produces higher lip-sync quality using avatar inference.
Includes controls for translate_audio_only, captions, speaker separation via speaker_num, partial translation with start_time and end_time, background music removal, and speech enhancement.
Applies a brand glossary so custom terms translate correctly, for example, treating "Reformer" as the Pilates equipment rather than a political activist.

When to use it: You have a finished source video and need faithful, lip-synced versions in other markets for launches, training, or product education.

Endpoint: POST /v3/video-translations
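
A batch translation request might be assembled as in the sketch below. The mode values and the controls translate_audio_only, speaker_num, start_time, and end_time come from this guide. The field names video_url, output_languages, and mode are assumptions here, as are the language labels and the check that start_time precedes end_time.

```python
from typing import Any, Dict, List, Optional

QUALITY_MODES = ("speed", "precision")  # modes described in this guide


def build_translation_payload(
    video_url: str,
    output_languages: List[str],
    mode: str = "speed",
    translate_audio_only: bool = False,
    speaker_num: Optional[int] = None,
    start_time: Optional[float] = None,
    end_time: Optional[float] = None,
) -> Dict[str, Any]:
    """Assemble a body for POST /v3/video-translations.

    'video_url', 'output_languages' and 'mode' are assumed field names.
    """
    if not output_languages:
        raise ValueError("at least one target language is required")
    if mode not in QUALITY_MODES:
        raise ValueError(f"mode must be one of {QUALITY_MODES}")
    if start_time is not None and end_time is not None and start_time >= end_time:
        raise ValueError("start_time must be earlier than end_time")

    payload: Dict[str, Any] = {
        "video_url": video_url,
        "output_languages": list(output_languages),
        "mode": mode,
        "translate_audio_only": translate_audio_only,
    }
    # Optional controls are sent only when the caller sets them.
    for key, value in (
        ("speaker_num", speaker_num),
        ("start_time", start_time),
        ("end_time", end_time),
    ):
        if value is not None:
            payload[key] = value
    return payload


# Three languages in one request; expect one video_translation_id for each.
batch = build_translation_payload(
    "https://example.com/launch.mp4",
    ["Spanish", "German", "Japanese"],
    mode="precision",
)
```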

What is the HeyGen Proofread API?

The HeyGen Proofread API extracts editable subtitles from a video, enabling you to review and correct the transcript before final translation and rendering. It is the quality-assurance step before Video Translation.

What it does:

Creates a proofread session that surfaces the source transcript as editable subtitles, so you can fix names, jargon, brand terms, or transcription errors before any languages are produced.
Carries the same localization controls as the Translation API, including brand glossary, speaker_num, the speed and precision modes, music removal, and speech enhancement.
Accepts one or more output_languages, so you can prepare a single proofread or batch several at once.

When to use it: Transcript accuracy matters before you localize, for example, with technical terminology, regulated content, or brand names that must not be mistranslated.

Endpoint: POST /v3/video-translations/proofreads

What is the HeyGen Text to Speech API?

The HeyGen Text to Speech API synthesizes speech audio from text using a chosen voice. It is a standalone voice engine for narration and audio tracks, with strong consistency, low latency, and emotional control.

What it does:

Uses voices that support the starfish engine. Find compatible voices with GET /v3/voices?engine=starfish.
Accepts plain text or SSML markup, synthesizes up to 5,000 characters per request, and supports a speed multiplier from 0.5x to 2.0x.
Auto-detects the language from the text, or lets you set it explicitly with a language or a BCP-47 locale tag.
Returns a URL to the generated audio file along with its duration and optional word-level timestamps.

When to use it: You need a voiceover or narration track on its own, or audio that you then feed into the Video Generation API for lip-sync. The word-level timestamps are useful for captioning and precise synchronization.

Endpoint: POST /v3/voices/speech
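
Because each request is capped at 5,000 characters, longer narration has to be split across several calls. The sketch below shows one way to do that for plain text, breaking on spaces so words stay intact, and it also checks the speed range. The limits come from this guide. The field names text and speed are assumptions, and the splitter is not safe for SSML because it could cut through a tag.

```python
from typing import Any, Dict, List

MAX_CHARS = 5000                  # per-request limit stated in this guide
MIN_SPEED, MAX_SPEED = 0.5, 2.0   # speed range stated in this guide


def split_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split plain text into chunks of at most `limit` characters.

    Breaks at the last space that fits; falls back to a hard cut when a
    single run of characters is longer than the limit.
    """
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > limit:
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def build_speech_payloads(
    text: str, voice_id: str, speed: float = 1.0
) -> List[Dict[str, Any]]:
    """Build one POST /v3/voices/speech body per chunk of text."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    # 'text' and 'speed' are assumed field names.
    return [
        {"text": chunk, "voice_id": voice_id, "speed": speed}
        for chunk in split_text(text)
    ]


# Placeholder voice ID; a long script becomes several requests.
payloads = build_speech_payloads("word " * 2500, "voice_456", speed=1.1)
```

Each resulting audio file is returned separately, so the caller is responsible for concatenating them in order.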

Which HeyGen API should you use?

Match the job to the API:

I want a video of an avatar reading my script. Use the Video Generation API.
I want a video from just a prompt or a wiki article. Use the Video Agent API.
I need many on-brand variations of the same video. Use the Template API.
I have a finished video and need it in other languages. Use the Video Translation API.
I want to correct the transcript before translating. Use the Proofread API.
I only need an audio voiceover. Use the Text to Speech API.

How the HeyGen APIs work together

The six APIs are designed to chain into pipelines:

Prompt to localized video: Draft with the Video Agent API or script with the Video Generation API, run the Proofread API to verify the transcript, then dub into target languages with the Video Translation API.
Audio-first production: Generate a track with the Text to Speech API, then lip-sync an avatar to it through the Video Generation API using audio_url or audio_asset_id.
Scaled personalization: Build a layout once and render hundreds of variants with the Template API, optionally supplying voices from Text to Speech.

They also share building blocks. Brand glossary IDs are reused across Translation, Proofread, and Template to keep terminology consistent, and async completion is reported through callback_url webhooks on the endpoints that support them.

Common conventions across the HeyGen API

Authentication: Every endpoint requires your HeyGen API key in the x-api-key header. Obtain it from your HeyGen dashboard.
Safe retries: Mutation endpoints accept an optional Idempotency-Key header. A retry within 24 hours that reuses the key replays the original response, so you can retry safely without creating duplicate jobs.
Asynchronous results: Video jobs are long-running. Provide a callback_url, and optionally a callback_id, to receive a webhook when rendering completes instead of polling.
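
These conventions can be combined into one request builder, sketched below with only the Python standard library. The x-api-key and Idempotency-Key headers and the callback_url and callback_id names come from this guide. The base URL, the placement of the callback fields in the JSON body, and the use of a UUID as the idempotency key are assumptions. The function only constructs the request; nothing is sent.

```python
import json
import urllib.request
import uuid
from typing import Any, Dict, Optional

API_BASE = "https://api.heygen.com"  # assumed base URL


def build_headers(
    api_key: str, idempotency_key: Optional[str] = None
) -> Dict[str, str]:
    """Headers shared by every call; the idempotency key is optional."""
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key
    return headers


def build_request(
    path: str,
    body: Dict[str, Any],
    api_key: str,
    idempotency_key: Optional[str] = None,
    callback_url: Optional[str] = None,
    callback_id: Optional[str] = None,
) -> urllib.request.Request:
    """Construct (but do not send) a POST request to a mutation endpoint."""
    payload = dict(body)
    # Assumption: the callback fields travel in the JSON body.
    if callback_url:
        payload["callback_url"] = callback_url
    if callback_id:
        payload["callback_id"] = callback_id
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=build_headers(api_key, idempotency_key),
        method="POST",
    )


# Generate the key once and reuse it on every retry of the same job.
key = str(uuid.uuid4())
request = build_request(
    "/v3/videos",
    {"example_field": "value"},  # placeholder body
    api_key="YOUR_API_KEY",
    idempotency_key=key,
    callback_url="https://example.com/heygen-webhook",
    callback_id="job-42",
)
# To send it: urllib.request.urlopen(request)
```

The important detail is that the key is created outside the retry loop: a fresh key on each attempt would defeat the replay behavior and could create duplicate jobs.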

Getting started

Pick the API that matches your job from the table above, authenticate with your x-api-key header, and provide a callback_url to receive results when rendering finishes. Full request and response schemas are in the HeyGen API documentation for every endpoint.

Check out the API docs

Frequently asked questions

What is the HeyGen API?

The HeyGen API is a set of REST endpoints for creating, translating, and voicing AI video programmatically, without cameras or studios. It spans six core APIs covering generation, prompt-to-video, templated video, translation, transcript proofreading, and text to speech.

How many APIs does HeyGen have, and what are they?

HeyGen offers six core video APIs: Video Generation, Video Agent, Template, Video Translation, Proofread, and Text to Speech.

Which HeyGen API translates video?

The HeyGen Video Translation API translates and dubs video into other languages with voice cloning and lip-sync, at POST /v3/video-translations.

What is the difference between the Video Generation API and the Video Agent API?

The Video Generation API renders an avatar speaking a script or audio you provide, giving you direct control over the avatar, voice, and output. The Video Agent API takes a single prompt and handles scripting, avatar selection, scene composition, and rendering for you, which is faster but gives you less direct control.

Does HeyGen have a text to speech API?

Yes. The HeyGen Text to Speech API synthesizes speech from text using starfish-engine voices. It accepts plain text or SSML, supports a 0.5x to 2.0x speed range, and returns an audio URL with duration and optional word-level timestamps, at POST /v3/voices/speech.

How many languages does HeyGen Video Translation support?

The HeyGen Video Translation API supports 175+ languages and dialects with 99% lip-sync accuracy.

What is the HeyGen Proofread API for?

The HeyGen Proofread API extracts an editable transcript from a video, enabling you to correct names, jargon, and errors before translating. This improves the accuracy of the final localized videos.

Can I generate personalized videos at scale with HeyGen?

Yes. The HeyGen Template API lets you define avatar, voice, layout, and branding once, then render many variations by passing values into typed template variables, at POST /v2/template/{template_id}/generate.

How do I authenticate with the HeyGen API?

Send your HeyGen API key in the x-api-key request header. You can obtain the key from your HeyGen dashboard.

Written by Tony Faccenda
