Create Accessible Captions and Transcripts

Captions and transcripts are valuable tools that provide access to audio content. Captions describe dialogue, music, and sound effects in videos, while transcripts are documents that describe audio-only and video-only content, such as podcasts and animations.

To learn more about when and why to use captions and transcripts, review the resource Creating Accessible Synchronized Media.

General Guidelines for Captions

Captions are typically required for videos that have an audio track. When creating captions for any pre-recorded video with audio, follow these general guidelines:

Synchronize the captions to the corresponding audio in the audio track. The text and the speech or sound that the text describes must appear at the same time.
Use appropriate spelling, grammar, and punctuation. Captions must have at least 99% accuracy to be readable.
Keep captions on screen long enough for viewers to read the text.
Keep captions off the screen when no meaningful sounds are introduced.
Use a consistent style throughout the captions for identifying speakers, sound effects, and music.
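
The guidelines above do not say how the 99% accuracy figure is measured. One possible reading, assumed here for illustration only, is word-level accuracy against a verified transcript. The sketch below uses a hypothetical helper named word_accuracy that compares the two texts word by word, ignoring capitalization but treating attached punctuation as part of each word.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, captions: str) -> float:
    """Return the share of reference words that the captions reproduce in order.

    Assumption: "accuracy" means word-level agreement with a verified
    transcript. The guideline itself does not define the measurement.
    """
    ref_words = reference.lower().split()
    cap_words = captions.lower().split()
    if not ref_words:
        return 1.0  # nothing to get wrong
    matcher = SequenceMatcher(a=ref_words, b=cap_words, autojunk=False)
    # Sum the sizes of all runs of words that match in both texts.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


def meets_accuracy_target(reference: str, captions: str, target: float = 0.99) -> bool:
    """Check the captions against the 99% target named in the guidelines."""
    return word_accuracy(reference, captions) >= target
```

A score like this can only flag captions for review. It does not check synchronization, punctuation quality, or whether sound effects are described.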

General Guidelines for Transcripts

When creating transcripts for videos and audio-only and video-only content, follow these guidelines:

For audio-only content, use the same guidelines as captions for determining which sounds to transcribe and what audio information must be included.
For video content, use the same guidelines as audio descriptions for determining which imagery to describe and what visual information must be included.
Ensure that users can access the transcript in the same place as the original content.
Provide the transcript in an accessible format, like a Section 508-conformant web page, a plain text file, or a conformant Microsoft Word document.

Creating a Caption-Friendly Video

The best way to make content accessible is to plan for it from the beginning. You can make your video caption-friendly as you plan and develop it and avoid several major accessibility issues when it’s time to make the captions.

When creating video content, follow these guidelines:

Avoid overlapping voices whenever possible.
Have speakers talk at a normal or slow pace. When the speech in a video is too fast, a captioner has to choose between keeping the captions on screen too briefly to read or allowing the captions to fall out of sync with the audio. Generally, speech faster than 180 words per minute (an average of 3 words per second) may be too fast for captions.
Keep any on-screen text in the top two-thirds of the screen. Most video players display the captions in the lower third section of the screen, and may not allow you to move the captions when video text appears in that area.
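
The 180 words per minute guideline above can be checked with simple arithmetic once you know how many words are spoken in a given time span. The sketch below is a minimal illustration. The function names and the idea of measuring one passage at a time are assumptions, not part of the guidance.

```python
MAX_WORDS_PER_MINUTE = 180  # threshold from the guideline above


def words_per_minute(text: str, start_seconds: float, end_seconds: float) -> float:
    """Speech rate for a passage spoken between two timestamps."""
    duration = end_seconds - start_seconds
    if duration <= 0:
        raise ValueError("end_seconds must be greater than start_seconds")
    return len(text.split()) * 60.0 / duration


def is_too_fast(text: str, start_seconds: float, end_seconds: float,
                limit: float = MAX_WORDS_PER_MINUTE) -> bool:
    """True when the passage is spoken faster than the limit."""
    return words_per_minute(text, start_seconds, end_seconds) > limit
```

For example, six words spoken in two seconds is exactly 180 words per minute, which is at the limit but not over it.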

Displaying Captions

When determining how captions appear on a video, follow these guidelines:

Ensure that the font style, size, and color meet all Section 508 requirements for readable body text. Section 508 best practice is to use a sans serif font, like Helvetica or Arial. As a default, use an 18-point font size and white text on a black translucent background. Adjust or change these as needed to ensure readability for the video player used.
Use the same caption text and background color for all captions. Do not change the text color or the caption background color to convey meaning, since users with color blindness may not be able to see these differences.
Use no more than two lines of text at a time, with no more than 45 characters per line (though fewer characters per line is ideal).
Display the captions in the center of the lower one-third section of the video, except when they block important text, like signs or person identifiers.
Avoid scrolling, flashing, and other distracting animation effects. The text must remain in the same position long enough for the viewer to read it.
If you can customize the settings available within the video player, allow users to change caption settings, like the font size, color, and placement. Ensure that the captions are written so that changing these settings does not change their meaning, such as when a change in font size moves the captions to a different place on the screen.
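
The limits of two lines and 45 characters per line are easy to check automatically. The sketch below assumes each caption is available as a string with its lines separated by newline characters. The function name caption_layout_problems is hypothetical.

```python
MAX_LINES = 2
MAX_CHARS_PER_LINE = 45


def caption_layout_problems(caption: str) -> list[str]:
    """Return a list of layout problems for one caption (empty if none)."""
    problems = []
    lines = caption.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines (maximum is {MAX_LINES})")
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(
                f"line {number} has {len(line)} characters "
                f"(maximum is {MAX_CHARS_PER_LINE})"
            )
    return problems
```

A check like this covers only line count and length. Font, color, placement, and timing still need to be reviewed in the video player.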

Example

In Figure 1, the font, text size, color, and placement all meet Section 508 requirements and best practices. The caption is centered in the lower part of the screen, and the text is in a sans serif font with a font size of 18 points. The text is white on a slightly translucent background, and is 44 characters long.

Figure 1. Example of correct font, text size, color, and placement.

Automatically Generated Captions

Some programs automatically generate captions using speech-to-text technology. This can be a helpful starting point, especially for content creators who do not work with captions on a regular basis. However, current auto-captioning technology does not meet the minimum standard for pre-recorded media. Auto-captions typically don’t use grammar and punctuation correctly, often include spelling errors, and often include awkward line breaks that make them much harder to read.

Because of this, always take time to edit the captions when you start with auto-captions.

When working with auto-captioning, follow these guidelines:

Add appropriate capitalization, punctuation, and grammar.
Check that the spelling of all names, places, and terms is correct.
Correct any spelling errors where a similar-sounding word was substituted, such as when the program transcribes “weight” instead of “wait.”
Add captions for sound effects, music, and other audio elements that the auto-captions fail to capture.
Adjust the line breaks so that each individual caption starts and ends at the most logical point, usually at a period or comma.
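
The line-break guideline above can be partly automated. The sketch below is one possible approach, not a standard algorithm: it fills each line up to a character limit, then looks back for the most recent word ending in punctuation and breaks there if the words carried over still fit on the next line. The 45-character default comes from the Displaying Captions section. The function name and the punctuation set are assumptions.

```python
BREAK_PUNCTUATION = ".,;:?!"


def break_into_lines(text: str, limit: int = 45) -> list[str]:
    """Split caption text into lines, preferring breaks after punctuation."""
    lines: list[str] = []
    current: list[str] = []
    for word in text.split():
        if not current or len(" ".join(current + [word])) <= limit:
            current.append(word)
            continue
        # The line is full. By default, break right before the new word.
        cut = len(current)
        # Look backwards for a word that ends in punctuation.
        for i in range(len(current) - 1, -1, -1):
            carried = current[i + 1:] + [word]
            if len(" ".join(carried)) > limit:
                break  # carrying this much would overflow the next line
            if current[i][-1] in BREAK_PUNCTUATION:
                cut = i + 1
                break
        lines.append(" ".join(current[:cut]))
        current = current[cut:] + [word]
    if current:
        lines.append(" ".join(current))
    return lines
```

A single word longer than the limit is left on its own line. The output is only a starting point, since a human editor still needs to confirm that each break reads naturally and matches the timing of the audio.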

Incorrect Example

In Figure 2, caption text reads: “soon. If doctor low agrees we can start the”

In this incorrect example, the text is hard to read because of auto-caption errors.

Figure 2. Incorrect automatically generated captions.

Correct Example

In Figure 3, caption text reads: “If Doctor Lowe agrees, we can start the treatment”

In this correct example, the grammar errors are fixed, the starting and ending times for each caption are adjusted for a more logical flow, and the captions are much more readable.

Figure 3. Automatically generated captions after being corrected for readability.

Describing How a Person Speaks

Sometimes, it is not enough just to write what words are spoken. Often, the way a person speaks conveys meaning, and must be included in the captions.

Here are some situations where additional descriptors may need to be added:

The speaker’s emotion, volume, or tone is not obvious from the words alone. Add a descriptor in brackets or parentheses before the dialogue, like [angrily], [whispering], or [sarcastically].
The speaker uses a language other than the video’s default language. Indicate this in brackets or parentheses, like [in Spanish].
The speaker emphasizes specific words. Put these words in italics.
The speaker makes a sound that is not speech, like coughing or whistling. Treat these sounds as sound effects (see below).

In all cases, read the text itself and determine whether the text alone conveys the full meaning. If it does not, use short descriptors to provide equivalent information.

Example

In Figure 4, caption text reads: “(shouting) Pick me! I wanna go first!”

Figure 4. Captions describing how a person speaks to convey meaning.

Unintelligible Speech and Filler Words

Sometimes a speaker is difficult to understand. It may be because of background noise or because of gaps or distortion in the audio (like on a video call with a bad connection). It can also be due to the speech itself, which may include stutters, sentence fragments, accents and slang, or excessive filler words.

If the speech is difficult to understand, follow these guidelines:

When possible, include all of the speech word-for-word, including all filler words.
If an entire section has no understandable speech, include a descriptor like “unintelligible.”
Indicate short gaps, breaks, or pauses using em dashes (“—”) or ellipses (“…”).
When necessary, include descriptors to clarify intention, like “garbled with static,” or “haltingly.”
If there is an excessive amount of filler words like “um” or “you know,” and they obscure the meaning or make the captions difficult to read, some or all of them may be omitted. In these cases, ensure that the same meaning is still conveyed and the viewer has equivalent access to the content as it is experienced by hearing viewers.
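
Deciding whether filler words are "excessive" is an editorial judgment, but a rough count can help flag captions that deserve a closer look. The sketch below measures what share of a caption's words are fillers. The filler list and the function name filler_ratio are assumptions chosen for illustration, and the guidelines do not define a numeric threshold.

```python
import re

# Illustrative list only; extend it to suit the speakers in your content.
FILLERS = ("um", "uh", "you know")


def filler_ratio(caption: str) -> float:
    """Return the fraction of words in the caption that belong to a filler."""
    # Normalize curly apostrophes so contractions stay one word.
    normalized = caption.lower().replace("\u2019", "'")
    words = re.findall(r"[a-z']+", normalized)
    if not words:
        return 0.0
    text = " ".join(words)
    filler_words = 0
    for filler in FILLERS:
        occurrences = len(re.findall(rf"\b{re.escape(filler)}\b", text))
        # A multi-word filler such as "you know" counts as two words.
        filler_words += occurrences * len(filler.split())
    return filler_words / len(words)
```

A high ratio is only a prompt for review. Phrases like "you know" can also be meaningful speech, so the final decision to omit anything should rest on whether the viewer still gets equivalent access to the content.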

Example with Unedited Text

In Figure 5, caption text reads: “(calmly) I agree, but, um, I think…I think we could”

In this example, there are filler words and gaps, but the content is still understandable. Because of this, the captions must display exactly what the speaker said, word for word.

Figure 5. Captions displaying unedited unintelligible speech and filler words.

Example with Edited Text

In Figure 6, caption text before editing reads: “So, um, if…I mean, if we can’t, um…if we can’t find, uh, you know, a solution to, you know,”

In Figure 7, caption text after editing, with words in bold removed, reads: “So, um, if…I mean, if we can’t…if we can’t find…a solution to…this case,”

In this example, several of the filler words are removed, since removing them allows the reader to understand what is being said at a level equivalent to that of a hearing viewer. Notice that the captions still show some filler words and gaps to keep the intent and meaning as similar as possible.

Figure 6. Filler words caption example before editing.

Figure 7. Filler words caption example after editing.

Sound Effects

Captions illustrate what is being said, but they also convey other meaningful audio, like background noise or sound effects.

For sound effects, follow these guidelines:

Indicate sound effects with brackets or parentheses, like [static] or (doorbell).
Do not caption obvious or trivial sounds, like footsteps when it is visible that the person is walking.
Indicate when ongoing background noise starts, like traffic sounds, or the chatter of a large crowd.
When necessary, describe the sound in more detail, like “louder cheering” or “overlapping barking, meowing, and chirping.”
When meaningful, describe when an ongoing sound stops, like “engine turns off.”
Describe when there is unexpected silence, like “no audible dialogue” when it is visible that a speaker’s lips are moving.

Example

In Figure 8, caption text reads: “(door opens; several dogs barking)”

Figure 8. Captions displaying sound effects.

Offscreen Speakers

When a speaker is not visible, it can be difficult to identify who is talking. If a speaker is offscreen or if the viewer cannot easily identify who is speaking by looking at the video, identify the speaker by name, or by their role if their name is not known (for example, “Ms. Harris,” “Officer,” or “Narrator”).

Multiple Speakers

In general, videos should not have multiple speakers talking at the same time. When these situations are unavoidable, how you caption them depends on the specific context.

For background chatter where no specific words can be understood, treat it as a sound effect, like “cheering,” or “children arguing.”

When the speech can be understood, all of it must be included in the captions. This takes priority over ensuring the text is perfectly synchronized.

When possible, keep the order of the speakers consistent in the captions. For example, you may caption a news clip so that each newscaster’s speech is placed on the same side of the screen as that speaker. Or, when captioning a song, you may place the lead vocalist’s lyrics in the top line and the lyrics for the backup vocals below in parentheses.

Clearly communicate who is speaking when it is not immediately obvious from the video. Always identify the speakers when they start speaking, and use names, initials, or other indicators any time after that when it is not obvious who is speaking.

Captioning Different Languages

When a video includes speech in multiple languages, the captions must give viewers access to the speech that is equivalent to the access of those who can hear the dialogue.

When content includes speech in multiple languages, follow these guidelines:

If the speech is fully translated for hearing viewers, either with dubbing or subtitles, the captions must include the exact same translation.
Always include a descriptor to show when the spoken language changes. For example, include the descriptor (in Spanish) when a person starts speaking Spanish, then include the descriptor (in English) when the dialogue switches to English.
If the speech is not translated for hearing viewers:
Whenever possible, include exact wording in that language, using appropriate grammar, spelling, and punctuation for that language. For example: “Hola, ¿cómo está?”
If an exact transcription of the speech is not available, at least communicate any other meaningful details about the speech, like the tone. For example, “arguing in Korean.”

Captioning Music

When a video includes music, follow these captioning guidelines:

When a song starts, list the name and artist for the song, or communicate the tone or style of the song. For example, [Johann Pachelbel’s “Canon in D Major”], or [mellow techno music].
Include song lyrics if a person is visibly singing, if there is not much meaningful speech or other sounds, or when the lyrics are significant to understanding the scene. Omit lyrics when there is already a lot of speech occurring and the specific lyrics are not needed.
Indicate any important changes to the music, like when it starts or stops abruptly, or if the tone changes.

Profanity

If the audio content contains swearing, slurs, or other offensive language, follow these guidelines:

If the profanity is audible, include the exact words in the captions.
If the profanity is censored in the audio, indicate that the word or phrase was censored. For example, “I feel like (censored).”
Never replace profanity with euphemisms or substitute words that were not used in the original audio, like replacing “killed” with “unalived.”

Reviewed/Updated: May 2024