Transcription Modes, Explained

August 23, 2023
Leif Foged

Today, we're delving into TurboScribe's transcription engine, focusing on its three transcription modes – Cheetah, Dolphin, and Whale.

What's the difference between these 3 modes? Here's the TLDR:

  • 🐆 Cheetah is the fastest mode. It transcribes 1 hour of audio or video in 30 seconds. It's tuned to deliver you a transcript as fast as possible.
  • 🐬 Dolphin delivers very high accuracy, while still being very fast. It takes about 3 minutes to transcribe 1 hour of audio or video.
  • 🐳 Whale is tuned for maximum accuracy. It transcribes 1 hour of audio or video in less than 10 minutes.

When uploading a file, you can choose between any of these 3 modes (🐳 Whale is the default).

So which should you choose? We recommend starting with the default (Whale) for maximum accuracy and switching to Dolphin or Cheetah when you need transcripts even faster.

For those of you who want a better idea of what's going on under the hood, feel free to keep on reading.

Whisper: More Than Meets the Ear

Let's take a closer look at Whisper, the AI technology behind TurboScribe's transcription.

Whisper isn't just a single AI model; it's actually a family of five models, each with varying trade-offs between accuracy and speed.

At the low end, Whisper begins with the "tiny" model (at "just" 39 million parameters) and goes all the way up to "large" (at 1.55 billion parameters).

"tiny" is the fastest, but makes the most errors. "base" and "small" are better than most humans. "large" is the most accurate (comparable to professional transcribers and translators), but requires lots of memory and expensive hardware.

Whisper's 5 Models

The Whisper family contains 5 different AI models:

  • tiny — 39 million parameters
  • base — 74 million parameters (powers TurboScribe's 🐆 Cheetah mode)
  • small — 244 million parameters (powers TurboScribe's 🐬 Dolphin mode)
  • medium — 769 million parameters
  • large — 1.55 billion parameters (powers TurboScribe's 🐳 Whale mode)
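The mapping in this list can be written down as a small lookup table. The Python sketch below is only an illustration: the class, dictionary, and function names are made up for this example and are not part of TurboScribe or Whisper. The parameter counts are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperModel:
    name: str
    parameters: int  # parameter count as listed in this article


# The five Whisper models and their sizes.
MODELS = {
    "tiny": WhisperModel("tiny", 39_000_000),
    "base": WhisperModel("base", 74_000_000),
    "small": WhisperModel("small", 244_000_000),
    "medium": WhisperModel("medium", 769_000_000),
    "large": WhisperModel("large", 1_550_000_000),
}

# Hypothetical mode keys, mapped to the model that powers each mode.
MODE_TO_MODEL = {"cheetah": "base", "dolphin": "small", "whale": "large"}


def model_for_mode(mode: str) -> WhisperModel:
    """Return the Whisper model behind a transcription mode."""
    return MODELS[MODE_TO_MODEL[mode.lower()]]


def size_ratio(mode_a: str, mode_b: str) -> float:
    """How many times more parameters mode_a's model has than mode_b's."""
    return model_for_mode(mode_a).parameters / model_for_mode(mode_b).parameters


if __name__ == "__main__":
    for mode in MODE_TO_MODEL:
        model = model_for_mode(mode)
        print(f"{mode}: {model.name} ({model.parameters:,} parameters)")
    print(f"Whale vs Cheetah: {size_ratio('whale', 'cheetah'):.1f}x the parameters")
```

Seen this way, the model behind Whale has roughly 21 times as many parameters as the one behind Cheetah.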

Unfortunately, without a particularly powerful computer or access to a beefy GPU (graphics processing unit), most people will struggle to run models larger than "base" efficiently.
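A rough way to see why the larger models need serious hardware is to estimate how much memory the model weights alone occupy. The Python sketch below assumes 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats; the function name is made up for illustration. The result is only a lower bound, since real memory use during transcription is higher than the weights alone.

```python
# Bytes needed to store one parameter at a given precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(parameters: int, dtype: str = "float32") -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    sizes = {"base": 74_000_000, "small": 244_000_000, "large": 1_550_000_000}
    for name, params in sizes.items():
        print(
            f"{name}: about {weight_memory_gb(params):.2f} GB in float32, "
            f"{weight_memory_gb(params, 'float16'):.2f} GB in float16"
        )
```

By this estimate, the "large" model's weights take about 6.2 GB in 32-bit precision, compared with about 0.3 GB for "base".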

GPUs are the secret to transcribing audio fast. Unfortunately, they're also quite expensive. As of this writing, a single Nvidia A100 — the chip "powering the race for AI" — costs $6,715.00 on Amazon.

TurboScribe uses GPUs to significantly speed up transcription and get more done, faster.

Comparing Transcription Times

Let's compare each of TurboScribe's modes on our GPU-powered transcription engine by transcribing a 1 hour audio file about World War 2.

🐆 Cheetah

Cheetah prioritizes delivering accurate transcripts at maximum speed, powered by the 74 million parameter "base" model. Transcribing our 1 hour audio file took just 20 seconds. In other words, it's fast.

🐬 Dolphin

Dolphin, powered by the 244 million parameter "small" model, takes a bit over twice as long as Cheetah (which is still pretty fast).

🐳 Whale

Finally, Whale takes about 3 minutes to transcribe the same 1 hour audio file (with the massive 1.55 billion parameter Whisper "large-v2" model).

Keep in mind that transcription times can vary slightly.

For example, transcribing a large, 4GB video file (with 2 hours of audio) will take a bit more time than a smaller 100MB MP3 file with the same 2 hours of audio — this is mostly because we have to spend more time transferring, analyzing, preprocessing, and converting your media file before we actually begin transcription.

Audio files with little detectable human speech (think an audio recording with lots of silent periods) can usually be transcribed more quickly. Furthermore, transcribing multiple files is also usually faster than transcribing a single file.
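To put the measured timings in perspective, they can be expressed as a speed-up over real time. The Python sketch below does that arithmetic for the Cheetah (20 seconds) and Whale (about 3 minutes) figures from the test above. The function names are made up for illustration, and the linear estimate is a rough assumption that ignores file transfer and preprocessing overhead.

```python
def speedup(audio_seconds: float, transcription_seconds: float) -> float:
    """How many times faster than real time a transcription ran."""
    return audio_seconds / transcription_seconds


def estimate_seconds(audio_seconds: float, speedup_factor: float) -> float:
    """Rough estimate of transcription time, assuming time scales linearly."""
    return audio_seconds / speedup_factor


ONE_HOUR = 3600

cheetah = speedup(ONE_HOUR, 20)     # 180.0 (1 hour of audio in 20 seconds)
whale = speedup(ONE_HOUR, 3 * 60)   # 20.0 (1 hour of audio in about 3 minutes)

if __name__ == "__main__":
    print(f"Cheetah: about {cheetah:.0f}x real time")
    print(f"Whale: about {whale:.0f}x real time")
    # A 2 hour recording at Cheetah's measured speed, overhead not included.
    print(f"2 hours with Cheetah: about {estimate_seconds(2 * ONE_HOUR, cheetah):.0f} seconds")
```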

Comparing Accuracy

For many common audio and video files, there is no meaningful difference in accuracy between 🐆 Cheetah, 🐬 Dolphin, and 🐳 Whale.

Where 🐬 Dolphin and 🐳 Whale really shine is in cases where contextual clues are required to disambiguate similar-sounding words.

For example, in a choppy, fast-paced legal recording with high amounts of background noise, the term "Habeas Corpus" was mistranscribed as "happy is porpoise" with 🐆 Cheetah. However, based on the context of the surrounding conversation (which involved other legal terms), both 🐬 Dolphin and 🐳 Whale correctly determined that "Habeas Corpus" is the most likely transcription.

Here's another example: in an audio recording, a woman named Kristina Hernandez introduces herself and spells her name.

🐆 Cheetah incorrectly transcribes her name as "Christina" (rather than "Kristina"):

(Speaker 1) My name is Christina Hernandez. That's spelled K R I S T I N A H E R N A N D E Z. (Speaker 2) Thank you, Christina.

🐬 Dolphin incorrectly transcribes the first use of her name, but gets the second use right (after she spells out her name):

(Speaker 1) My name is Christina Hernandez. That's spelled K R I S T I N A H E R N A N D E Z. (Speaker 2) Thank you, Kristina.

🐳 Whale gets both usages correct:

(Speaker 1) My name is Kristina Hernandez. That's spelled K R I S T I N A H E R N A N D E Z. (Speaker 2) Thank you, Kristina.
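One way to quantify the difference between these three transcripts is to count word-level errors against the correct text. The Python sketch below uses a standard word-level edit distance (substitutions, insertions, and deletions). It is an illustration only, not how TurboScribe measures accuracy; the helper names are made up, and the speaker labels are left out for simplicity.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance between a reference and a transcript."""
    ref, hyp = words(reference), words(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


SPELLED = "That's spelled K R I S T I N A H E R N A N D E Z."
reference = f"My name is Kristina Hernandez. {SPELLED} Thank you, Kristina."
transcripts = {
    "Cheetah": f"My name is Christina Hernandez. {SPELLED} Thank you, Christina.",
    "Dolphin": f"My name is Christina Hernandez. {SPELLED} Thank you, Kristina.",
    "Whale": f"My name is Kristina Hernandez. {SPELLED} Thank you, Kristina.",
}

if __name__ == "__main__":
    for mode, text in transcripts.items():
        print(f"{mode}: {word_errors(reference, text)} word error(s)")
```

On this short example, Cheetah has 2 word errors, Dolphin has 1, and Whale has none.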

Improving Accuracy With Metadata

There are cases where even a human transcriber can't unambiguously determine a correct transcription. For example, if Kristina had never spelled her name, it would have been impossible (based on the audio alone) to determine the correct spelling of her name.

To improve accuracy even further, TurboScribe uses metadata attached to audio and video files you upload (such as the file name, title, and description) to automatically improve transcriptions of terms that can't be unambiguously determined based on the audio alone.

For example, if the MP3 metadata title, artist, or comment references "Kristina Hernandez", all 3 modes are much more likely to transcribe her name correctly.
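How TurboScribe feeds metadata into its engine isn't described here, but the general idea can be sketched. The hypothetical Python helper below collects a few metadata fields into a short hint string. With the open-source Whisper package, a string like this could be passed as the `initial_prompt` argument of `transcribe()` to nudge the model toward particular spellings. The field names and the function are assumptions made for illustration.

```python
from typing import Mapping, Sequence


def build_hint(
    metadata: Mapping[str, str],
    fields: Sequence[str] = ("title", "artist", "comment"),
) -> str:
    """Join the non-empty, non-duplicate metadata values into one hint string."""
    seen: list[str] = []
    for field in fields:
        value = metadata.get(field, "").strip()
        if value and value not in seen:
            seen.append(value)
    return ". ".join(seen)


# Hypothetical metadata read from an uploaded MP3 file.
mp3_tags = {
    "title": "Interview with Kristina Hernandez",
    "artist": "Kristina Hernandez",
    "comment": "",
}

hint = build_hint(mp3_tags)

if __name__ == "__main__":
    print(hint)
    # With the open-source Whisper package, the hint could be used like this
    # (not run here, since it needs the package and a model download):
    #   model.transcribe("interview.mp3", initial_prompt=hint)
```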

Wrapping Up

In summary, TurboScribe offers three transcription modes:

  • 🐆 Cheetah provides accurate transcriptions as quickly as possible.
  • 🐬 Dolphin aims for the perfect balance between accuracy and speed.
  • 🐳 Whale maximizes accuracy, but takes a bit longer. It's TurboScribe's default mode.

The best way to truly grasp their capabilities is to try them out yourself. Start for free and transcribe up to 4 audio or video files every day.

About TurboScribe

TurboScribe converts audio and video to accurate text in seconds, powered by AI.
