faster-whisper: A Faster Whisper Transcription Engine

faster-whisper reimplements Whisper inference with CTranslate2, making local and server-side speech-to-text faster and more memory efficient.

faster-whisper is a Whisper inference implementation maintained by SYSTRAN. It uses CTranslate2 as the backend, keeping the workflow close to openai/whisper while improving inference speed, memory use, and deployment flexibility for engineering work.

If you have used openai/whisper, you can think of faster-whisper as a more production-oriented alternative. The interface still centers on loading a model, transcribing audio, and reading segmented results, but the execution layer is faster and easier to tune around CPU, GPU, quantization, and batching.

What problem does it solve?

Whisper works well, but the original implementation often runs into a few issues when deployed directly:

  1. Long audio can take a noticeable amount of time to transcribe.
  2. GPU memory usage can be high.
  3. CPU execution works, but speed may not be ideal.
  4. Throughput is not always easy to scale when processing large batches of audio or video.

faster-whisper mainly optimizes around these problems. Its README states that, with the same accuracy, it can be up to 4 times faster than openai/whisper while using less memory. With 8-bit quantization, speed can improve further.

Installation

In a regular Python environment, install it directly:

pip install faster-whisper

If you want to use a GPU, make sure your local CUDA, cuDNN, and CTranslate2 versions are compatible. This is the easiest place to stumble: the code itself may be fine, but inference can fail when loading the model or running the first request if the GPU driver and CUDA runtime do not match.
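
Before loading a large model, it can help to confirm that CTranslate2 actually sees the GPU and supports the precision you plan to use. A minimal sanity check, assuming the ctranslate2 package installed alongside faster-whisper:

import ctranslate2

# How many CUDA devices CTranslate2 can see (0 means GPU inference will not work).
cuda_devices = ctranslate2.get_cuda_device_count()
print("CUDA devices:", cuda_devices)

# Which compute types the local hardware supports, e.g. float16, int8_float16, int8.
if cuda_devices > 0:
    print("GPU compute types:", ctranslate2.get_supported_compute_types("cuda"))
print("CPU compute types:", ctranslate2.get_supported_compute_types("cpu"))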

Basic usage

The minimal example is straightforward:

from faster_whisper import WhisperModel

model_size = "large-v3"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

The key parameters are:

Parameter      Purpose
model_size     Selects the Whisper model size, such as small, medium, or large-v3
device         Inference device, commonly cuda or cpu
compute_type   Compute precision, such as float16, int8_float16, or int8
beam_size      Decoding search width; larger values are usually more stable but slower

If your goal is quick local transcription, start by testing medium or large-v3. If GPU memory is tight, consider quantization.

Choosing CPU or GPU

With an NVIDIA GPU, prefer:

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

If GPU memory is not enough, switch to:

model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

Without a GPU, run it on CPU:

model = WhisperModel("small", device="cpu", compute_type="int8")

CPU mode is better for lightweight jobs, low-frequency background tasks, or servers without a graphics card. For a large amount of long audio, GPU is still the better fit.
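
If the same script has to run on machines with and without a GPU, one option is to choose the device at startup. A minimal sketch along those lines, using CTranslate2's device-count helper; the fallback order (large-v3 with float16 on GPU, small with int8 on CPU) is just one reasonable choice:

import ctranslate2
from faster_whisper import WhisperModel

def load_model() -> WhisperModel:
    # Prefer the GPU with float16; otherwise fall back to a smaller model on CPU
    # with int8, matching the CPU recommendation above.
    if ctranslate2.get_cuda_device_count() > 0:
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    return WhisperModel("small", device="cpu", compute_type="int8")

model = load_model()
segments, info = model.transcribe("audio.mp3", beam_size=5)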

Batched transcription

faster-whisper also provides batched transcription. Batching is useful for many short audio files or when you need higher GPU throughput:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("turbo", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

A larger batch_size is not automatically better: it improves throughput, but it also increases GPU memory pressure. In practice, test values like 4, 8, and 16 step by step until you find a stable point for your machine.
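
One way to find that stable point is to time the same clip at increasing batch sizes and stop when throughput flattens or memory runs out. A rough sketch, reusing the pipeline above:

import time
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("turbo", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

for batch_size in (4, 8, 16):
    start = time.perf_counter()
    segments, info = batched_model.transcribe("audio.mp3", batch_size=batch_size)
    # transcribe() returns a generator; consume it so the work actually runs.
    text = "".join(segment.text for segment in segments)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.1f}s")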

VAD and word-level timestamps

Speech-to-text often has to deal with long silence, background noise, and subtitle alignment. faster-whisper includes practical parameters that can be enabled directly during transcription.

Enable VAD:

segments, info = model.transcribe("audio.mp3", vad_filter=True)

Get word-level timestamps:

segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

VAD is useful for meeting recordings, podcasts, and livestream replays that contain long silent sections. Word-level timestamps are useful for subtitles, transcript proofreading, and player-side word highlighting.
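
Both options can be combined, and the VAD behavior can be tuned through vad_parameters (min_silence_duration_ms, documented in the project README, controls how long a pause must last before it is treated as silence). A small sketch for subtitle-style output:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Filter out long silences and keep per-word timing, e.g. for subtitle alignment.
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    word_timestamps=True,
)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))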

Choosing a model

Model choice mainly depends on accuracy, speed, and machine resources.

Scenario                                       Recommendation
Quick testing                                  small or medium
Chinese content with quality as the priority   large-v3
Tight GPU memory                               int8_float16 or a smaller model
CPU background tasks                           Smaller model plus int8
Many short audio files                         Try BatchedInferencePipeline

For Chinese speech, start with large-v3 if quality matters. If the machine is under too much pressure, drop to a smaller model or use quantization. Do not optimize for speed alone at the start; if transcription quality drops, the extra manual proofreading time may cancel out the inference time you saved.

Suitable use cases

faster-whisper is well suited for:

  1. Generating video subtitles.
  2. Transcribing podcasts, meetings, and course recordings.
  3. Building local transcription workflows for Bilibili, YouTube, and similar videos.
  4. Batch archiving and searching audio content.
  5. Feeding speech content into RAG, knowledge bases, or search systems.

It does not directly solve higher-level tasks such as speaker diarization, summarization, or chapter segmentation, but it can serve as a stable transcription layer. You can add pyannote for speaker diarization and an LLM for summarization and structured cleanup.

Deployment suggestions

For real use, debug in this order:

  1. Use a 1-to-3-minute audio clip to confirm the environment runs correctly.
  2. Test accuracy with samples that match your target language and audio quality.
  3. Check GPU memory usage before deciding whether to enable quantization.
  4. Split long audio first, so a failed task does not require rerunning everything.
  5. Save both TXT and SRT outputs to make later proofreading easier (see the sketch after this list).
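
For the SRT part, here is a minimal sketch that writes both a TXT and an SRT file from the segments returned by model.transcribe; the timestamp formatting is hand-rolled for illustration and not part of faster-whisper:

def format_srt_time(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_outputs(segments, txt_path="audio.txt", srt_path="audio.srt"):
    # segments is the iterator returned by model.transcribe(...).
    with open(txt_path, "w", encoding="utf-8") as txt, \
         open(srt_path, "w", encoding="utf-8") as srt:
        for i, segment in enumerate(segments, start=1):
            text = segment.text.strip()
            txt.write(text + "\n")
            srt.write(f"{i}\n")
            srt.write(f"{format_srt_time(segment.start)} --> {format_srt_time(segment.end)}\n")
            srt.write(text + "\n\n")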

For server-side tasks, load the model during service startup instead of reloading it for every request. Model loading takes time, and frequent reloading can also make GPU memory management less stable.
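
One common way to do that is a module-level model created once when the service starts. A minimal sketch assuming FastAPI; the endpoint name and response shape are illustrative only:

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()

# Load the model once at startup, not per request.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # faster-whisper accepts a file path or a binary file-like object.
    segments, info = model.transcribe(file.file, beam_size=5)
    return {
        "language": info.language,
        "text": "".join(segment.text for segment in segments),
    }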

Summary

The value of faster-whisper is that it turns Whisper into a transcription component better suited for long-term use. It is not a different model; it runs the same Whisper models on a more efficient inference backend with a more engineering-friendly interface.

For personal workflows, it can quickly turn videos, meetings, and course audio into text. For server-side tasks, you can tune performance with GPU execution, quantization, batching, and VAD. As long as the machine environment is configured correctly, it is better suited than the original Whisper implementation for stable, batch speech-to-text work.

Project: https://github.com/SYSTRAN/faster-whisper
