Explore Advanced Speech-To-Text technology for unparalleled accuracy and blazing fast conversion rates.

Minyang Chen
9 min read · Jan 16, 2024

All sound is just disturbances in the air around you, after all, caused by vibrations. When you whisper you prevent the vocal cords from vibrating, so you speak (as all the actual talking sounds are created in your mouth) but do not use your voice.

Talking is your mouth making sounds combined with your vocal cords vibrating. Humming is your vocal cords vibrating without your mouth making sounds.

Each distinct speech sound you make is called a phoneme, and every phoneme (consonant or vowel) has a theoretical voiced and unvoiced variant.

Your voice is created in your voice box, which houses the vocal cords. When you talk those cords vibrate and create sound.

Photo by Kristina Flour on Unsplash


In creating PAVAI’s Voice User Interface, I prioritize accurate speech detection and transcription, filtering out background noises through extensive research and experimentation with diverse frameworks and technologies.

My goal is to achieve seamless real-time speech-to-text transcription, all while maintaining reliable offline functionality.

Before discussing my evaluation results, I think it’s important to quickly cover some technology, terms and product names.

Voice activity detection (VAD) finds the parts of an audio signal that contain speech. It is used for pre-processing, data cleaning and preparation, and detecting speech up front reduces hallucination and enables batching with no WER degradation.
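To make the idea concrete, here is a minimal sketch of energy-based voice activity detection in plain Python. It is an illustration only: production systems typically rely on trained VAD models, and the function names, the 30 ms frame size, and the 0.02 threshold are assumptions chosen for this example.

```python
import math

def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return the RMS energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies

def detect_speech(samples, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) segments whose frame energy reaches the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None
    for i, energy in enumerate(energies):
        t = i * frame_len / sample_rate
        if energy >= threshold and seg_start is None:
            seg_start = t                      # a loud frame opens a segment
        elif energy < threshold and seg_start is not None:
            segments.append((seg_start, t))    # a quiet frame closes it
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(energies) * frame_len / sample_rate))
    return segments

# Synthetic signal: 0.3 s of silence, 0.3 s of a 440 Hz tone, 0.3 s of silence.
rate = 16000
n = rate * 3 // 10
silence = [0.0] * n
tone = [0.5 * math.sin(2 * math.pi * 440 * k / rate) for k in range(n)]
segments = detect_speech(silence + tone + silence, sample_rate=rate)
print(segments)
```

Only the middle segment would be passed on to the transcription model, which is how VAD keeps silence and noise from being transcribed.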

Speech to text (STT) is speech recognition software that converts spoken language into text through computational linguistics. Speech recognition technology has seen increased usage in recent years for automating the transcription of spoken language.

Automatic speech recognition (ASR) is the task of automatically converting spoken audio into text. OpenAI’s Whisper is one such ASR model.

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper does not natively support batching.

Architecturally, Whisper is a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

Whisper.cpp is a port of OpenAI’s Whisper model in C/C++. It is a plain C/C++ implementation without dependencies.

Distilled-Whisper (Distil-Whisper) is a distilled variant of Whisper for speech recognition, created by Hugging Face. It is 6x faster, 49% smaller, and within 1% word error rate of the original.

Faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

Speculative decoding is a technique for roughly 2x faster Whisper inference. It was proposed in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan et al. from Google. It works on the premise that a faster assistant model very often generates the same tokens as a larger main model.
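The premise can be sketched with toy "models" that are just Python functions returning the next token. This is a simplified greedy variant written for illustration, not the Transformers implementation: `target_next`, `draft_next`, and the draft length `k` are hypothetical names, and a real system verifies the whole draft in a single forward pass of the main model instead of a loop.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding over toy next-token functions.

    Returns (tokens, rounds), where rounds counts verification passes of the
    main (target) model.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens, one after another.
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks each proposed position.
        rounds += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)   # always keep the target's own token
            if expected != proposed:
                break                   # first disagreement: drop the rest of the draft
        tokens.extend(accepted)
    return tokens, rounds

def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

def target_next(seq):
    # Toy main model: counts upward modulo 10.
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Toy assistant: agrees with the target except right after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

spec_tokens, rounds = speculative_decode(target_next, draft_next, [0], 12)
plain_tokens = greedy_decode(target_next, [0], 12)
print(spec_tokens == plain_tokens, rounds)
```

Because every kept token is the one the target model would have chosen, the output matches plain greedy decoding; the saving comes from needing fewer target passes than generated tokens whenever the draft is usually right.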

WhisperX provides automatic speech recognition with word-level timestamps (and diarization), using faster-whisper as the backend.

Word error rate (WER) evaluates a speech-to-text system by measuring how closely its output matches a reference transcript. JiWER is a simple and fast Python package for evaluating an automatic speech recognition system. It supports the following measures:

  1. word error rate (WER)
  2. match error rate (MER)
  3. word information lost (WIL)
  4. word information preserved (WIP)
  5. character error rate (CER)

A quick example:

!pip install jiwer

from jiwer import wer
reference = "hello world"
hypothesis = "hello duck"

error = wer(reference, hypothesis)
# result 0.5
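To show where the 0.5 comes from, here is a dependency-free sketch of WER and CER built on Levenshtein edit distance. It is not JiWER's code: the function names are my own, it skips any text normalisation, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(word_error_rate("hello world", "hello duck"))  # 0.5
print(char_error_rate("hello world", "hello duck"))
```

One substituted word out of two reference words gives 1 / 2 = 0.5. MER, WIL and WIP additionally need the hit, substitution, deletion and insertion counts from the alignment, which is what JiWER computes for you.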

Test Result (Colab Notebook)


The testing method used here is straightforward: run 3 test audio files and collect the performance metrics.

Note: a comparison of text accuracy is not included in the test notebook.

Step.1 Select and Prepare audio test files of different sizes and formats

### Small audio file - 
### 1 minute long in WAV format
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav

### large audio files
### 2 hours long in MP3 format - investor conference audio file
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669.mp3
### 2 hours 30 minutes long in FLAC format - Sam podcast
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

Note: audio files of 2 hours or longer were chosen for the speed test.

Step.2 Run Transcription on 3 audio test files

Whisper Base

The base Whisper library can be installed with pip install -U openai-whisper. However, I use the Hugging Face library as the baseline model.

Install required libraries

# !pip install -q --upgrade torch torchvision torchaudio
# !pip install -q git+https://github.com/huggingface/transformers
# !pip install -q accelerate optimum
# !pip install -q ipython-autotime
# !sudo apt install ffmpeg
%load_ext autotime

Run 3 tests

import torch
import optimum
import transformers
from transformers import pipeline

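# Test files from Step.1, assumed to sit in the working directory
# (the variable names match the calls below)
audiofile1_60s = "ted_60.wav"
audiofile2_2hr07min = "4469669.mp3"
audiofile2_2hr30min = "sam_altman_lex_podcast_367.flac"

# Create pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
pipe = pipeline("automatic-speech-recognition", model=model_id, device=device)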

# TEST-1 (1 minute file)
# WAV Format
outputs = pipe(audiofile1_60s,chunk_length_s=30)
## result
## So in college, I was a government major, which means I had to write
## a lot of papers. Now, when a normal student writes a paper,
## they might spread the work out a little like this. So, you know, you get
## started maybe a little slowly, but you get enough done in the first week
## that with some heavier days later on, everything gets done and things stay
## civil. And I would want to do that, like that. That would be the plan. I
## would have it all ready to go, but then actually the paper would come
## along, and then I would kind of do this. And that would happen to every
## single paper. But then came my 90-page senior thesis, a paper you're
## supposed to spend a year on. I knew for a paper like that, my normal
## workflow was not an option. It was way too big a project. So I planned
## things out, and I decided it kind of had to go something like this. This
## is how the year would go. So I'd start off light, and I'd bump it up
## time: 5.96 s (started: 2024-01-16 13:09:29 -05:00)

## TEST-2 (2 hour long investor conference call)
## MP3 Format
outputs = pipe(audiofile2_2hr07min,chunk_length_s=30)
## time: 9min 16s (started: 2024-01-16 13:11:14 -05:00)

## TEST-3 (2 hour 30 minute podcast)
## FLAC Format
outputs = pipe(audiofile2_2hr30min,chunk_length_s=30)
## time: 13min 51s (started: 2024-01-16 13:27:39 -05:00)
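The chunk_length_s=30 argument above matters because Whisper processes audio in windows of about 30 seconds, so long recordings have to be cut into pieces. The sketch below shows one simple way to compute overlapping windows. It illustrates the idea and is not the pipeline's internal algorithm; `chunk_bounds` and the 5 second `overlap_s` are assumptions made for this example.

```python
def chunk_bounds(duration_s, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) windows in seconds that cover [0, duration_s]."""
    if overlap_s >= chunk_length_s:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    step = chunk_length_s - overlap_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start += step  # the next window re-reads the last overlap_s seconds
    return bounds

print(chunk_bounds(60.0))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 60.0)]
```

The overlap gives the model context on both sides of a cut, so that words falling on a boundary can be reconciled when the chunk transcripts are stitched together.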

After reviewing the output, the transcribed text appears accurate, so these results will be used as the measurement baseline.

Distilled Whisper

It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets.

# library 
!pip install --upgrade pip
!pip install --upgrade transformers accelerate datasets[audio]

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
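model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# NOTE: the pipeline arguments are cut off in the original listing. The values
# below follow the usual Distil-Whisper long-form setup and are assumptions,
# not a record of what was run.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)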

Run tests

# test data

result = pipe(audiofile1_60s)
# time: 1.04 s (started: 2024-01-16 14:05:19 -05:00)

result = pipe(audiofile2_2hr07min)
# time: 1min 22s (started: 2024-01-16 14:05:32 -05:00)

result = pipe(audiofile2_2hr30min)
# time: 1min 49s (started: 2024-01-16 14:06:55 -05:00)

Observation: incredible speed. The 2-hour-long file took only 1 min 22 s, compared with 9 min 16 s for the baseline.
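One way to put that number in perspective is to express it as a multiple of real time. The short calculation below is a sketch: the audio length is read off the test variable name (2 hr 07 min), so it is accurate only to the minute, and `speed_vs_real_time` is a name made up for this example.

```python
def speed_vs_real_time(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of processing."""
    return audio_seconds / processing_seconds

audio_s = 2 * 3600 + 7 * 60   # 2 hr 07 min, taken from the variable name
elapsed_s = 1 * 60 + 22       # 1 min 22 s reported above
factor = round(speed_vs_real_time(audio_s, elapsed_s), 1)
print(factor)  # 92.9
```

In other words, each second of compute covered roughly a minute and a half of speech in this run.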

Speculative Decoding

Here is an intriguing combination: Distil-Whisper can serve as an assistant model for speculative decoding alongside the primary Whisper model.

Crucially, speculative decoding is designed to return the same output that Whisper would produce on its own, in roughly half the time. The speed-up therefore should not come at the cost of output quality.


%load_ext autotime
## Teacher Model
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor,pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

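model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

## Student Model (Assistant)
from transformers import AutoModelForCausalLM
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

## Add the student (assistant) to the teacher pipeline
## NOTE: the original listing is cut off here; every argument except
## generate_kwargs is an assumption based on the usual ASR pipeline setup.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)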
Now, let’s run the tests

result = pipe(audiofile1_60s)
# time: 4.06 s (started: 2024-01-16 14:23:21 -05:00)

result = pipe(audiofile2_2hr07min)
# time: 3min 59s (started: 2024-01-16 14:23:27 -05:00)

result = pipe(audiofile2_2hr30min)
# time: 5min 45s (started: 2024-01-16 14:27:26 -05:00)

The results show that speculative decoding ran a little more than twice as fast as the base Whisper model.

In Test-3, base Whisper took 13 minutes and 51 seconds, while speculative decoding finished the same task in 5 minutes and 45 seconds, about 2.4 times faster. Note that the baseline used whisper-large-v3 while this run used whisper-large-v2, so the comparison is indicative, not exact.

These gains suggest that speculative decoding could provide a much quicker alternative for future applications without compromising on quality or reliability.


Whisper.cpp

Whisper.cpp is a port of OpenAI’s Whisper model in C/C++ for high-performance inference of the Whisper automatic speech recognition (ASR) model. There is a Python binding; however, I found it a bit slower, so I used the native compilation for this test.

To run whisper.cpp, we need to compile the code natively for the platform and GPU, as shown below.

!git clone https://github.com/ggerganov/whisper.cpp
%cd whisper.cpp

# Install g++ (C++ compiler)
!apt-get install g++

# Download the model
!bash ./models/download-ggml-model.sh large-v3

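# make with GPU (using cuBLAS)
!make clean
# The build command itself is missing from the original listing; this is the
# cuBLAS build flag whisper.cpp documented in early 2024 (an assumption here).
!WHISPER_CUBLAS=1 make -j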

Next, ensure the test data is converted to 16 kHz, 16-bit PCM WAV format.

## convert to 16 kHz 16-bit PCM WAV (add -ac 1 to force a mono channel)
!ffmpeg -i ../ted_60.wav -acodec pcm_s16le -ar 16000 ted_60_2.wav -y
!ffmpeg -i ../sam_altman_lex_podcast_367.flac -acodec pcm_s16le -ar 16000 sam_altman_lex_podcast_367_2.wav -y
!ffmpeg -i ../4469669.mp3 -acodec pcm_s16le -ar 16000 4469669.wav -y
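Before starting a long run it can be worth confirming that the conversion produced what was intended. The sketch below uses Python's standard wave module to read a file's header. The helper names are hypothetical, and the demo writes its own one-second silent file so that it runs without the test audio.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return (channels, sample_rate_hz, bits_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8

def is_16k_16bit(path):
    """True when the file is 16 kHz, 16-bit PCM, the target of the ffmpeg commands."""
    _, rate, bits = describe_wav(path)
    return rate == 16000 and bits == 16

# Demo: write one second of 16 kHz, 16-bit mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes per sample = 16 bits
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(describe_wav(demo_path))  # (1, 16000, 16)
print(is_16k_16bit(demo_path))  # True
```

Calling describe_wav on ted_60_2.wav after the conversion would show whether the file is mono or stereo, since the commands above do not change the channel count.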

Okay, we are now ready to run the tests.

## TEST-1
!./main -m models/ggml-large-v3.bin -f ted_60_2.wav
# time: 5.05 s (started: 2024-01-16 14:58:35 -05:00)

## TEST-2
!./main -m models/ggml-large-v3.bin -f 4469669.wav
# time: 4min 27s (started: 2024-01-16 14:59:04 -05:00)

## TEST-3
!./main -m models/ggml-large-v3.bin -f sam_altman_lex_podcast_367_2.wav
# time: 8min 53s (started: 2024-01-16 15:03:31 -05:00)

Whisper.cpp showed substantial performance improvements compared to Whisper Base.

Specifically, during Test-3, Whisper Base required 13 minutes and 51 seconds, whereas whisper.cpp completed the same task in only 8 minutes and 53 seconds, a time saving of about 36%.

This suggests that whisper.cpp may be more efficient for similar tasks going forward, especially on IoT devices.




Step.3 Summarise and Compare Test Results

Short File: ted_60_2.wav (TED talk)

  • Whisper base: 5.96 s
  • Faster-whisper: 4.25 s
  • Whisper.cpp: 5.05 s
  • Distilled-Whisper: 1.04 s
  • Speculative Decoding: 4.06 s

Long File-1: 4469669.mp3 (2 hr 07 min investor conference call)

  • Whisper base: 9 min 16s
  • Faster-whisper: 7 min 45s
  • Whisper.cpp: 4 min 27s
  • Distilled-Whisper: 1 min 22s
  • Speculative Decoding: 3 min 59s

Long File-2: sam_altman_lex_podcast_367.flac (2 hr 30 min Sam Altman podcast)

  • Whisper base: 13 min 51s
  • Faster-whisper: 10 min 5s
  • Whisper.cpp: 8 min 53s
  • Distilled-Whisper: 1 min 49s
  • Speculative Decoding: 5 min 45s


The results demonstrate that Distilled-Whisper significantly outperforms Whisper Base in terms of speed, transcribing the 2.5-hour audio file in just 1 minute and 49 seconds compared to Whisper Base’s 13 minutes and 51 seconds. Speculative decoding offers a good balance between speed and accuracy for transcription tasks, since it is designed to keep the main model’s output, while Faster-Whisper offers an improved Voice Activity Detection (VAD) implementation and falls between Whisper Base and Whisper.cpp in terms of speed on the long files.
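The relative speed-ups are easier to compare as ratios against the baseline. The sketch below parses the Long File-2 timings exactly as reported above and divides the Whisper base time by each one. The `to_seconds` helper is a hypothetical convenience written for this example.

```python
import re

def to_seconds(text):
    """Parse strings such as '13 min 51s' or '5.96 s' into seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+(?:\.\d+)?)\s*s", text)
    total = 60 * int(minutes.group(1)) if minutes else 0
    return total + (float(seconds.group(1)) if seconds else 0.0)

# Long File-2 timings as reported in the summary.
reported = {
    "Whisper base": "13 min 51s",
    "Faster-whisper": "10 min 5s",
    "Whisper.cpp": "8 min 53s",
    "Distilled-Whisper": "1 min 49s",
    "Speculative Decoding": "5 min 45s",
}

baseline = to_seconds(reported["Whisper base"])
speedups = {name: round(baseline / to_seconds(t), 1) for name, t in reported.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

On this file the ratios come out at about 1.4x for Faster-whisper, 1.6x for Whisper.cpp, 2.4x for speculative decoding and 7.6x for Distilled-Whisper.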

I hope you find these results informative and interesting…

Have a nice day!

