How I Built a Cross-Platform Transcription Toolkit That Runs on My Mac and My PC
I had a problem. Over the past few years, I bought dozens of online courses. Hundreds of hours of video across business strategy, marketing, web development, and course creation. All stored on an external drive. All impossible to search.
I wanted to build a personal knowledge base where I could search across all that content semantically — ask a question and get relevant answers from any course I've ever taken. The first step was getting all those videos transcribed.
The challenge: I have two machines, and they're very different.
- Mac: Apple M2 Max, 32GB unified memory, Metal GPU
- Windows PC: Intel i7-3970K, NVIDIA GeForce GTX 1080 Ti (11GB VRAM), CUDA
I needed a transcription toolkit that could run on both, use the GPU on each, and process files in bulk with minimal babysitting.
Why Not Just Use a Service?
There are plenty of transcription services out there. Upload a video, get a transcript back. Some are even free for short clips.
But I had over 1,000 video files. At typical service pricing, that gets expensive fast. More importantly, I didn't want to upload hundreds of gigabytes of purchased course content to someone else's servers.
Running transcription locally solved both problems. No cost per file, no privacy concerns, and once the model is downloaded, it works offline.
Choosing the Engine
OpenAI's Whisper is the go-to open-source transcription model. There are two main ways to run it:
openai-whisper (the original): Built on PyTorch. Supports CUDA on NVIDIA GPUs. Does not support Metal/MPS on Apple Silicon. On a Mac, it falls back to CPU, which is painfully slow.
faster-whisper (CTranslate2): A reimplementation that's up to 4x faster than the original and uses less memory. The key advantage for my use case: CTranslate2 is optimized for Apple Silicon out of the box (it runs on the CPU through Apple's Accelerate and NEON code paths, no configuration required), so a Mac gets real acceleration without needing Metal support at all. On the Windows side, it supports CUDA just like the original.
I tested both. The original Whisper on the Mac's CPU processed one video in roughly 17 hours. faster-whisper on the same Mac processed the same video in about 40 minutes. That's not a small difference. That's the difference between finishing this project and giving up.
faster-whisper was the obvious choice.
The Toolkit
I built a single Python script called transcribe.py that handles the whole pipeline:
- Scan a directory (and subdirectories) for video and audio files
- Process each file through faster-whisper
- Save the transcript as a plain text file next to the original
- Optionally delete the source file after transcription (to free up disk space)
- Log everything for review
The key design goal was portability. The same script runs on both machines with zero code changes. The only difference is the hardware it runs on, and faster-whisper handles that automatically.
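The per-file step is small enough to sketch. This is a minimal illustration (the helper name transcribe_file is mine, not from the actual transcribe.py), assuming faster-whisper's WhisperModel interface, where transcribe() returns a lazy segment generator plus an info object:

```python
from pathlib import Path

def transcribe_file(model, src: Path) -> Path:
    """Run one file through the model and write the text next to it.

    `model` is expected to expose faster-whisper's interface:
    transcribe() returns (segments, info), where segments is a lazy
    generator -- consuming it is what actually does the work.
    """
    segments, _info = model.transcribe(str(src), beam_size=5)
    text = "".join(segment.text for segment in segments)
    out = src.with_suffix(".transcript.txt")
    out.write_text(text.strip() + "\n", encoding="utf-8")
    return out

# On both machines the model is created the same way; "auto" selects
# CUDA when available and falls back to CPU otherwise:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3", device="auto")
```

That device="auto" line is the whole portability story: the same two lines pick the GTX 1080 Ti on the PC and the optimized CPU path on the Mac.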
Model Selection
Whisper comes in several sizes: tiny, base, small, medium, large, and large-v3. Larger models are more accurate but slower and use more memory.
I went with large-v3, the largest model available. It's slower and uses more memory, but these aren't polished studio recordings. A lot of the course content is group conference calls with multiple speakers, varying audio quality, and different accents. The smaller models choked on that. Large-v3 handles it reliably. It handles technical terminology well, which matters when you're transcribing courses about APIs, Docker, and instructional design.
The large-v3 model file is about 2.9GB. Download once, use forever.
File Discovery
The script walks through the target directory recursively. It recognizes common video formats (mp4, mkv, mov, avi, webm) and audio formats (mp3, wav, m4a, flac, ogg). For each file, it checks if a .transcript.txt already exists and skips it if so, making the process resumable.
This mattered a lot. With 1,000+ files, you don't want to restart from scratch every time something interrupts the run.
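The discovery pass can be sketched in a few lines of pathlib. This is an illustration of the approach rather than the script's exact code; find_pending and MEDIA_EXTS are names I've made up here:

```python
from pathlib import Path

MEDIA_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm",
              ".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def find_pending(root: Path) -> list[Path]:
    """Recursively collect media files that don't have a transcript yet."""
    pending = []
    for path in sorted(root.rglob("*")):
        if path.suffix.lower() not in MEDIA_EXTS:
            continue
        transcript = path.with_suffix(".transcript.txt")
        if transcript.exists():
            continue  # already transcribed -- this is what makes reruns resumable
        pending.append(path)
    return pending
```

Because finished files are filtered out up front, killing the run and restarting it costs nothing but the scan itself.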
Prefix Handling
Course videos are usually organized in folders like:
courses/
  Course Name 01/
    Module 1/
      01-Introduction.mp4
      02-Getting Started.mp4
  Course Name 02/
    ...
When you transcribe all these files into a flat folder or a vector database, you lose the folder structure context. You end up with hundreds of files named 01-Introduction.transcript.txt from different courses, with no way to tell them apart.
I added a --prefix-depth auto option that automatically detects how deep the course folders are and prepends the folder names to each transcript. So 01-Introduction.mp4 becomes Course Name 01_Module 1_01-Introduction.transcript.txt. Now every transcript is unique and identifiable.
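The naming logic behind this is straightforward path manipulation. A minimal sketch with a fixed depth (the real --prefix-depth auto detection isn't shown, and prefixed_name is an illustrative name, not the script's):

```python
from pathlib import Path

def prefixed_name(root: Path, src: Path, depth: int = 2) -> str:
    """Build a transcript filename that keeps up to `depth` parent folder
    names, so '01-Introduction' stays unique across courses."""
    # Parent folders between the scan root and the file, deepest `depth` of them.
    parents = src.relative_to(root).parts[:-1][-depth:]
    return "_".join([*parents, src.stem]) + ".transcript.txt"
```

With depth 2, courses/Course Name 01/Module 1/01-Introduction.mp4 maps to the same Course Name 01_Module 1_01-Introduction.transcript.txt shown above.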
The Windows Side
Setting up the Windows machine was more involved. The main challenge: CUDA support requires specific versions of Python, PyTorch (or CTranslate2), and NVIDIA drivers that all play nicely together.
Some things I learned the hard way:
Python 3.13 doesn't work. It's too new for the current CUDA wheel ecosystem (the PyTorch and CTranslate2 builds alike). Python 3.12 is the sweet spot.
NVIDIA's CUDA toolkit alone isn't enough. You also need the cuBLAS library. A simple pip install nvidia-cublas-cu12 fixes the "DLL not found" errors that would otherwise crash everything.
PATH matters. The CUDA DLLs need to be findable at runtime. Adding nvidia\cublas\bin to the system PATH in the batch script wrapper resolves this.
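One way to handle the PATH step from inside Python, rather than the batch wrapper, is to locate the bin directories that the pip-installed nvidia-* packages drop into site-packages and prepend them before importing faster-whisper. A sketch under the assumption that the packages follow the usual nvidia/<lib>/bin layout (add_nvidia_dlls_to_path is my name for it):

```python
import os
import site
from pathlib import Path

def add_nvidia_dlls_to_path() -> list[str]:
    """Prepend pip-installed nvidia/*/bin directories (cuBLAS, cuDNN)
    to PATH so CTranslate2 can find the DLLs at runtime on Windows.
    Returns whatever directories were found (possibly none)."""
    found = []
    for base in site.getsitepackages():
        nvidia = Path(base) / "nvidia"
        if not nvidia.is_dir():
            continue
        for bin_dir in sorted(nvidia.glob("*/bin")):
            found.append(str(bin_dir))
    if found:
        os.environ["PATH"] = os.pathsep.join(found + [os.environ.get("PATH", "")])
    return found
```

Calling this once at the top of the script makes it a no-op on the Mac (no nvidia packages, nothing found) and fixes the DLL lookup on the PC.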
Once everything was configured, the GTX 1080 Ti (despite being a 2017 GPU) handled large-v3 without issue. 11GB of VRAM is plenty for this model.
The Mac Side
The Mac setup was trivial by comparison. Install faster-whisper, point it at the directory, and go. The Apple Silicon optimizations kick in automatically. The M2 Max handled large-v3 easily; 32GB of unified memory means it never runs out of room.
The bottleneck on Mac was sheer quantity. With hundreds of files to process, even at 40 minutes each, it takes days. I let it run in the background and checked progress periodically.
What This Enabled
With all the transcripts generated, I loaded them into a vector database (Qdrant, running in a Docker container). Now I can search across every course I've ever taken using natural language:
"Show me everything about email marketing funnel optimization"
"Find discussions about pricing online courses"
"What do these courses say about SEO for membership sites?"
It's like having a personal research assistant that's read every course I own. When I'm writing blog posts, building course content, or solving a client problem, I can pull in relevant insights from my entire learning library.
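The mechanic underneath that search is simple: embed each transcript chunk as a vector, embed the query the same way, and rank by cosine similarity. Qdrant and a real sentence-embedding model do this at scale; here's a dependency-free toy that swaps in a bag-of-words "embedding" just to show the ranking step (embed, cosine, and search are illustrative names):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real pipeline would use a
    sentence-embedding model and store the vectors in Qdrant."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank transcript names by similarity to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:top_k]
```

The production version replaces the Counter with dense vectors, but the query flow — embed, compare, return the closest transcripts — is the same.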
For Anyone Considering This
If you're sitting on a library of video content that you can't search, local transcription is worth the setup effort. The tools are free, the models are good enough for most purposes, and once it's running, it's mostly hands-off.
The main investment is time. If you have hundreds of files, plan for this to run for days. Start with the files you need most. Let it work through the rest in the background.
And if you have both a Mac and a Windows machine with a GPU, run them in parallel. Split the files in half and let both machines work. That's what I did.
If you're building systems to organize and leverage your knowledge, or if you want to talk about automating parts of your course business, book a call.