I Accidentally Deleted 25,000 Records From My Vector Database. Here's How I Recovered.

4 min read · Tools & Automation

I've been building a personal knowledge base. The idea is simple: transcribe over a thousand course videos, chunk the text, and load everything into a vector database for semantic search.

The setup uses Qdrant running in a Docker container on my Mac. The upload script chunks each transcript into roughly 500-token segments with a 50-token overlap. Each chunk stores metadata including the filename, file path, and chunk index.
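
To make the chunking concrete, here's a minimal sketch of that step. It isn't the actual upload script: tokens are approximated by whitespace splitting, and the exact payload keys (filename, file_path, chunk_index) are assumptions I'll reuse in the later snippets.

```python
from pathlib import Path

def chunk_transcript(path: str, chunk_size: int = 500, overlap: int = 50):
    """Split a transcript into ~500-token chunks with a 50-token overlap.

    Tokens are approximated by whitespace words here; a real script would
    likely use a proper tokenizer.
    """
    words = Path(path).read_text().split()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        chunks.append({
            "text": " ".join(words[start:start + chunk_size]),
            "payload": {
                "filename": Path(path).name,
                "file_path": str(path),
                "chunk_index": index,
            },
        })
    return chunks
```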

Vector databases store text as numerical embeddings. You search them with natural-language questions rather than exact text matches. Ask "how do you handle objections in sales calls" and it returns relevant chunks even if those exact words never appear in the source material.
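
In practice, a search looks something like the sketch below. The collection name is a placeholder and the embedding model is just an example; the one hard requirement is that the query is embedded with the same model that produced the stored vectors.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Assumption: a sentence-transformers model; whatever model embedded the
# transcripts at upload time must also embed the query.
model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(host="localhost", port=6333)

query_vector = model.encode("how do you handle objections in sales calls").tolist()

hits = client.search(
    collection_name="course_transcripts",  # placeholder collection name
    query_vector=query_vector,
    limit=5,
    with_payload=True,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["filename"], hit.payload["chunk_index"])
```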

Everything was working well, and the upload was steadily working through files. About 400 files into a batch of 9,879, I decided to run a deduplication script to clean things up.

The Mistake

The dedup script checked for duplicate filenames. If two records had the same filename, it kept one and deleted the other.

Sounds reasonable. Except I have dozens of courses, and they all have files named "01-Introduction.mp4."

Same filename, completely different course content. One by one, the dedup script removed them.
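
Reconstructed after the fact, the logic looked something like this sketch (names are approximate; the bug is that the dedup key is a single field):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)
COLLECTION = "course_transcripts"  # placeholder collection name

seen, to_delete, offset = set(), [], None
while True:
    points, offset = client.scroll(
        collection_name=COLLECTION,
        limit=1000,
        offset=offset,
        with_payload=True,
    )
    for point in points:
        key = point.payload["filename"]  # the bug: filename alone as the key
        if key in seen:
            to_delete.append(point.id)   # "01-Introduction.mp4" from another course
        else:
            seen.add(key)
    if offset is None:
        break

client.delete(
    collection_name=COLLECTION,
    points_selector=models.PointIdsList(points=to_delete),
)
```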

The damage report: 25,666 records deleted. Most were not duplicates at all. They were unique course transcripts that happened to share common filenames with other courses.

Gone.

How I Caught It

I noticed the collection size had dropped dramatically. What was supposed to be growing was shrinking. I checked the dedup script's logs and saw it was removing records by filename alone, with no path check.

A file called "01-Introduction.mp4" in a copywriting course and a file called "01-Introduction.mp4" in an SEO course are completely different content. The script treated them as the same file.

The Recovery

I had a full backup export from before the dedup ran. That backup contained 3,846 points at 29.2 MB. It wasn't everything (the upload was still in progress), but it was enough to start over without losing progress.
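
One way to produce an export like that is to scroll every point out to a JSON file with its vector and payload. A rough sketch, with the same placeholder collection name as above:

```python
import json
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

points, offset = [], None
while True:
    batch, offset = client.scroll(
        collection_name="course_transcripts",  # placeholder collection name
        limit=500,
        offset=offset,
        with_payload=True,
        with_vectors=True,
    )
    points.extend({"id": p.id, "vector": p.vector, "payload": p.payload} for p in batch)
    if offset is None:
        break

with open("backup.json", "w") as f:
    json.dump(points, f)
print(f"Exported {len(points)} points")
```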

I wiped the collection completely and began re-uploading from scratch.

The re-upload is still running. With nearly 10,000 files to process, it takes time. But every record going in is clean, verified, and won't be touched by a dedup script again.

Three Rules I Learned

Always dedup by a compound key. Filename plus path. Never filename alone. Two files with the same name in different directories are almost certainly different content.
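
In code, the fix is a small change to the key. A sketch with the same assumed payload fields; note that because these records are chunks, the chunk index has to be part of the key too, otherwise the dedup would still collapse every chunk of a file into one:

```python
def dedup_key(payload: dict) -> tuple:
    """Compound key: path + filename + chunk index (assumed field names)."""
    return (payload["file_path"], payload["filename"], payload["chunk_index"])

def find_duplicates(points) -> list:
    """Return the IDs of true duplicates, keeping the first record per key."""
    seen, dupes = set(), []
    for point in points:
        key = dedup_key(point.payload)
        if key in seen:
            dupes.append(point.id)
        else:
            seen.add(key)
    return dupes
```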

Always back up before bulk operations. No exceptions. I ran the dedup against my live collection without a recent backup. If I hadn't exported the collection the week before, the loss would have been permanent.

Test on a small sample first. I should have run the dedup script against 50 records and manually verified the results before letting it loose on the entire collection. Ten minutes of testing would have revealed the filename-only logic flaw.
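
A dry run along those lines is cheap: pull a small sample, report what would be deleted, and delete nothing. A sketch, reusing find_duplicates from the snippet above:

```python
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

sample, _ = client.scroll(
    collection_name="course_transcripts",  # placeholder collection name
    limit=50,
    with_payload=True,
)
for point_id in find_duplicates(sample):
    payload = next(p.payload for p in sample if p.id == point_id)
    print("WOULD DELETE:", payload["file_path"], "chunk", payload["chunk_index"])
```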

The Broader Lesson

Vector databases are powerful but unforgiving. There's no undo button. No transaction log. No way to roll back a bulk delete. Once records are removed, they're gone.

This is different from relational databases where you can wrap operations in transactions and roll back if something goes wrong. Vector databases are optimized for search performance, not for data safety.

Bulk operations against any database deserve caution. Run them against a test collection first. Verify the results by hand. Only then point them at production.

The backup strategy saved me here. I got lucky. The next time I run a bulk operation against my knowledge base, I won't rely on luck.

If you're building systems to organize and leverage your knowledge, or if you want to talk about automating parts of your course business, book a call.
