go-bif-examine: Which file contains that character's line?

Dec 30, 2023

Overview

When I was young (probably mid to late 2001) Baldur's Gate II was one of a handful of Dad's games I played while visiting his house. I wasn't very good at it, but the main music theme, combat music, the narrator telling me to gather my party, and Minsc ordering Boo to "go for the eyes" are all seared into my memory.

The goal of this project was initially to extract the audio assets from the game so I could find the character dialog files and post them at internet friends in Discord. I thought I'd just use the KEY and BIF file information documented by gibberlings3/iesdp, listen through all the files, and determine which ones I wanted. However, given that there are ~17000 audio assets, that'd take a while, and they all have vague names like MINSC01 or OH92885. I figured that since AI was the hot thing in the industry, there had to be something available that I could run locally to determine what was said in each file. It turns out that openai/whisper does exactly what I want!

Architecture

The rapidly ballooning scope meant this turned into a full-blown project that'd take me a few weeks of nights and weekends, but would be a good demonstrator of what I can do with Go. I also decided I'd use it as an opportunity to learn gRPC and to do some fancy S3 things.

Things Explicitly out of scope

Since this is a personal project that's just for fun, there are several tasks that are out of scope. If this were for a work project or otherwise a revenue generating application, all of these things would be in scope.

Encryption between services.
Encrypted DB connection.
HTTPS for S3.
Using AWS S3 instead of minio.
- Using minio mostly so I don't have to pay any money for a hobby project. I also like self-hosting stuff when possible/reasonable.
Not using exec for the whisperer service.
- Since the interface uses gRPC and gRPC supports Python, it'd be More Correct™ to rewrite it in Python and call Whisper directly, but I don't want to write Python.
Initial test coverage, TDD, etc.
- I'm learning how to use some technologies here. Trying to write tests for something when I have no idea how it works is kinda hard. I'm also doing this for fun, not work. I'm more interested in making progress instead of taking time to do it right.
- The KEY and BIF files will need some example files that aren't covered by someone else's copyright.
- I'll eventually™ come back and write a lot of tests, but I'm not ready to do that.
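
The exec approach mentioned in the list above could look roughly like the following Go sketch. It is an illustration under assumptions, not the whisperer service's real code: it assumes the `whisper` command-line tool from openai/whisper is installed and on PATH, the helper names whisperArgs and runWhisper are hypothetical, and MINSC01.wav stands in for an already extracted audio file.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// whisperArgs builds the argument list for the whisper CLI:
// the audio file, the model name, JSON output, and an output directory.
func whisperArgs(audioPath, outDir, model string) []string {
	return []string{
		audioPath,
		"--model", model,
		"--output_format", "json",
		"--output_dir", outDir,
	}
}

// runWhisper shells out to whisper and returns its combined output on failure.
func runWhisper(ctx context.Context, audioPath, outDir, model string) error {
	cmd := exec.CommandContext(ctx, "whisper", whisperArgs(audioPath, outDir, model)...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("whisper failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Skip quietly when whisper is not installed, so the sketch still runs.
	if _, err := exec.LookPath("whisper"); err != nil {
		fmt.Println("whisper not found on PATH; skipping")
		return
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	if err := runWhisper(ctx, "MINSC01.wav", "out", "base"); err != nil {
		fmt.Println(err)
	}
}
```

The trade-off is the one described above: shelling out keeps everything in Go at the cost of a process launch per file and results passed back through files on disk.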

Components

examine-fe
- Web frontend to act as the main interface to go-bif-examine.
- Shelved for now because gRPC between the frontend and backend wasn't as easy as I thought it would be and was requiring me to put more effort into something that wasn't Go.
go-bif-examine-cli
- A cobra based CLI for interacting with go-bif-examine.
- Originally intended to just be a development tool, but turned into the main interface when I realized the web UI was going to be more effort than I thought.
- Uses gRPC to connect to go-bif-examine.
go-bif-examine
- The main service:
  1. Manages the postgresql DB.
  2. Manages S3 objects.
  3. Parses received KEY and BIF files, stores metadata in DB, stores bif files in S3.
  4. When an instance of whisperer asks for a job, gets the metadata for an audio resource that hasn't yet been processed, generates a presigned GetObject URL, and responds.
  5. Stores results of whisperer.
- A monolith for simplicity; no need to break it out into microservices yet.
whisperer
- The uncreatively named service that calls whisper.
- Requests job from go-bif-examine.
- Uses job information and presigned URL to only select the range of bytes from the bif file needed. This saves go-bif-examine from needing to save each resource as separate objects in S3.
- Can run multiple instances in parallel.
postgresql
minio (for S3)

Results

I've found Whisper to be amazingly accurate on the majority of the files I've fed it. I can't share the audio for copyright reasons, but this screencap of some of Minsc's lines has a few of the memorable ones. They should be easy enough to find on YouTube.

Listening to some of these files, it does very well even for characters with strong non-English accents. It doesn't always spell the character's name right, but it does a reasonably good job of phonetically capturing the sound.

However, I've found that files without any dialog can often produce strange output; none of these files say "Thanks for watching!". Maybe the Whisper team used some YouTube videos in their training data? I'd need to read their paper to be sure, but to be honest, I'm not that interested in the nitty gritty details of how it works.

Whisper also gives some additional metadata, though I haven't found documentation describing it yet (mostly because I haven't tried that hard). I'm assuming the no_speech_prob field tells us the model's confidence that there isn't any speech in the audio. Of the records shown in the last image, some have a reasonably high value, but others are kinda low.
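
If that reading of no_speech_prob is right, one way to flag the "Thanks for watching!" style results would be to parse Whisper's JSON output and check the per-segment values. The sketch below is illustrative only: it assumes the JSON has a top-level "segments" array whose items carry "text" and "no_speech_prob", the 0.6 cutoff is an arbitrary value picked for the example, the function name likelySilent is hypothetical, and the sample JSON is synthetic rather than real output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// segment holds the two per-segment fields this sketch cares about.
type segment struct {
	Text         string  `json:"text"`
	NoSpeechProb float64 `json:"no_speech_prob"`
}

// transcript is a partial view of the assumed Whisper JSON output.
type transcript struct {
	Text     string    `json:"text"`
	Segments []segment `json:"segments"`
}

// likelySilent reports whether every segment's no_speech_prob is at or above
// threshold. A transcript with no segments at all is also treated as silent.
func likelySilent(raw []byte, threshold float64) (bool, error) {
	var t transcript
	if err := json.Unmarshal(raw, &t); err != nil {
		return false, fmt.Errorf("decoding whisper output: %w", err)
	}
	for _, s := range t.Segments {
		if s.NoSpeechProb < threshold {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Synthetic example, not real Whisper output.
	raw := []byte(`{"text":" Thanks for watching!","segments":[{"text":" Thanks for watching!","no_speech_prob":0.87}]}`)

	silent, err := likelySilent(raw, 0.6)
	if err != nil {
		panic(err)
	}
	fmt.Println("likely silent:", silent)
}
```

Since some of the dialog-free records reportedly have low values, a filter like this would only catch part of the strange output; it is a first pass, not a fix.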