
Whisper deployed on Triton, LLMs coming

See also: Whisper on Triton documentation

OpenAI Whisper is a tool for speech transcription. It works well and has potential applications in many different research and non-research use cases. Using it isn’t too hard - if you can install it and if you have a GPU. Often, installation can become a big barrier, especially for “just testing”.

Luckily, we have a cluster with GPUs and a way to provide software for researchers. We’ve made Whisper available on the cluster as a module, so it’s trivial to use it for any audio data you may have. All one needs to do is:
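After loading the module (its exact name is in the documentation), it is roughly a single command like the sketch below - the resource numbers, model name, and the $WHISPER_MODELS variable here are illustrative; see the Whisper on Triton documentation for the exact command:

    srun --time=01:00:00 --mem=8G --gpus=1 \
        whisper my-audio.mp3 \
        --model large-v3 --model_dir $WHISPER_MODELS --output_dir transcripts/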

It might look complicated, but all you need to do is copy and paste. The first words request the resources, the middle specifies your audio file, and the last part gives some standard options that make it do things like use our pre-downloaded model files. Yes - this still requires knowledge of how to use a cluster in general, but once you’ve got that knowledge, transcribing audio is trivial. We have a self-study course on cluster usage, and users can always drop by and ask us for help, for example at our daily garage.

See the Whisper on Triton documentation for more information on usage.

We are also preparing a way to do this through the cluster web interface Open OnDemand, which will remove most of the need to know how a cluster works and make the tool even more accessible to other communities.

We hope to make other tools available like this.

Whisper is just one of the latest tools, but you’ve probably noticed that large language models are very popular these days. There are, in fact, some that can run locally on our own cluster, and our goal is to deploy more of these so that they can be easily tested and used. The intention isn’t to make a replacement for existing LLM services, but to make internal use for testing, research, and development easier.

Local installs have various benefits, including lower cost (since we already own the hardware), being able to ensure reproducibility longer-term (since models are locally downloaded and preserved), and being able to use them without various registrations. The downside is that the most popular models aren’t available for local use.

Contact us if you need other models deployed, or if you have trouble using what’s already out there. We are still in an early phase, and there will probably be some difficulties in availability, accessibility, and reusability. Contact us early if you notice anything that’s not right. We both help with installing things and with using them as a research engineer partner.

It’s clear that artificial intelligence and machine learning tools will become ever more critical tools for research. The difficulty in deploying and using them could become a barrier, and that is where Aalto Scientific Computing comes in. It’s our goal to make sure the infrastructure that researchers need is ready and able to be used by everyone, not just those with classic HPC experience.

Here we go over some implementation details, which may help others who want to deploy similar things on their own clusters. If you just want to use things, you don’t need to read on.

We installed Whisper in a container, so that all dependencies are packaged together and things are portable. The model files themselves are not included in the container, but mounted in. We try to find options that allow one to specify the model and the model directory, so that the user can try out different models without downloading each one. The Lmod module file prints out some help when loaded.
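For example, OpenAI Whisper’s own CLI already has --model and --model_dir flags, so pointing it at a shared, pre-downloaded model directory is roughly the following (the path here is illustrative):

    whisper interview.mp3 --model large-v3 --model_dir /path/to/shared/whisper-models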

We’ve got two versions installed: normal Whisper, and Whisper-diarization (which can identify speakers in the transcript).

Whisper and diarization both have multiple different implementations. It’s a bit of guesswork to figure out which one is the easiest to get running and works the best (not in terms of transcript quality, but ease of deployment in a container and with local models). This led to a switch to another diarization implementation midway, since the new one is more actively developed and seems overall slightly better. A lot of the work was fortunately transferable to the new implementation.

There were the usual issues with getting the right dependencies into a container and getting the GPUs to work there. This is pretty standard by now.

Most implementations of Whisper want to download models at runtime. This might make sense for a general user, but doesn’t really make sense on a cluster. Depending on the implementation, getting it to use local models is not always trivial. Since GPU execution of diarization uses several models at once, there doesn’t seem to be a simple way to have it use local models at all without changing the code. It also required some sleuthing to find where exactly the models are downloaded. If the code uses Hugging Face, its cache-related environment variables can be useful.
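For reference, these are the standard Hugging Face cache and offline variables (the variable names come from huggingface_hub; the path is illustrative):

    export HF_HOME=/path/to/shared/hf-cache    # base directory for all Hugging Face caches
    export HF_HUB_CACHE=$HF_HOME/hub           # where downloaded hub models end up
    export HF_HUB_OFFLINE=1                    # never download, only use what is already local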

Making a module that is easy and practical for users without losing options is usually a bit tricky: we want users to be able to do anything, for “the right thing” to happen automatically, and to not build some opaque framework to make it happen. Singularity-wrapper fortunately helps quite a bit here, doing a lot of the background work such as binding directories and setting GPU flags cleanly, without users having to care about it, while still giving the option to run the container straight through Apptainer/Singularity if finer control is necessary.
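For the finer-control case, the direct call looks roughly like this sketch (the image name and mount paths are illustrative) - essentially what the wrapper otherwise sets up for you:

    apptainer exec --nv \
        --bind /path/to/shared/whisper-models:/models:ro \
        whisper.sif \
        whisper interview.mp3 --model large-v3 --model_dir /models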

Testing whether the containers work is somewhat annoying. Diarization in particular saves a lot of cache files all over the place, which all need to be purged when testing GPU runs. Otherwise the GPU will stay idle, since everything it would do is already in the cache. This also affects clean-up after users run the code.
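In practice a “cold” test run means clearing the usual cache locations first - a sketch using the common defaults, which may differ per implementation:

    rm -rf ~/.cache/whisper        # OpenAI Whisper's downloaded models
    rm -rf ~/.cache/huggingface    # Hugging Face hub cache used by the diarization models
    rm -rf ~/.cache/torch          # torch hub checkpoints
    # ...plus any per-run temporary output directories the implementation creates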

A minor inconvenience for us (but possibly a large one for users) is that the syntax of each Whisper CLI implementation tends to differ slightly. This makes swapping between implementations slightly annoying, since you have to check the flag syntax every time.
