Diarization and transcription

What if you have a speech recording with no transcript? We use pyannote for diarization (detecting when different people are talking in a recording) and vosk for transcription (creating a text on the basis of a speech recording). Automatic transcription is widespread, and a lot of different tools are available. We use these particular tools because they allow us to process our files locally (so that we don’t have to upload our possibly sensitive recordings to someone else’s server), they are free and open-source, they allow batch processing, and they are customizable.

There are two python scripts that we use for creating transcripts of speech recordings:

dt.py is designed to take a sound file, separate the speech by speaker, and transcribe it, producing a textgrid with a tier for each speaker. It is appropriate for transcribing interviews or anything else, even if you think there is only one speaker.
transcribe_wav.py is designed to take a sound file and transcribe it without considering the possibility that there are different speakers. It is appropriate for recordings of one speaker, or if you are having trouble installing pyannote or performing diarization.
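
To make the idea of "a tier for each speaker" concrete, here is a small Python sketch that files each timed word under whichever speaker's diarization segment overlaps it most. It only illustrates the concept. The function name and the tuple layouts are hypothetical, and dt.py may combine diarization and transcription quite differently.

```python
from collections import defaultdict

def words_by_speaker(segments, words):
    """Group words into per-speaker tiers.

    segments: list of (start, end, speaker) tuples from a diarizer
    words:    list of (start, end, text) tuples from a transcriber
    Both layouts are hypothetical, chosen for illustration.
    """
    tiers = defaultdict(list)
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for s_start, s_end, speaker in segments:
            # Length of the time interval shared by the word and the segment
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        if best_speaker is not None:
            tiers[best_speaker].append((w_start, w_end, text))
    return dict(tiers)

segments = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (2.1, 2.5, "hi"), (2.6, 3.0, "there")]
for speaker, tier in words_by_speaker(segments, words).items():
    print(speaker, [text for _, _, text in tier])
```

Words that fall outside every segment are simply dropped in this sketch.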

We use these scripts for English recordings, but they can be used for any language that has a language model available. The available language models are listed on the vosk models page.

If you are using diarization, you will likely want to use computer #5. Otherwise, you can use any of the Linux computers in the phonetics lab (in person or remotely).

To set up pyannote for the first time on a particular computer, enter these commands (you only need to do this once on each lab computer where you want to use pyannote for diarization):

pip3 install pyannote.audio
pip3 install vosk

We formerly needed to install these in a conda environment but we don’t need to do that as of February 2025.

Vosk and Whisper are both systems for transcribing audio. Whisper is installed like this:

pip3 install -U openai-whisper
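
After installing, it can be handy to check which packages Python can actually find before starting a long job. The sketch below is one way to do that with the standard library. The list of module names is our assumption about what the scripts import (torch is pulled in by pyannote.audio, and the openai-whisper package is imported as whisper).

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "pyannote") is absent
            found = False
        if not found:
            missing.append(name)
    return missing

needed = ["torch", "pyannote.audio", "vosk", "whisper"]
print("Missing:", missing_modules(needed) or "nothing")
```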

Most likely you will want to use diarization along with transcription, using dt.py. You should be on computer #5 for this.

If you ever get an error message that says “ModuleNotFoundError: No module named ‘torch’”, that probably means you haven’t activated pyannote. To diarize and transcribe a recording, use dt.py like this:

python /phon/vosk/dt.py --input yoursoundfile.wav

We can transcribe that interview excerpt we aligned above:

python /phon/vosk/dt.py --input ../files/english_data/interview_excerpt.wav

The first time you use dt.py, pyannote will download some more files that it needs. Then it will tell you when it’s starting diarization and when it’s starting transcription. You can watch it transcribe in real time.

You can expect an hour-long recording to take about 15 minutes to diarize and transcribe on computer #5, which has a GPU (graphics processing unit) that speeds up diarization. On another computer, it might take an hour or more. When it’s complete, the script will produce yoursoundfile_dt.TextGrid.
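
If you want a rough idea of how long a recording will take, the timing above works out to about four times faster than real time on the GPU computer. The sketch below turns that into a small calculation. The speedup values are rough assumptions based on the figures above, not measurements, and the function names are ours.

```python
import wave

def wav_duration_seconds(path):
    """Length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def estimated_minutes(duration_seconds, speedup=4.0):
    """Processing time in minutes, assuming `speedup` times faster than real time."""
    return duration_seconds / speedup / 60

print(estimated_minutes(3600))        # one hour at 4x real time: 15.0
print(estimated_minutes(3600, 1.0))   # one hour at 1x real time: 60.0
```

For a real file you could call estimated_minutes(wav_duration_seconds("yoursoundfile.wav")).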

You may want to diarize and transcribe a whole batch of recordings. You can create a batch file and leave the commands running one after the other. We have a utility script to help make the batch file. Change directories into a folder containing the wav files you want to transcribe. Then run this command:

python /phon/vosk/make_dt_batch_file.py

This will create a file called batch_file with a dt.py command for every wav file in your directory. Rename it to something descriptive and make it executable:

mv batch_file my_dt_batch_file
chmod +x my_dt_batch_file
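
If you ever need to build a batch file by hand (for example, for only some of the files in a folder), the idea is simple enough to sketch in a few lines of Python. This is not the code of make_dt_batch_file.py, just a minimal version of the same idea. The function names and the #!/bin/bash first line are our own choices.

```python
import shlex
from pathlib import Path

def batch_lines(wav_names, script="/phon/vosk/dt.py"):
    """One dt.py command per WAV file name, quoted so spaces are safe in the shell."""
    return [f"python {script} --input {shlex.quote(name)}" for name in sorted(wav_names)]

def write_batch_file(folder=".", out_name="my_dt_batch_file"):
    """Write commands for every .wav file in `folder` and make the file executable."""
    folder = Path(folder)
    names = [p.name for p in folder.glob("*.wav")]
    out_path = folder / out_name
    out_path.write_text("\n".join(["#!/bin/bash"] + batch_lines(names)) + "\n")
    out_path.chmod(0o755)  # same effect as chmod +x for the owner
    return out_path

print(batch_lines(["b.wav", "my interview.wav"]))
```

Calling write_batch_file() from inside your folder of recordings would write the file. The example above only prints the commands.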

If you are going to leave it running, you’ll probably want to use screen so that it can keep running after you disconnect (press Ctrl-A and then D to detach from the screen session, and use screen -r to reattach later):

screen
./my_dt_batch_file

If you ever want to simply transcribe a recording in one textgrid tier, you can use transcribe_wav.py. This does not require pyannote, but it does require vosk. Do this to enable vosk outside of the pyannote environment:

pip install vosk

Then do this to run transcribe_wav.py:

python /phon/vosk/transcribe_wav.py --input yoursoundfile.wav
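
For the curious, the core of a vosk transcription looks roughly like the sketch below, which follows the pattern in vosk's own Python examples. It is not the contents of transcribe_wav.py. The function names are ours, model_path stands for wherever the language model folder lives, and the input is assumed to be a mono 16-bit PCM wav file.

```python
import json
import wave

def words_from_result(result_json):
    """Extract (start, end, word) tuples from one vosk result string."""
    result = json.loads(result_json)
    return [(w["start"], w["end"], w["word"]) for w in result.get("result", [])]

def transcribe(wav_path, model_path):
    """Transcribe a mono 16-bit PCM WAV file and return a list of timed words."""
    # Imported here so the helper above can be used without vosk installed
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(Model(model_path), wav.getframerate())
        recognizer.SetWords(True)  # ask for word-level timestamps
        words = []
        while True:
            data = wav.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance is complete
            if recognizer.AcceptWaveform(data):
                words.extend(words_from_result(recognizer.Result()))
        words.extend(words_from_result(recognizer.FinalResult()))
    return words
```

The timed words returned by a loop like this are the kind of information that ends up as intervals in a textgrid tier.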

You can expect an hour-long recording to take about 15 minutes to transcribe. By default, this script will apply a limiter to the recording to compress the dynamic range (essentially making the quiet parts almost as loud as the loud parts). This helps the transcriber to detect interviewer speech, which is often relatively quiet. The files produced will be:

yoursoundfile_limited.wav: the limited sound file, in case you want to listen to it. You can delete it.
yoursoundfile_limited.TextGrid: the transcription

If you don’t want to use the limiter, you can specify that:

python /phon/vosk/transcribe_wav.py --input yoursoundfile.wav --limiter 0
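
To get a feel for what compressing the dynamic range means, here is a toy calculation in Python. It is not the limiter that the script applies. It uses a simple power-law curve (our choice, purely for illustration) on sample values between -1 and 1 to show how the level difference between a quiet voice and a loud voice shrinks.

```python
import math

def compress(samples, exponent=0.5):
    """Toy compressor: raise each sample's magnitude (0 to 1) to a power below 1."""
    return [math.copysign(abs(s) ** exponent, s) for s in samples]

def level_difference_db(quiet, loud):
    """Difference in dB between two peak amplitudes."""
    return 20 * math.log10(loud / quiet)

quiet, loud = 0.01, 1.0
new_quiet, new_loud = compress([quiet, loud])
print(round(level_difference_db(quiet, loud)))          # 40
print(round(level_difference_db(new_quiet, new_loud)))  # 20
```

In this example the quiet speaker starts out 40 dB below the loud one and ends up only 20 dB below.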

Improving transcription

After your recordings are transcribed, you will probably want to hand-correct them.

It is also possible to improve the transcription system by improving the language model. The language model includes information about what words sound like and what sequences of words are likely. If we have a lot of speech recordings, we can adapt the model to make the sequences of words in our recordings more probable (and more likely to be transcribed correctly).
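
The "sequences of words" part can be illustrated with a toy bigram model in Python. This is not how vosk models are adapted in practice. It only shows, with made-up sentences, why adding text from our own recordings makes our typical word sequences more probable.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def next_word_probability(counts, first, second):
    """Estimate P(second | first) from bigram counts."""
    total = sum(n for (w1, _), n in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0

general = ["I went to the store", "we went to the park"]
ours = ["the vowel was fronted", "the vowel was raised", "the vowel shifted"]

before = bigram_counts(general)
after = bigram_counts(general + ours)
print(next_word_probability(before, "the", "vowel"))  # 0.0
print(next_word_probability(after, "the", "vowel"))   # 0.6
```

Before adaptation the toy model has never seen "the vowel", so it would prefer similar-sounding alternatives. Afterwards that sequence is the most likely continuation of "the".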

Diarization and transcription on your own computer

To use diarization and transcription tools on another computer you will need to get vosk, pyannote, and the tools they depend on.

To use transcribe_wav.py, you will need to install python and vosk, and download transcribe_wav.py and alignbrary3.py and put them in the same folder. Download the language model vosk-model-en-us-0.22 from the vosk models page.

Install SoX (Sound eXchange), which is used for preprocessing sound files.

When running the script, you will need to tell it where you put the model, something like this:

python transcribe_wav.py --input yoursoundfile.wav --model path_to_model_on_your_computer

You can modify line 28 of your local copy of transcribe_wav.py to make your model path the default.
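
If you have not seen how a default like this works, the sketch below shows the general pattern with Python's argparse module. It is a hypothetical illustration, not the actual code in transcribe_wav.py, and the path in DEFAULT_MODEL is a placeholder for wherever you put the model.

```python
import argparse

# Hypothetical path: replace with wherever you unpacked the model
DEFAULT_MODEL = "/home/you/models/vosk-model-en-us-0.22"

def build_parser(default_model=DEFAULT_MODEL):
    """Command-line options with the model path defaulting to a local folder."""
    parser = argparse.ArgumentParser(description="Transcribe a WAV file")
    parser.add_argument("--input", required=True, help="sound file to transcribe")
    parser.add_argument("--model", default=default_model, help="path to the vosk model folder")
    return parser

# With no --model given, the default is used
args = build_parser().parse_args(["--input", "yoursoundfile.wav"])
print(args.model)
```

With a default set this way, --model only needs to be given when you want a different model.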

To diarize and transcribe on your own computer, install vosk like this (you only need to do this once on each computer where you want to use vosk for transcription):

pip install vosk

Install SoX as above, and install anaconda. Then follow the instructions above to install pyannote in anaconda. Download dt.py and alignbrary3.py and put them in the same folder.

pyannote relies on pretrained models that are hosted by an organization called Hugging Face. As of late 2022, it is necessary to register on their website in order to use a model required by pyannote. The first time you try using pyannote on another computer without registering, you will probably get a cryptic error message. You can search for that error message on the internet and get information about registration.

When running the diarization and transcription script, you will need to specify the path to the language model (or edit your copy of the script to make it the default):

python dt.py --input yoursoundfile.wav --model path_to_model_on_your_computer