part 1


The vowel plot assignment is meant to show you speech recording, forced alignment, automatic acoustic measurement, plotting, and hand correction. More information about the vowel plot assignment can be found here.

Next we are going to go through some of these steps in a way that can be scaled up and customized for your research needs. Most of what we do can be run on a phonetics lab computer or on your own computer. It’s easier to get started on a phonetics lab computer (probably over a remote connection), and there you will have access to corpora that may be too large to store on your own computer or that you don’t have permission to download.

Tip: Keep a text file on your computer to keep records of commands so you can copy and paste them and learn them. You want to be able to replicate all the steps of things you’ve done once, while you’re still learning what all the steps mean. Using a text editor like TextEdit or Notepad can be better than Word because they are less likely to do things like replace quotation marks with fancy quotes. Fancier text editors exist, such as Kate (free) and Sublime Text ($99).

Connecting to a phonetics lab computer

If you are on campus with a wired connection or connected to the wireless network ncsu or eduroam (not ncsu-guest), connecting should work as described here. If you are off campus, you will need to connect to the VPN first. In all examples, replace “phon” with the name of a specific lab computer (ask Jeff for names).

Open a terminal window (on a Mac, look for Terminal in /Applications/Utilities or search for “Terminal”; in Windows 10 and up, type “cmd” in the search field and click on “Command Prompt”; for older Windows versions, you’ll need to download a program called PuTTY: see the lab computers page for that).

To open the connection enter this command (remembering to replace yourunityid and phon) and then enter your unity id password:


You will be connected to a lab computer running the Linux operating system, and you interact with it by entering Linux commands. You should see a command prompt that looks sort of like this: yourunityid@phon:~$

You will start out in your own home directory on the specific computer you connected to. All of these computers are connected to a shared hard drive called “phon”, which is full of data and scripts for working with data, and we will spend most of our time working on phon. It has an “ENG523” directory for us to put our files in. Change to that directory using the change directory (cd) command like this:

cd /phon/ENG523

You can always get back to your home directory on the computer you’re connected to like this (if you try this, change back to ENG523 before you do the next steps):

cd ~/

From the directory /phon/ENG523, create a directory for yourself like this using the mkdir (make directory) command (use your actual unity id) and then cd into it. Note that when you are already in /phon/ENG523, you can just type “cd yourunityid” but the following cd command will work from anywhere:

mkdir yourunityid
cd /phon/ENG523/yourunityid
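
The difference between the relative and absolute forms of cd can be tried out safely in a throwaway directory. This is only a sketch: the temporary directory stands in for /phon/ENG523, and the -p option to mkdir (which avoids an error if the directory already exists) is an addition not used above.

```shell
# Demonstration in a throwaway directory (a stand-in for /phon/ENG523).
base=$(mktemp -d)          # hypothetical stand-in for the class directory
cd "$base"
mkdir -p yourunityid       # -p: no error if the directory already exists
cd yourunityid             # relative path: works only from inside $base
cd /                       # go somewhere else entirely
cd "$base/yourunityid"     # absolute path: works from anywhere
pwd                        # prints the directory you ended up in
```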

Use the Penn Phonetics Lab Forced Aligner (P2FA) to replicate part of the vowel plot assignment

Compared to present-day alternatives, P2FA is a simple aligner. It expects one wav file and one text file in English, and it will write the result to a textgrid file whose name you specify. To align the test files that came with P2FA, do this:

python2 /phon/p2fa/ ../files/BREY00538.wav ../files/BREY00538.txt BREY00538.TextGrid

Copy the wav file for my vowel plot recording to your directory and align it too. You can align any arbitrary English wav and text file in this way.

cp ../files/jeffmielke2017.wav ./
python2 /phon/p2fa/ jeffmielke2017.wav ../files/OHDARE2.txt jeffmielke2017.TextGrid

The aligner will warn you that it doesn’t recognize one of the words. To fix this, copy a dictionary supplement into your directory:

cp ../files/vowelplot.dict ./dict.local

Then align again as above (you don’t need to copy the wav file again). Try pressing the up arrow to recall the earlier python2 command instead of retyping it.

Prepare the Montreal Forced Aligner

Compared to P2FA, the Montreal Forced Aligner has more features. It supports several languages with pretrained models and it allows you to train new models for any language. It is also easier to install on your own computer.

Do these commands the first time you want to use MFA on a *particular* lab computer. First install the aligner (say Y when prompted):

conda create -n aligner -c conda-forge montreal-forced-aligner

Set up the aligner for aligning English recordings (downloading acoustic models of English sounds, an English pronunciation dictionary, and loading our own English dictionary that includes additional words from our interviews):

conda activate aligner
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
mfa model save --name ral_mfa dictionary /phon/ENG523/files/ral_mfa_dict

Set up the aligner for Spanish:

conda activate aligner
mfa model download acoustic spanish_mfa 
mfa model download dictionary spanish_mfa

To see what other languages are supported, see the list of pretrained acoustic models and dictionaries.

This is how you exit conda:

conda deactivate

Remember that if you connect to a different phonetics lab computer in the future you will need to repeat these steps the first time you use MFA on it.

Use the Montreal Forced Aligner

P2FA expects one wav file and one txt file as input. MFA expects a directory with one or more wav files, each with a matching txt file (with all the words in it) or TextGrid file (which can have different tiers for different speakers and different intervals for different speaker turns). If you have interview1.wav and interview2.wav, you should also have either interview1.TextGrid and interview2.TextGrid or interview1.txt and interview2.txt. txt files are appropriate for short recordings of one speaker, and TextGrid files are better for longer recordings and necessary when you want to distinguish the speech of multiple speakers. TextGrid tier names will be interpreted as speaker labels. File names are just used to match up sound files with text files.
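
Before aligning, it can be worth checking that every wav file really has a matching transcript. The function below is a minimal sketch of such a check. check_corpus is a hypothetical helper name and is not part of MFA; it only looks at file names, not at file contents.

```shell
# check_corpus is a hypothetical helper, not part of MFA.
# It reports each wav file in a directory that has no matching
# .txt or .TextGrid file, and fails if any are missing.
check_corpus() {
    cc_dir=$1
    cc_missing=0
    for cc_wav in "$cc_dir"/*.wav; do
        [ -e "$cc_wav" ] || continue          # the glob matched nothing
        cc_base=${cc_wav%.wav}                # strip the .wav extension
        if [ ! -e "$cc_base.txt" ] && [ ! -e "$cc_base.TextGrid" ]; then
            echo "no transcript for $(basename "$cc_wav")"
            cc_missing=$((cc_missing + 1))
        fi
    done
    [ "$cc_missing" -eq 0 ]                   # success only if nothing is missing
}
```

After pasting the function into your terminal, you could run it on a corpus directory, for example check_corpus ../files/english_data. If it prints nothing, every wav file has a transcript.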

Start your aligner session:

conda activate aligner

Align Jeff’s vowelplot recording (using files in /phon/ENG523/files/jeff_vowelplot).

mfa align ../files/jeff_vowelplot ral_mfa english_us_arpa jeff_vowelplot_output

After it completes and says “Done! Everything took ??? seconds”, you may see the error message “psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.” This is not a problem as long as it happens after your files get created. Check whether your textgrid(s) have been created like this:

ls -l jeff_vowelplot_output

If you see alignment_analysis.csv and jeff.TextGrid, your alignment was successful.

Here is how to align another English recording (using files in /phon/ENG523/files/english_data):

mfa align --clean ../files/english_data ral_mfa english_us_arpa english_data_output

Align a sample Spanish speech recording (using files in /phon/ENG523/files/spanish_data):

mfa align --clean ../files/spanish_data spanish_mfa spanish_mfa spanish_data_output

Diarization and transcription

What if you have a speech recording with no transcript? We use pyannote for diarization (detecting when different people are talking in a recording) and vosk for transcription (creating a text on the basis of a speech recording). Automatic transcription is widespread, and a lot of different tools are available. We use these particular tools because they allow us to process our files locally (so that we don’t have to upload our possibly sensitive recordings to someone else’s server), they are free and open-source, they allow batch processing, and they are customizable.

There are two python scripts that we use for creating transcripts of speech recordings:

  • is designed to take a sound file, separate the speech by speaker, and transcribe it, producing a textgrid with a tier for each speaker. It is appropriate for transcribing interviews or anything else, even if you think there is only one speaker.
  • is designed to take a sound file and transcribe it without considering the possibility that there are different speakers. It is appropriate for recordings of one speaker, or if you are having trouble installing pyannote or performing diarization.

These scripts are being used for English recordings, but they can be used for any language that has a language model available. See available language models here.

If you are using diarization, you will likely want to use computer #5. Otherwise, you can use any of the Linux computers in the phonetics lab (in person or remotely).

To set up pyannote for the first time, enter these commands (you only need to do this once on each lab computer where you want to use pyannote for diarization). Enter each line individually and press enter after each one. Some will ask you to confirm by typing “y”:

conda create -n pyannote python=3.8
conda init bash
conda activate pyannote
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 -c pytorch
pip install -qq
pip install vosk

These commands will set up pyannote to work in conda and install various python packages that it requires. Close the terminal and open a new one for these changes to take effect (log out of your ssh session, or close the terminal window).

Most likely you will want to use diarization along with transcription. You should be on computer #5 for this. Enter this command at the start of each session when you want to be able to use diarization:

conda activate pyannote

If you ever get an error message that says “ModuleNotFoundError: No module named ‘torch’”, that means you probably haven’t activated pyannote. To diarize and transcribe a recording, run the script like this:

python /phon/vosk/ --input yoursoundfile.wav

We can transcribe that interview excerpt we aligned above:

python /phon/vosk/ --input ../files/english_data/interview_excerpt.wav

The first time you run the script, pyannote will download some more files that it needs. Then it will tell you when it’s starting diarization and when it’s starting transcription. You can watch it transcribe in real time.

You can expect an hour-long recording to take about 15 minutes to diarize and transcribe. Computer #5 has a GPU (graphics processing unit) which speeds up diarization. On another computer, it might take an hour or more to diarize and transcribe. During the transcription step, you will see the text that is being transcribed, as it is being transcribed. When it’s complete, the script will produce yoursoundfile_dt.TextGrid.
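
As a back-of-the-envelope check before starting a large job, the 15-minutes-per-hour figure can be turned into a quick estimate with shell arithmetic. This sketch assumes that processing time scales linearly with recording length on the GPU computer, which is only an approximation. estimate_minutes is a hypothetical helper name.

```shell
# Rough estimate: about a quarter of the recording's duration
# (15 minutes of processing per 60 minutes of audio).
estimate_minutes() {
    echo $(( $1 / 4 ))     # integer minutes, rounded down
}

estimate_minutes 60        # one hour-long recording: prints 15
estimate_minutes 480       # eight hours of recordings: prints 120
```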

You may want to diarize and transcribe a whole batch of recordings. You can create a batch file and leave them running one after the other. We have a utility script to help make the batch file. Change directories into a folder containing wav files you want to transcribe. Then run this command:

python /phon/vosk/

This will create a file called batch_file with a command for every wav file in your directory. Rename it to something descriptive and make it executable:

mv batch_file my_dt_batch_file
chmod +x my_dt_batch_file
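
To make the idea of a batch file concrete, here is a sketch of how one could be built by hand. make_batch is a hypothetical helper and is not the lab’s utility script, so the file it writes may differ from what that script produces. The command to run is passed in as a placeholder argument; only the --input option is taken from the examples above.

```shell
# make_batch is a hypothetical helper; it is NOT the lab's utility script.
# It writes one command line per wav file in the current directory.
make_batch() {
    mb_cmd=$1                   # placeholder for the real transcription command
    mb_out=$2                   # name of the batch file to write
    : > "$mb_out"               # start with an empty file
    for mb_wav in *.wav; do
        [ -e "$mb_wav" ] || continue          # the glob matched nothing
        printf '%s --input "%s"\n' "$mb_cmd" "$mb_wav" >> "$mb_out"
    done
    chmod +x "$mb_out"          # make the batch file executable
}
```

Passing echo as the command (make_batch echo test_batch) gives a harmless batch file that only prints what it would have run, which is a safe way to check the list of files before committing to a long job.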

If you are going to leave it running you’ll probably want to use screen so that it can keep running after you disconnect:


If you ever want to simply transcribe a recording in one textgrid tier, you can use the transcription-only script (this does not require pyannote, but it does require vosk). Do this to enable vosk outside of the pyannote environment:

pip install vosk

Then run the script like this:

python /phon/vosk/ --input yoursoundfile.wav

You can expect an hour-long recording to take about 15 minutes to transcribe. By default, this script will apply a limiter to the recording to compress the dynamic range (essentially making the quiet parts almost as loud as the loud parts). The files produced will be:

  • yoursoundfile_limited.wav: the limited sound file, in case you want to listen to it. You can delete it.
  • yoursoundfile_limited.TextGrid: the transcription

This helps the transcriber to detect interviewer speech, which is often relatively quiet. If you don’t want to use the limiter, you can specify that:

python /phon/vosk/ --input yoursoundfile.wav --limiter 0

Improving transcription

After your recordings are transcribed, you will probably want to hand-correct them.

It is also possible to improve the transcription system by improving the language model. The language model includes information about what words sound like and what sequences of words are likely. If we have a lot of speech recordings, we can adapt the model to make the sequences of words in our recordings more probable (and more likely to be transcribed correctly).

Make measurements using one_script

Use Praat and one_script to measure the formants in that vowel plot recording you just aligned:

praat /phon/scripts/one_script.praat '/phon/ENG523/files/jeff_vowelplot/jeff.wav;/phon/ENG523/jimielke/jeff_vowelplot_output/jeff.TextGrid' 'VOWEL VL' 'formants()' 'l'

Measure spectral properties of sibilants instead of vowel formants:

praat /phon/scripts/one_script.praat '/phon/ENG523/files/jeff_vowelplot/jeff.wav;/phon/ENG523/jimielke/jeff_vowelplot_output/jeff.TextGrid' 'S SH' 'sibilant_jane()'

Measure sibilants in a fragment of the Buckeye Corpus instead of that one vowel plot recording:

praat /phon/scripts/one_script.praat '/phon/Buckeye/buckeye_test.csv' 'S SH' 'sibilant_jane()'

To list the files that are in your directory now, use the ls command:

ls


To see some more information about these files do this:

ls -l

To see some more information about files we’ve been working with in the “files” directory, do this:

ls -l /phon/ENG523/files

Transfer files between a phonetics lab computer and your computer

When you create files remotely on the phonetics lab computer, you are limited in how you can interact with them. After you make measurements, you will probably want to download your files. To copy files between local and remote computers, open a NEW terminal window on your computer (the same way you did above). Importantly, do NOT use ssh to connect to the remote (phonetics lab) computer. You will be issuing these commands to YOUR OWN computer.

The command scp (secure copy) takes two arguments: where to find the file you want to copy and where to put it. For downloads, we have to specify the remote computer and username in the first argument. To download the textgrid you created by aligning with P2FA, do this (with the usual changes):

scp ./

To download the textgrid you created by aligning with MFA, do the same thing but specify a different file location and filename:

scp ./

The “./” at the end tells it to put the file in your current directory. If you want to specify a different directory, you can change “./” to something like “/Users/jimielke/Documents” or “C:\Users\jimielke\Documents”.

Uploading works similarly.

The scp command for uploading specifies the remote computer and username for the destination. This will work if a file called myfile.wav is in your current directory in the terminal:

scp myfile.wav

If myfile.wav is not found but you think it should be, you can change the source file from myfile.wav to a full path such as “/Users/jimielke/Documents/myfile.wav” or “C:\Users\jimielke\Documents\myfile.wav”, or else use the cd command to get into /Users/jimielke/Documents or C:\Users\jimielke\Documents first.

Putting it all together

In all of these examples, change “myfile” to the actual names of the files we are working with and yourunityid to your actual unity id. First paste these commands into your text file, and then edit them. Then you will have a repeatable set of commands to use in the future.

Upload a sound file to your ENG 523 directory by running this command in a terminal on your local computer:

scp myfile.wav

Now open an ssh connection to a phonetics lab computer and use that connection for the next steps:


Run these commands on the remote lab computer to diarize and transcribe the speech in the recording:

conda activate pyannote
python /phon/vosk/ --input myfile.wav

Organize the files for forced alignment. Here we are creating a directory, copying the wav file into it, and copying the textgrid file into it with a new name that matches the wav file’s name except for the .TextGrid extension. The input for forced alignment is a directory containing one or more wav files and one or more textgrid files with matching filenames.

mkdir mycorpus
cp myfile.wav mycorpus
cp myfile_dt.TextGrid mycorpus/myfile.TextGrid
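
If you have several recordings, the same copy-and-rename step can be done in a loop. The function below is a sketch under the assumption that each transcript is named like its wav file plus _dt.TextGrid, as produced by the diarize-and-transcribe step. stage_for_alignment is a hypothetical helper name.

```shell
# stage_for_alignment is a hypothetical helper name.
# For each wav file in the current directory that has a *_dt.TextGrid
# transcript, copy both into a corpus directory, renaming the TextGrid
# so that it matches the wav file's name.
stage_for_alignment() {
    sa_corpus=$1
    mkdir -p "$sa_corpus"
    for sa_wav in *.wav; do
        [ -e "$sa_wav" ] || continue          # the glob matched nothing
        sa_base=${sa_wav%.wav}                # strip the .wav extension
        sa_tg="${sa_base}_dt.TextGrid"
        if [ -e "$sa_tg" ]; then
            cp "$sa_wav" "$sa_corpus/"
            cp "$sa_tg" "$sa_corpus/$sa_base.TextGrid"
        else
            echo "skipping $sa_wav: $sa_tg not found" >&2
        fi
    done
}
```

Running stage_for_alignment mycorpus from the directory that holds your recordings would fill mycorpus with matching wav and TextGrid pairs, ready to be used as the input directory for forced alignment.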

Align the transcript to the speech recording:

conda activate aligner
mfa align --clean mycorpus ral_mfa english_us_arpa mycorpus_output

Use Praat and one_script to measure the formants in the recording (note that the wav file is in the mycorpus directory and the textgrid is in the mycorpus_output directory):

praat /phon/scripts/one_script.praat '/phon/ENG523/yourunityid/mycorpus/myfile.wav;/phon/ENG523/yourunityid/mycorpus_output/myfile.TextGrid' 'VOWEL VL' 'formants()' 'l'

Finally, download the measurements to your computer by running this command in a terminal on your local computer (changing “date_and_time” to the actual date and time stamp in your measurement file’s name):

scp ./

These are all of the steps we can automate. When using these tools in a real project, you would make manual corrections. After diarization and transcription you would download the transcript textgrid, correct it, upload the corrected version, and use that as the input to forced alignment. You may or may not manually correct the forced aligner’s segmentation before making measurements. After measurement, you will want to plot the measurements and then you may or may not decide to modify the measurement settings, manually correct some measurements, or make corrections at earlier stages and rerun subsequent stages.

Continue to part 2.