Forced alignment

There are a few options for automatically aligning a transcript to your sound file. For each of them, you will need a wav file and a text file containing the transcript (in words). Most forced alignment systems are based on the HTK Speech Recognition Toolkit. HTK stands for Hidden Markov Model Toolkit. It was developed at Cambridge in the late 1980s. P2FA (the Penn Phonetics Lab Forced Aligner) is now a popular python-based interface to HTK. It uses HTK, the CMU Pronouncing Dictionary, and a set of acoustic models derived from a corpus of recordings of SCOTUS, the U.S. Supreme Court.

A WEB INTERFACE FOR THE PENN PHONETICS LAB FORCED ALIGNER

If you don’t need to change anything about the alignment process (such as adding words to the dictionary or doing batches of files), your easiest option may be to use the web interface to P2FA. Where it says “text file (optional)”, upload your transcript. Where it says “wav file”, upload your wav file. Then enter your e-mail address and click “Submit”. When the program is done (usually a few minutes), you will receive a TextGrid file at the e-mail address you entered. Currently the file limit for this public installation is ~20 MB: around 16 minutes of 16-bit audio sampled at 11025 Hz, or 4 minutes sampled at 44100 Hz. All uploaded files are downsampled to 11025 Hz for processing, so consider downsampling before uploading to maximize file duration.
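
For example, you could downsample with SoX before uploading (this assumes SoX is installed on your computer; “downsampled.wav” is just a placeholder filename):

sox yourwavfile.wav -r 11025 downsampled.wav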

RUNNING P2FA ON THE PHONETICS LAB COMPUTER

P2FA is installed on our server, so you can use it by logging in to a lab computer instead of using the web interface. Running P2FA this way will allow you to add custom dictionary entries or do multiple batches of alignments. Run the aligner by issuing a command like this:

python /phon/p2fa/align.py yourwavfile.wav yourtranscript.txt youroutput.TextGrid

The aligner may run for a long time; when it’s done, the prompt will reappear and a new TextGrid should appear in your directory. Keep in mind that alignment takes much longer, or fails altogether, if any of the files you are aligning are open in another program.

If you think an alignment may take longer than you will be at your computer, you can use the screen command (see the example after the nohup command below). Another way to handle this is to use nohup, which tells the server to keep processing the files even if you log off. Any completed TextGrids will be available in your directory the next time you log on. Just type your command like this instead:

nohup python /phon/p2fa/align.py yourwavfile.wav yourtranscript.txt youroutput.TextGrid
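
If you prefer screen (assuming it is installed on the server; the session name “alignment” below is just an example), you can start a named session, run the aligner inside it, and then detach by pressing Ctrl-a and then d:

screen -S alignment
python /phon/p2fa/align.py yourwavfile.wav yourtranscript.txt youroutput.TextGrid

To check on the alignment later, reattach to the session:

screen -r alignment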

If you want to align part of a wav file (for example, only the second minute of your recording), you can enter the start time and end time in seconds like this:

python /phon/p2fa/align.py -s 60 -e 120 yourwavfile.wav yourtranscript.txt youroutput.TextGrid

To see instructions, you can simply type this:

python /phon/p2fa/align.py

ADDING WORDS TO THE DICTIONARY

A major advantage of using the aligner directly (instead of through the web interface) is being able to modify the dictionary. You can see the first 100 lines of the CMU dictionary that P2FA uses by typing this:

head -100 /phon/p2fa/model/dict

You can create a file in your home directory (or wherever you align files) called dict.local. Whenever you run the aligner the contents of this file will be appended to the regular dictionary. You can create it on your own computer and then upload it, or you can create it directly on the server like this:

vim dict.local

Follow the formatting of the original dict file. The aligner is very picky about formatting. Entries must be in all capitals, and each word must be followed by two spaces and then all of its phonetic symbols separated by one space. The phonetic symbols are as follows:

English consonants: [p]=P, [t]=T, [k]=K, [b]=B, [d]=D, [ɡ]=G, [tʃ]=CH, [dʒ]=JH, [f]=F, [θ]=TH, [s]=S, [ʃ]=SH, [h]=HH, [v]=V, [ð]=DH, [z]=Z, [ʒ]=ZH, [m]=M, [n]=N, [ŋ]=NG, [l]=L, [w]=W, [ɹ]=R, [j]=Y

English primary stressed vowels: [i]=IY1, [u]=UW1, [ɪ]=IH1, [ʊ]=UH1, [e]=EY1, [o]=OW1, [ʌ]=AH1, [ɛ]=EH1, [ɚ]=ER1, [ɔ]=AO1, [æ]=AE1, [ɑ]=AA1, [aɪ]=AY1, [aʊ]=AW1, [ɔɪ]=OY1

English secondary stressed vowels: [i]=IY2, etc.

English unstressed vowels: [i]=IY0, [ə]=AH0, etc.
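
For example, a dict.local file adding two words might look like this (these entries are just illustrations, not taken from the real dictionary):

SELFIE  S EH1 L F IY0
ZOOMED  Z UW1 M D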

The dictionary also contains the following noises, which you can put in your transcript as needed (the curly brackets are important): cough={CG}, laugh={LG}, lip smack={LS}, noise={NS}, silence={SL}

Troubleshooting

If you made your audio recording in Ultraspeech, there is a good chance that you have a 32-bit wave file, and when you run align.py or align_interview_turns.py, you will get an error ending in wave.Error: unknown format: 65534.

It’s easy to make a 16-bit version of your wav file:

sox yourwavefile.wav -b 16 yourwavefile2.wav

Then run the aligner using yourwavefile2.wav instead of yourwavefile.wav. You can continue to analyze your audio using either file.
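
If you are not sure what kind of wav file you have, soxi (part of SoX, assuming it is installed) will report its sample rate and bit depth:

soxi yourwavefile.wav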

ALIGNING INTERVIEWS USING P2FA

The basic align.py script for P2FA creates a TextGrid with one word tier and one phone tier, so it doesn’t distinguish between different speakers. To process interviews, we made a Python script that reads turns from a TextGrid, aligns each turn separately, and makes a TextGrid file with separate tiers for each speaker.

python /phon/p2fa/align_interview_turns.py yourwavfile.wav yourtranscript.TextGrid youroutput.TextGrid

ALIGNING SPANISH RECORDINGS

In addition to P2FA for English, we have FASE, a Spanish forced aligner created by Eric Wilbanks (formerly of NCSU). Running FASE is similar to running P2FA (but note the -w, -t, and -o options). You can use a text file (*.txt) with speaker tags in curly brackets or a TextGrid with different speakers’ turns segmented on different tiers.

python /phon/fase/fase_align.py -w yourwavfile.wav -t yourtranscript.txt -o tmp
python /phon/fase/fase_align.py -w yourwavfile.wav -t yourtranscript.TextGrid -o tmp
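
As a rough illustration only (the exact speaker-tag conventions here are an assumption; check the FASE documentation), a plain-text transcript with curly-bracket speaker tags might look something like this:

{Interviewer} buenos días cómo está
{Participant} muy bien gracias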

FASE must create a temporary directory (called “tmp” in this example) which it deletes when it’s finished. FASE assumes the directory does not already exist, so if it does, you need to delete it first:

rm -rf tmp

This is a very powerful command (to delete a directory and its contents without asking you to confirm individual files), so use it very carefully.

Troubleshooting

There is a good chance that FASE will tell you that your transcript has words that are missing from the dictionary. To see the list of missing words, do this:

cat tmp/missing_words

Then make a text file with one line for each missing word, with its phonetic transcription. If the content of missing_words was ANANÁS, make a text file with one line:

ANANÁS a n a n a s

If you save your dictionary text file as dict.local, you will now want to delete tmp (as above) and then run the aligner with the -m option pointing to your dictionary, e.g.:

python /phon/fase/fase_align.py -w yourwavfile.wav -t yourtranscript.TextGrid -o tmp -m dict.local

If you made your audio recording in Ultraspeech, there is a good chance that you have a 32-bit wave file, and when you run fase_align, you will get an error ending in wave.Error: unknown format: 65534.

It’s easy to make a 16-bit version of your wav file:

sox yourwavefile.wav -b 16 yourwavefile2.wav

Then run the aligner using yourwavefile2.wav instead of yourwavefile.wav. You can continue to analyze your audio using either file.

ALIGNING FRENCH RECORDINGS

For French, we have SPLaligner, a forced aligner created by Peter Milne of the University of Ottawa. Using SPLaligner is very similar to using P2FA (but the dictionary uses IPA symbols):

python /phon/SPLaligner/align.py yourwavfile.wav yourtranscript.txt youroutput.TextGrid yourdictionary

ALIGNING OR REALIGNING IN PRAAT

We have a Praat script that is useful for touching up existing textgrids: /phon/scripts/editor_align.praat. It requires P2FA or another aligner to be installed on the computer where you are using it. Therefore you can use it on the lab computers that run Linux. The script is an editor script, so it needs to be opened from inside Praat’s Editor window.