Query word count information
On the phonetics lab computers, there are some files listing all the words in various speech corpora by frequency. You can explore lexical frequency in order to see how many tokens are available for you to measure and to gain familiarity with some useful Linux commands. For example, /phon/ENG523/files/raleigh_freq.txt has the frequency of words in the transcribed part of the Raleigh corpus, and /phon/ENG523/files/raleigh_freq2.txt has those words with their pronunciations (with separate entries for words transcribed with different phonemes).
Use cat to view the entire contents of raleigh_freq.txt:
cat /phon/ENG523/files/raleigh_freq.txt
Use head to just see the 20 most frequent words in raleigh_freq.txt:
head -20 /phon/ENG523/files/raleigh_freq.txt
Use grep to search for words with “Q” in them:
grep 'Q' /phon/ENG523/files/raleigh_freq.txt
Use grep, head, and “pipe” to see the top 15 words with “Q” in them:
grep 'Q' /phon/ENG523/files/raleigh_freq.txt | head -15
Switch to the freq2 files to search for the 10 most frequent words pronounced with /s t ɹ/:
grep 'S T R' /phon/ENG523/files/raleigh_freq2.txt | head -10
What freq2 files are present:
The Raleigh corpus was aligned with P2FA, and its words are in all caps. The other SLAAP corpora were aligned with MFA, and their words are in all lowercase. The Buckeye corpus was aligned in the early 2000s and has a different phone set. These things need to be taken into account when searching. Use head and cat if your grep command output doesn’t make sense to you. Use the grep option -i to ignore case. Both of these will work for searching the NC corpus frequency file:
grep 'raleigh' /phon/ENG523/files/nc_freq2.txt
grep -i 'RALEIGH' /phon/ENG523/files/nc_freq2.txt
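If you later want to do the same kind of filtering inside a script, the following is a minimal Python sketch of what grep, grep -i, and head do to a list of lines. The sample lines are invented for illustration and are not taken from the real frequency files, whose exact layout may differ. For simple literal searches like these, Python regular expressions behave the same way as grep patterns.

```python
import re

def grep(lines, pattern, ignore_case=False):
    """Keep the lines that contain a match for pattern, like grep (or grep -i)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]

def head(lines, n=10):
    """Keep the first n lines, like head -n."""
    return lines[:n]

# Invented sample lines (hypothetical; the real files may be laid out differently).
sample = ["the 500", "raleigh 40", "quite 12", "question 9", "Raleigh 1"]

print(grep(sample, "raleigh"))                     # case-sensitive search
print(grep(sample, "RALEIGH", ignore_case=True))   # like grep -i
print(head(grep(sample, "q"), 1))                  # like grep 'q' ... | head -1
```

Chaining head(grep(...)) plays the role of the pipe: the output of the search becomes the input of the truncation.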
Run a Praat Multiple Forced Choice (MFC) script
A useful thing to do in Praat is to auditorily code some speech clips and save your results to a file you can analyze in R or a spreadsheet program. The following examples have been created from the Raleigh corpus using a Praat script called one_script. Below you will see how to make your own MFC scripts.
Example 1: auditorily coding ing
- Download and unzip ing_mfc_example.zip
- In Praat, open the file one_script_out_ing_MFC.praat. This script is designed to play you all 98 occurrences of the words “buying” and “housing” in a 162-speaker sample of the Raleigh corpus.
- Select the script, click “Run”. For each sound you hear, click “ING” or “IN” depending on whether you hear a velar nasal or an alveolar nasal at the end of the word. Continue to follow the prompts to the end, and then close the MFC window when you are instructed to. While still selecting the ExperimentMFC object in the object list, click “Extract results”. While still selecting the ResultsMFC object, click “Collect to table”. While still selecting Table allResults, click “Save”, then “Save as comma-separated file…”. Change the default “allResults.Table” filename to ing_mfc_results.csv.
- Analyze the results in R using the commands in ing_mfc.r.
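The course materials do the analysis step in R with ing_mfc.r. As a rough illustration of what a first pass over the saved results might look like, here is a Python sketch that tallies the response labels. It is not a translation of ing_mfc.r. The column names (subject, stimulus, response) and the rows are assumptions made up for this example, so check the header of your own csv file before relying on them.

```python
import csv
import io
from collections import Counter

# Invented stand-in for ing_mfc_results.csv. The column names are an
# assumption about the saved table, and the rows are made up.
example_csv = """subject,stimulus,response
ing_mfc,buying_001,ING
ing_mfc,buying_002,IN
ing_mfc,housing_001,ING
"""

def count_responses(csv_text, column="response"):
    """Count how many times each response label was chosen."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

counts = count_responses(example_csv)
print(counts)
```

To use it on a real file, read the file's text (for example with open(...).read()) and pass it in place of example_csv.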
Example 2a: auditorily coding stop releases without looking at spectrograms
- Download and unzip final_d_mfc.zip and final_t_mfc.zip
- In Praat, click Open… Read from file… and browse to one_script_out_2022Mar07_14h30_MFC.praat (or 14h31 depending on whether you are doing /t/s or /d/s).
- Select the script, click “Run”, follow the prompts to the end. Use the NA category any time there is a problem that prevents you from coding the target word properly (like if it sounds like the wrong sound or two people are talking at the same time).
- Close the MFC window when you are instructed to. While still selecting the ExperimentMFC object in the object list, click “Extract results”. While still selecting the ResultsMFC object, click “Collect to table”. While still selecting Table allResults, click “Save”, then “Save as comma-separated file…”. Change the default “allResults.Table” filename to something more descriptive. For example you can call it one_script_out_2022Mar07_14h30_MFC_MFC_results_yourname.csv.
Example 2b: auditorily coding stop releases while looking at spectrograms
- Download and unzip final_d_mfc.zip and final_t_mfc.zip (if you didn’t already do it above).
- In Praat, click Praat… Open Praat script… browse for one_script_out_2022Mar07_14h30_editor_mfc.praat (or 14h31 depending on whether you are doing /t/s or /d/s). It will open in a new Praat script window.
- Press Ctrl/Cmd-R to run it. After you get through all the clips, the results will be saved automatically.
- Look for a file in the same folder as the script that starts with editor_mfc_results and rename it so that it has your name at the end.
Make your own Praat MFC script using one_script
- Choose a word that you expect to exhibit some variation in Raleigh. For example, I once listened to every occurrence of “both” in the Raleigh corpus to listen for intrusive L (i.e., [boɫθ]) and found out that intrusive L apparently does not occur here. But if I listened to every token of “home” I would probably find that some of them sound to me like [hoɫm]. Ideally you would choose a word that occurs at least 100 times according to raleigh_freq2.txt. Connect to a lab computer and explore that file to find out. Decide on two or more terms to describe the variants you expect to find, such as “intrusive L” and “no intrusive L”.
- Run a one_script query to create a new MFC script for your word and your labeling categories. While connected to a lab computer, change to your directory and make a one_script query like the example below which is what I used for my “both” query. The first argument is the list of all the Raleigh corpus files, the second argument is the phonemes in your word (with # to indicate word boundaries and / between phones), and the third argument says to make an MFC script with the specified categories (note the underscores in place of spaces here):
praat /phon/scripts/one_script.praat /phon/Raleigh/raleigh_files_sub.csv '#/B/OW1/TH/#' 'mfc(categories="intrusive_L/no_intrusive_L")'
- When the script finishes running, it will say something along the lines of “MFC script /phon/ENG523/jimielke/one_script_out_2022Apr04_13h29_MFC.praat has been created”. Note the date/time stamp in this output and use it to zip up the files like this (which says to zip recursively, i.e., including subfolders, anything that has the timestamp in the filename, and to call the resulting file my_MFC.zip):
zip -r my_MFC *2022Apr04_13h29*
- Follow the procedure described in part 1 to download my_MFC.zip and then work with it as in the previous examples.
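Copying the time stamp by hand is easy to get wrong, so here is a small Python sketch that pulls it out of the message and prints the matching zip command. The pattern (four-digit year, three-letter month, two-digit day, then hour and minute separated by "h") is inferred from the two example stamps in this tutorial and is an assumption about the general format.

```python
import re

def extract_timestamp(message):
    """Return the date/time stamp (e.g. 2022Apr04_13h29) from a one_script message, or None."""
    match = re.search(r"one_script_out_(\d{4}[A-Za-z]{3}\d{2}_\d{2}h\d{2})", message)
    return match.group(1) if match else None

message = ("MFC script /phon/ENG523/jimielke/"
           "one_script_out_2022Apr04_13h29_MFC.praat has been created")
stamp = extract_timestamp(message)
print(stamp)
print(f"zip -r my_MFC *{stamp}*")
```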
Auditorily code your sound clips while seeing spectrograms
A regular Praat MFC script will play your sounds but not show you the spectrograms. To see the spectrograms as in example 2b, we need to make the MFC script into a different kind of script by running a Python script that specifies the time stamp of your regular MFC script:
python2 /phon/scripts/make_editor_mfc.py --timestamp 2022Apr04_13h29
After doing this, you can zip (or rezip) your files as above and download. Run the editor mfc script as above in example 2b (using the Praat menu instead of the Open menu).
Making acoustic measurements with one_script
one_script is a big Praat measurement script originally created by Jeff Mielke and Eric Wilbanks. It is designed to do the work that most measurement scripts are created to do, which is to read one or more textgrid files, find segments of interest, and measure the corresponding sound intervals in some way. The user specifies which files (or which entire corpora) to measure, which segments to measure in which environments, and which measurement or extraction procedures to apply to the target segments.
In part 1, we used Praat to measure the formants from one wav file and its textgrid:
praat /phon/scripts/one_script.praat '/phon/ENG523/files/jeff_vowelplot/jeff.wav;/phon/ENG523/jimielke/jeff_vowelplot_output/jeff.TextGrid' 'VOWEL VL' 'formants()' 'l'
We can also point it to a file that lists multiple wav/textgrid pairs (in this example, it’s the list of all aligned Raleigh Corpus interviews that are available for measurement):
praat /phon/scripts/one_script.praat '/phon/Raleigh/raleigh_files_sub.csv' 'VOWEL VL' 'formants()' 'l'
Each one_script query has three required arguments: (1) the files to be measured, (2) the segment string to be measured in those files, and (3) the measurement procedures to be applied. Some additional options can be specified after those three arguments. In the previous query, /phon/Raleigh/raleigh_files_sub.csv is the file that lists the interview recordings to measure, ‘VOWEL VL’ specifies that all vowels and vowel-liquid sequences should be measured, and ‘formants()’ says to measure the formants with default settings. The ‘l’ at the end is an extra option that tells the script to treat vowel-liquid sequences as a single segment.
Specifying what files to measure
Here are examples of how to direct one_script to various files and sets of files (in all cases it’s necessary to specify the full path to the files, because these paths will be interpreted by a Praat script that is running in a different directory):
- ‘/phon/ENG523/files/jeff_vowelplot/jeff.wav;/phon/ENG523/jimielke/jeff_vowelplot_output/jeff.TextGrid’ tells it to measure one wav file (jeff.wav) and its textgrid.
- /phon/Raleigh/ral_files_mfa.csv: the Raleigh corpus (Raleigh, NC, 235 speakers)
- /phon/ENG536/files/ohio_files.csv: Ohio recordings from SLAAP (various sites in Ohio, 69 speakers)
- /phon/ENG536/files/northtown_files.csv: North Town recordings from SLAAP (one town in South Texas, 43 speakers)
- /phon/ENG536/files/exslave_files.csv: Ex-Slave recordings from the Library of Congress (various places in the U.S., 10 speakers)
- /phon/Buckeye/buckeye_files.csv: The Buckeye corpus (Columbus, OH; 40 speakers; phonetically transcribed, so nasalized vowels, flaps, etc have their own transcription symbols, which needs to be taken into account when comparing this corpus to others)
Ask about other corpora. We have some other corpora collected by other people that may be made available, and we have collections of speech (including lab speech with acoustic and articulatory data available) from previous projects on English and various understudied languages.
If you want to measure your own set of files with one_script, edit one of these csv files in a spreadsheet program like Excel to replace the old corpus information with your corpus information. The main purpose of this file is to list the path to the wav and textgrid files, and sometimes to provide additional information about the speakers.
Specifying what speech sounds to measure
For most of our English corpora, phonemes are transcribed using arpabet symbols. You can study the dictionary files like /phon/ENG523/files/slaap_dict to familiarize yourself with arpabet symbols. To measure one phoneme in all contexts, just list the phoneme in single quotes:
- ‘IH1’ = all primary stressed /ɪ/
To measure two phonemes in all contexts, list them with a space between them in single quotes:
- ‘S SH’ =/s ʃ/
There are some wildcards that will match more than one segment. These are listed below. For example:
- ‘/VOWEL/’ = all vowels
To specify the context, separate it from the target sounds with slashes. The target segment(s) are the ones in the middle. Note the difference between the two /h/+vowel examples:
- ‘HH/VOWEL/D’ = vowels between /h/ and /d/
- ‘HH/VOWEL/’ = all vowels after /h/
- ‘/HH/VOWEL’ = all /h/ before a vowel
There should always be an even number of slashes (0, 2, or 4) because there needs to be a unique middle (target) position, but you don’t have to put anything between all the slashes. Use # for word boundaries.
- ‘//S/T/R’ = /s/ in /stɹ/ clusters
- ‘/#/S/T/R’ = word-initial /s/ in /stɹ/ clusters
- ‘VOWEL/#/S/T/R’ = word-initial /s/ in /stɹ/ clusters after vowel-final words
- ‘/#/S/P T K/R’ = word-initial /s/ in /spɹ/, /stɹ/, /skɹ/ clusters
- ‘/#/S/STOP/R’ = word-initial /s/ in /s/+stop+/ɹ/ clusters
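To make the "unique middle position" rule concrete, here is a Python sketch that splits a segment string into left context, target, and right context. It is only an illustration of the notation described above, not the parser that one_script itself uses, and the function name is hypothetical. Unlike the script, which expects 0, 2, or 4 slashes, the sketch accepts any even number.

```python
def parse_segment_string(segment_string):
    """Split a segment string into left context, target, and right context.

    Each position is returned as a list of alternatives (space-separated in
    the query); an empty list means nothing was specified for that position.
    """
    positions = [part.split() for part in segment_string.split("/")]
    slashes = len(positions) - 1
    if slashes % 2 != 0:
        raise ValueError("need an even number of slashes so there is a unique middle position")
    middle = slashes // 2
    return {
        "left": positions[:middle],
        "target": positions[middle],
        "right": positions[middle + 1:],
    }

for query in ["S SH", "HH/VOWEL/", "/HH/VOWEL", "/#/S/P T K/R"]:
    print(query, "->", parse_segment_string(query))
```

Running it on ‘HH/VOWEL/’ and ‘/HH/VOWEL’ shows why the two differ: the same three positions are present, but a different one ends up in the middle.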
Here are the available wildcard symbols that match one or more English phones. The Buckeye corpus lowercase symbols and phonetic variants are included automatically.
- VOWEL = TNS LAX
- TENSE = TNS = IY1 IY2 IY0 EY1 EY2 EY0 EYR1 EYR2 EYR0 OW1 OW2 OW0 UW1 UW2 UW0
- LAX = IH1 IH2 IH0 IR1 IR2 IR0 EH1 EH2 EH0 AH1 AH2 AH0 AE1 AE2 AE0 AY1 AY2 AY0 AW1 AW2 AW0 AA1 AA2 AA0 AAR1 AAR2 AAR0 AO1 AO2 AO0 OY1 OY2 OY0 OR1 OR2 OR0 ER1 ER2 ER0 UH1 UH2 UH0 UR1 UR2 UR0
- HIGH = HI = IY1 IY2 IY0 IH1 IH2 IH0 IR1 IR2 IR0 UH1 UH2 UH0 UW1 UW2 UW0 UR1 UR2 UR0
- MID = EY1 EY2 EY0 EYR1 EYR2 EYR0 EH1 EH2 EH0 AH1 AH2 AH0 AO1 AO2 AO0 OW1 OW2 OW0 OY1 OY2 OY0 OR1 OR2 OR0 ER1 ER2 ER0
- LOW = AE1 AE2 AE0 AY1 AY2 AY0 AW1 AW2 AW0 AA1 AA2 AA0 AAR1 AAR2 AAR0
- BACK = BCK = AA1 AA2 AA0 AAR1 AAR2 AAR0 AO1 AO2 AO0 OW1 OW2 OW0 OY1 OY2 OY0 OR1 OR2 OR0 UH1 UH2 UH0 UW1 UW2 UW0 UR1 UR2 UR0
- CENTRAL = CNT = AH1 AH2 AH0 AY1 AY2 AY0 AW1 AW2 AW0 ER1 ER2 ER0
- FRONT = FRO = IY1 IY2 IY0 IH1 IH2 IH0 IR1 IR2 IR0 EY1 EY2 EY0 EYR1 EYR2 EYR0 EH1 EH2 EH0 AE1 AE2 AE0
- UNROUNDED = UNR = IY1 IY2 IY0 IH1 IH2 IH0 IR1 IR2 IR0 EY1 EY2 EY0 EYR1 EYR2 EYR0 EH1 EH2 EH0 AH1 AH2 AH0 AE1 AE2 AE0 AY1 AY2 AY0 AA1 AA2 AA0 AAR1 AAR2 AAR0
- ROUNDED = ROUND = RND = AW1 AW2 AW0 AO1 AO2 AO0 OW1 OW2 OW0 OY1 OY2 OY0 OR1 OR2 OR0 ER1 ER2 ER0 UH1 UH2 UH0 UW1 UW2 UW0 UR1 UR2 UR0
- DIPHTHONG = DIPH = AY1 AY2 AY0 AW1 AW2 AW0 OY1 OY2 OY0
- STRESSED = STR = IY1 IH1 IR1 EY1 EYR1 EH1 AH1 AE1 AY1 AW1 AA1 AAR1 AO1 OW1 OY1 OR1 ER1 UH1 UW1 UR1
- SECONDARY = 2ND = IY2 IH2 IR2 EY2 EYR2 EH2 AH2 AE2 AY2 AW2 AA2 AAR2 AO2 OW2 OY2 OR2 ER2 UH2 UW2 UR2
- UNSTRESSED = UNS = IY0 IH0 IR0 EY0 EYR0 EH0 AH0 AE0 AY0 AW0 AA0 AAR0 AO0 OW0 OY0 OR0 ER0 UH0 UW0 UR0
- VL = IY1L IY2L IY0L EY1L EY2L EY0L OW1L OW2L OW0L UW1L UW2L UW0L IH1L IH2L IH0L EH1L EH2L EH0L AH1L AH2L AH0L AE1L AE2L AE0L AY1L AY2L AY0L AW1L AW2L AW0L AA1L AA2L AA0L AO1L AO2L AO0L OY1L OY2L OY0L UH1L UH2L UH0L (note that these transcriptions are probably only present when you use the ‘l’ option to merge vowel+liquid sequences)
- VR = IY1R IY2R IY0R EY1R EY2R EY0R OW1R OW2R OW0R UW1R UW2R UW0R IH1R IH2R IH0R EH1R EH2R EH0R AH1R AH2R AH0R AE1R AE2R AE0R AY1R AY2R AY0R AW1R AW2R AW0R AA1R AA2R AA0R AO1R AO2R AO0R OY1R OY2R OY0R UH1R UH2R UH0R (note that these transcriptions are probably only present when you use the ‘l’ option to merge vowel+liquid sequences)
Consonants by manner and voicing
- VOICELESS = VLS = P T K CH F TH S SH HH
- VOICED = VOI = B D G JH V DH Z ZH
- SONORANT = SON = M N NG L W R Y
- CONSONANT = CONS = VLS VOI SON
- OBSTRUENT = OBS = VLS VOI
- STOP = P T K B D G
- AFFRICATE = AFFR = CH JH
- FRICATIVE = FRIC = F TH S SH HH V DH Z ZH
- SIBILANT = SIB = CH S SH JH Z ZH
- NASAL = NAS = M N NG
- APPROXIMANT = APPROX = APPR = GLI LIQ
- LIQUID = LIQ = L R
- GLIDE = GLI = W Y
Consonants by place
- LABIAL = LAB = BILAB LABDENT W
- BILABIAL = BILAB = P B M
- LABIODENTAL = LABDENT = F V
- CORONAL = COR = DENT ALV POSTALV PAL LIQ
- DENTAL = DENT = TH DH
- ALVEOLAR = ALV = T D S Z N
- PALATAL = PAL = Y
- POSTALVEOLAR = PLV = CH JH SH ZH
- DORSAL = DOR = VEL W
- VELAR = VEL = K G NG
- GLOTTAL = LARYNGEAL = LAR = HH
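Several wildcards are defined in terms of other wildcards (for example, OBS = VLS VOI). The Python sketch below shows one way such nested definitions could be expanded into a flat list of phones. It uses only a small subset of the consonant definitions listed above, and the expansion logic is an illustration rather than the code in one_script.

```python
# A small subset of the wildcard definitions listed above.
WILDCARDS = {
    "VLS": "P T K CH F TH S SH HH",
    "VOI": "B D G JH V DH Z ZH",
    "SON": "M N NG L W R Y",
    "OBS": "VLS VOI",
    "CONS": "VLS VOI SON",
    "LIQ": "L R",
    "GLI": "W Y",
    "APPR": "GLI LIQ",
}

def expand(symbol, table=WILDCARDS):
    """Recursively expand a wildcard into a list of phone symbols."""
    if symbol not in table:
        return [symbol]              # a plain phone such as S or IY1
    phones = []
    for part in table[symbol].split():
        for phone in expand(part, table):
            if phone not in phones:  # keep order, drop duplicates
                phones.append(phone)
    return phones

print(expand("APPR"))
print(len(expand("CONS")))
```

The same idea explains why listing phones directly (as in ‘P T K’) and using a wildcard are interchangeable in a query: both end up as a set of phone labels to match.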
Other languages with wildcards set up
For working with languages other than English, you may want to add your own wildcards to the script. You can always just list the consonants and vowels that you want to search for in your textgrids without one_script understanding what they refer to (for example, if you didn’t have ‘VOCALES’ as an option, you could use ‘i u e o a’):
- VOCALES = Spanish vowels (IPA)
- CONSONANTES = Spanish consonants (mostly IPA)
- VOYELLES = French vowels (IPA)
- CONSONNES = French consonants (IPA)
- KALASHA_VOWEL = Kalasha vowels (IPA)
- KALASHA_CONS = Kalasha consonants (IPA)
- BORA_VOWEL = Bora vowels (IPA)
- BORA_CONS = Bora consonants (IPA)
- NATQGU_VOWEL = Natqgu vowels (IPA)
- NATQGU_CONS = Natqgu consonants (IPA)
Specifying what measurement procedures to apply
When you run one_script.praat, it calls a file called something like one_script_procedures_24.praat, which contains all the measurement procedures. We can add new measurement procedures to this file. Many procedures have already been defined. Depending on what you want to do, you may need to talk to me about using the existing procedures or making a new one.
Here are some procedures useful for acoustic measurements:
- duration(): simply measure segment duration
- formants(): measure formant frequencies
- cog(): simply measure center of gravity
- cog_jim(): measure center of gravity of middle 60% after filtering 750-11025 Hz
- cog_pro(): measure center of gravity at many time points with band filtering
- intensity(): measure intensity at many time points
- pvi(): make duration measurements and calculate the Pairwise Variability Index
- harmonicity_rise: Find when harmonicity rises fastest (an indicator of the onset of
- sibilant_jane: measure spectral peak, spectral slope, center of gravity, and spectral spread
- energy: measure the energy in a particular frequency band
- harmonicity: measure the ratio of harmonics to noise (i.e., how vowel-like vs. how fricative-like an interval is)
These procedures are useful for listening to instances of the thing you are measuring:
- clips(): extract word clips
- trigrams(): extract three-word clips
- mfc(): extract word clips and make a multiple forced-choice experiment script
These procedures are useful for other practical purposes:
- frames(): extract video frames using avconv
- tier3(): resegment manually on tier 3
Multiple procedures can be called in a single query, like this:
praat /phon/scripts/one_script.praat '/phon/Raleigh/raleigh_files_sub.csv' 'VOWEL VL' 'duration(),formants()' 'l'
While logged in to a phonetics lab computer, you can see what version of one_script we’re currently using like this:
Then you can list all the procedure names in the appropriate procedures file like this:
grep '^procedure' '/phon/scripts/one_script_procedures_24.praat'
Most procedures have some documentation right before the start of the procedure. You can view it like this:
grep -B 15 'procedure formants (' /phon/scripts/one_script_procedures_24.praat
The vowel formant measurement procedure
The procedure ‘formants()’ is probably the most popular and the most complicated. It will use a procedure modeled after FAVE (Labov, Rosenfelder, and Fruehwald 2013) to measure formants several different ways (holding ceiling frequency constant and varying the number of poles/formants) and choose the result that is most similar to a prototype for that vowel phoneme in that language or dialect, if prototypes are available. By default, it will output F1, F2, and F3 frequencies at 0%, 25%, 50%, 75%, and 100% of the vowel’s duration, but many options are available. Here are some examples:
- ‘formants()’ = default settings (F1, F2, and F3 frequencies measured at 0%, 25%, 50%, 75%, and 100% of the vowel’s duration using Raleigh English prototypes)
- ‘formants(output_formants=5,measurements=11)’ = F1-F5 frequencies measured at 11 evenly spaced time points (0%, 10%, 20%, …, 100%) using Raleigh English prototypes
- ‘formants(language=Spanish,bandwidths=1,amplitudes=1)’ = F1, F2, and F3 frequencies, bandwidths, and amplitudes at 0%, 25%, 50%, 75%, and 100% using CERD Spanish prototypes
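The procedure argument packs a name and a list of key=value options into one string, and the measurements option determines evenly spaced time points. The Python sketch below illustrates both ideas. The function names are hypothetical, the parsing is not how one_script reads its arguments, and rejecting fewer than two time points is a choice made for this sketch rather than documented behavior.

```python
def parse_procedure(call):
    """Split a call like 'formants(language=Spanish,bandwidths=1)' into a name and an options dict."""
    name, _, rest = call.partition("(")
    options = {}
    for item in rest.rstrip(")").split(","):
        if item.strip():
            key, _, value = item.partition("=")
            options[key.strip()] = value.strip()
    return name.strip(), options

def time_points(measurements=5):
    """Evenly spaced percentages of the vowel's duration, from 0% to 100%."""
    if measurements < 2:
        raise ValueError("need at least two measurement points")
    return [100 * i / (measurements - 1) for i in range(measurements)]

name, options = parse_procedure("formants(output_formants=5,measurements=11)")
print(name, options)
print(time_points(int(options.get("measurements", 5))))
print(time_points())  # the default five time points
```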
Some additional options can be specified after the procedure argument:
- ‘w’ = ignore word boundaries
- ‘d’ = exclude segments below a minimum duration threshold of 50 ms
- ‘l’ = merge postvocalic liquids with vowels
- ‘f’ = exclude English function words
- ‘wdlf’ = ignore word boundaries, exclude segments below a minimum duration threshold of 50 ms, merge postvocalic liquids with vowels, and exclude English function words
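Because the option letters are combined into a single string, it can help to spell out what a given combination means before running a long query. Here is a small Python sketch that does this using the four descriptions above. Whether one_script cares about the order of the letters is not stated here, so the sketch simply reports them in the order given.

```python
OPTION_MEANINGS = {
    "w": "ignore word boundaries",
    "d": "exclude segments below a minimum duration threshold of 50 ms",
    "l": "merge postvocalic liquids with vowels",
    "f": "exclude English function words",
}

def describe_options(flags):
    """Translate an option string such as 'wdlf' into plain-language descriptions."""
    unknown = [flag for flag in flags if flag not in OPTION_MEANINGS]
    if unknown:
        raise ValueError(f"unrecognized option letters: {unknown}")
    return [OPTION_MEANINGS[flag] for flag in flags]

for line in describe_options("dl"):
    print(line)
```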
Sometimes it is desirable to exclude certain words, for example when we are working with laboratory speech that was produced in a carrier phrase and we do not want to measure the carrier phrase words. The words to exclude are specified after the ‘wdlf’ options above, so if none of those options are used, there need to be two single quotes (an empty argument) after the procedure argument to keep the words to exclude from being parsed as if they were ‘wdlf’ options. Here is a hypothetical example for an experiment using the carrier phrase “Please say [target word] again.”:
praat /phon/scripts/one_script.praat '/phon/ENG523/my_lab_corpus.csv' 'VOWEL' 'formants()' '' 'please say again'
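The argument order matters here: files, segment string, procedures, then options, then words to exclude. The Python sketch below assembles a command line in that order and inserts the empty options argument when words are excluded without any options. The function name and its parameters are hypothetical helpers for illustration. shlex.join only adds quotes where the shell needs them, so the printed command may show fewer quotes than the examples in this tutorial while meaning the same thing.

```python
import shlex

def build_query(files, segments, procedures, options="", exclude_words=None,
                script="/phon/scripts/one_script.praat"):
    """Assemble a one_script command line from its arguments."""
    args = ["praat", script, files, segments, ",".join(procedures)]
    if options or exclude_words:
        # The options slot must be present (even if empty) before excluded words.
        args.append(options)
    if exclude_words:
        args.append(" ".join(exclude_words))
    return shlex.join(args)

print(build_query("/phon/ENG523/my_lab_corpus.csv", "VOWEL", ["formants()"],
                  exclude_words=["please", "say", "again"]))
print(build_query("/phon/Raleigh/raleigh_files_sub.csv", "VOWEL VL",
                  ["duration()", "formants()"], options="l"))
```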
Every one_script query should produce one one_script_out file and one one_script_log file with the same time stamp. Some types of queries (like MFC script creation) will have additional outputs. The “log” file records information about what kind of query you ran, and the “out” file has one row for each measured token, and columns with information about each token. Every one_script_out file should start with the same 18 columns to help you interpret the measurements and link them with other important information:
- 1 speaker is a code for the speaker (can be used to link to demographic information)
- 2-3 textgrid and sound refer to the files used to make the measurements
- 4 phonetier indicates what tier of the textgrid was the speaker’s phone tier
- 5-6 word_id and token_id are unique identifiers for the word and phone measured in each row (can be used to link to other phonetic measurements or auditory coding results)
- 8 word is the word (can be linked to lexical information such as frequency or neighborhood
- 7,9 leftword and rightword list the preceding and following word tier intervals (words, pauses,
- 10 phone is the phone that was measured (can be linked to phonological information such as place, manner, and voicing)
- 11-12 phonestart and phoneend are the start and end times of the phone interval (can be used to measure duration after the fact or interpret measurements taken at a particular time point)
- 15,14,13 left, left1, and left2 are the preceding phone interval, the phone interval before that, and the phone interval before that (can be linked to phonological information such as place, manner, and voicing of the context where the measured phone is occurring)
- 16,17,18 right, right1, right2 are the same as left, etc. but for following phone intervals
The remaining columns will depend on what procedure(s) you ran. For example, formants() will add these additional columns:
- total_formants and max_formant describe the measurement parameters of the selected measurements
- mdist, sos, energy, rsq1, rsq2, and rsq3 are several measures of how good the formant measurements are
- F1_0, F1_25, F1_50, F1_75, and F1_100 are measurements of F1 at five time points
- F2_0, F2_25, F2_50, F2_75, and F2_100 are measurements of F2 at five time points
- F3_0, F3_25, F3_50, F3_75, and F3_100 are measurements of F3 at five time points
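As an illustration of how the shared leading columns can be used, here is a Python sketch that lists the 18 column names in the order implied by the column numbers above and computes a duration for each token from phonestart and phoneend. The two data rows are invented, and the tab-separated layout is an assumption, so check the delimiter and header of your own one_script_out file before adapting this.

```python
import csv
import io

# The 18 leading columns, in the order given by the column numbers above.
LEADING_COLUMNS = [
    "speaker", "textgrid", "sound", "phonetier", "word_id", "token_id",
    "leftword", "word", "rightword", "phone", "phonestart", "phoneend",
    "left2", "left1", "left", "right", "right1", "right2",
]

def add_durations(rows):
    """Add a 'duration' field (phoneend minus phonestart) to each row."""
    for row in rows:
        row["duration"] = float(row["phoneend"]) - float(row["phonestart"])
    return rows

# Invented two-token example; tab-separated output is an assumption.
example = "\t".join(LEADING_COLUMNS) + "\n" + "\n".join([
    "\t".join(["spk1", "a.TextGrid", "a.wav", "1", "w1", "t1", "THE", "STREET",
               "IS", "S", "1.00", "1.12", "", "DH", "AH0", "T", "R", "IY1"]),
    "\t".join(["spk1", "a.TextGrid", "a.wav", "1", "w2", "t2", "A", "STRONG",
               "WIND", "S", "2.50", "2.58", "", "", "AH0", "T", "R", "AO1"]),
])
rows = add_durations(list(csv.DictReader(io.StringIO(example), delimiter="\t")))
for row in rows:
    print(row["word"], row["phone"], round(row["duration"], 3))
```

Because word_id and token_id are unique identifiers, the same rows could also be matched against auditory coding results from an MFC experiment.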