R | NC State Phonetics Lab

R scripts (for data analysis and visualization)

R is a free programming language and software environment for statistical computing and graphics, and it is highly extensible via packages. Like other languages and software, it can be used for data manipulation, measurement, statistical analysis, and plotting. You can download R at http://www.r-project.org/, and documentation for base R and its packages (available via CRAN mirrors) is easily found on the web; a list of free and open-source references about R and statistical analysis in R is available here. See the ultrasound analysis section for instructions for using SSANOVA in R.

DATA STRUCTURES

The essential units of operation in R are called data structures, of which the most basic is the vector. A vector is loosely analogous to a list in Python, except that all of its items must be of the same type; it is created by assigning any number of items to a single name. In the code below, the first line assigns four numeric values to x, and the second line assigns a series of four strings, or character values (in single or double quotes), to y. Other common object classes are logical, list, matrix, array, factor, and dataframe. To observe what the third line does, print the variable to the console by either entering z alone or using print(z):

x = c(10, 4, 16, 3)
y = c('one', 'two', 'three', 'four')
z = rep(1:10, each = 4)
print(z)
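As a rough illustration of the other object classes named above, the sketch below builds one small object of each kind. All names and values here are made up for the example.

```r
# Hypothetical values, for illustration only
flags = c(TRUE, FALSE, TRUE)                    # logical vector
vowels = factor(c('AE1', 'EH1', 'AE1'))         # factor (categorical variable)
m = matrix(1:6, nrow = 2)                       # matrix with 2 rows and 3 columns
info = list(name = 'speaker01', f1 = c(650, 700))        # list: items may differ in type
df = data.frame(phone = vowels, f1 = c(650, 540, 700))   # dataframe

class(flags)     # "logical"
class(vowels)    # "factor"
dim(m)           # 2 3
class(info)      # "list"
class(df)        # "data.frame"
```

Unlike a vector, a list or dataframe can hold items of different types side by side, which is why imported spreadsheets end up as dataframes.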

Notice that to assign several values to a vector, we use the function c() (short for combine or concatenate); the third line uses the function rep() instead. Other useful simple functions include max(), min(), range(), sum(), length(), etc. To learn what a specific function does and how to use it, enter, for example, ?rep into the R console to pull up its documentation (if you are running R in a terminal, hit ‘q’ to return to the console). You can also assign the output of a function to a new variable, for example:

h = sum(x,z)
h
[1] 253
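For a quick feel for the other simple functions mentioned above, here is a short sketch applied to the same vector x. The name x_mean is only an illustrative choice.

```r
x = c(10, 4, 16, 3)

max(x)       # 16
min(x)       # 3
range(x)     # 3 16 (minimum and maximum together)
sum(x)       # 33
length(x)    # 4

# Store a result in a new variable, as with h above
x_mean = sum(x) / length(x)   # 8.25, the same value mean(x) returns
```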

Commonly, you will have your data in a tab- or comma-delimited text or csv file (e.g., as the output from a Praat script). When you import this data into R, it will be in the form of a dataframe containing all your header columns as named vectors, with all the values that were in the rows of the spreadsheet as values within the vector. There are a number of ways to import files into your workspace.

First, you may want to change the directory you are working from to the location where your files are.

setwd('/home/megan/Documents/Data')
getwd()
[1] "/home/megan/Documents/Data"

Next, depending on whether you’re using a tab-delimited file or a csv (comma-separated values) file, you’ll do one of the following to open a data file called, for example, ‘test.*’, and print a summary of the data:

mydata = read.table('test.txt', header = TRUE, sep = '\t')
mydata = read.csv('test.csv', header = TRUE)
summary(mydata)

If you don’t set your working directory beforehand, you can simply enter the full path to the file in the read.table or read.csv command. The example arguments above may be more or less verbose than what you need depending on the format of your data file (e.g., what your NA strings are), so it may be useful to read the documentation.
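A minimal sketch of both points, assuming a made-up file whose missing values are written as '--undefined--' (the file contents and that NA string are assumptions for illustration). The file is written to a temporary location first so the example runs on its own.

```r
# Write a tiny hypothetical csv file to a temporary location
path = tempfile(fileext = '.csv')
writeLines(c('phone,F1,F2',
             'AE1,650,1700',
             'EH1,--undefined--,1850',
             'IH1,400,2000'), path)

# Give the full path directly, and tell R which strings count as missing
mydata = read.csv(path, header = TRUE, na.strings = c('NA', '--undefined--'))
summary(mydata)
sum(is.na(mydata$F1))   # 1 missing value in F1
```

Without the na.strings argument, the F1 column here would be read as text rather than numbers, because one of its values is not numeric.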

Within a dataframe, you can use the $ operator to reference columns (your variables), and square brackets to reference rows (observations) or to index items in a particular row or column. As an example, say you imported a file containing formant measurements, with the headers phone, F1, and F2, and named the dataframe “formants”. This type of referencing allows you to, for example, create subsets of your data:

formants$F1   # prints all observed values for this variable
levels(formants$phone)   # prints factor levels for this variable (phone must be a factor), e.g., [1] "AE1" "EH1" "IH1"
ae = subset(formants, phone == "AE1")   # creates a subset where phone == "AE1" is TRUE
ae = formants[formants$phone == "AE1",] # same as above
formants = na.omit(formants) # gets rid of rows that contain NA values
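The sketch below tries the same kinds of references on a small made-up dataframe, so you can see what each one returns. The values are hypothetical, and complete.cases() is shown as one possible way to keep only rows without missing values.

```r
# Hypothetical formant measurements
formants = data.frame(phone = factor(c('AE1', 'EH1', 'AE1', 'IH1')),
                      F1 = c(650, 540, NA, 400),
                      F2 = c(1700, 1850, 1750, 2000))

formants$F1[2]        # second value of the F1 column: 540
formants[2, ]         # the whole second row
formants[2, 'F2']     # row 2, column F2: 1850

ae = formants[formants$phone == 'AE1', ]           # the 2 rows where phone is AE1
complete = formants[complete.cases(formants), ]    # the 3 rows with no NA in any column
```

Inside the square brackets, the part before the comma selects rows and the part after it selects columns; leaving one side empty selects everything.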

Logical operators in R, which return either TRUE or FALSE, include:

== is equal to
!= is not equal to
& and
| or (this is called “pipe”)
< less than
> greater than
<= less than or equal to
>= greater than or equal to

For example, observe:

3 == 2
[1] FALSE
3 > 2
[1] TRUE
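These operators also work on whole vectors at once, returning one TRUE or FALSE per item, which is what makes them useful for subsetting. A short sketch with made-up F1 values:

```r
f1 = c(650, 540, 700, 400)   # hypothetical F1 values

f1 > 600                  # TRUE FALSE TRUE FALSE
f1 > 500 & f1 < 680       # TRUE TRUE FALSE FALSE
f1 < 450 | f1 >= 700      # FALSE FALSE TRUE TRUE
sum(f1 > 600)             # 2, because TRUE counts as 1 and FALSE as 0
```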

Example 2:

If your project involves measuring the distance between vowel formants, the command below is a helpful one to run in RStudio. It uses ddply() from the plyr package, so load that package first; then simply replace the target vowels below with the vowels that you wish to analyze:

library(plyr)   # provides ddply() and summarize()
myspeakersummary = ddply(myphonesummary, .(speaker), summarize,
    mean_pitch=mean(mean_pitch, na.rm=TRUE),
    F1AA=mean(F1[phone=='AA1'], na.rm=TRUE),
    F1AO=mean(F1[phone=='AO1'], na.rm=TRUE),
    F2AA=mean(F2[phone=='AA1'], na.rm=TRUE),
    F2AO=mean(F2[phone=='AO1'], na.rm=TRUE),
    lowback_dist=sqrt((F1AA-F1AO)^2+(F2AA-F2AO)^2),
    F1=mean(F1, na.rm=TRUE), F2=mean(F2, na.rm=TRUE))
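The value stored in lowback_dist is the Euclidean distance between the two vowels' mean (F1, F2) points. The sketch below works through that arithmetic in base R on a made-up table; the data values and the use of split() and sapply() are assumptions for illustration, not a replacement for the command above.

```r
# Hypothetical per-phone summary for two speakers
myphonesummary = data.frame(
    speaker = c('s01', 's01', 's02', 's02'),
    phone = c('AA1', 'AO1', 'AA1', 'AO1'),
    F1 = c(750, 650, 700, 700),
    F2 = c(1200, 1000, 1100, 1100))

# For each speaker, take the distance between the AA1 and AO1 means
lowback = sapply(split(myphonesummary, myphonesummary$speaker), function(d){
    aa = d[d$phone == 'AA1', ]
    ao = d[d$phone == 'AO1', ]
    sqrt((mean(aa$F1) - mean(ao$F1))^2 + (mean(aa$F2) - mean(ao$F2))^2)
})
lowback   # s01: sqrt(100^2 + 200^2), about 223.6; s02: 0
```

A distance near 0, as for the second made-up speaker, would suggest that the two vowels occupy the same position in the vowel space.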

To merge data:

# by.x and by.y name the matching column in the first and second dataframe
yourfoldername <- merge(yourdocument, yourdocument, by.x = '__', by.y = '_')
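As a concrete sketch of a merge, the example below joins two small made-up dataframes whose ID columns have different names. All dataframe names, column names, and values are hypothetical.

```r
# Hypothetical measurements and speaker information
formants = data.frame(speaker_id = c('s01', 's02', 's03'),
                      F1 = c(650, 540, 700))
demographics = data.frame(id = c('s01', 's02', 's04'),
                          age = c(34, 52, 27))

# Match formants$speaker_id against demographics$id
merged = merge(formants, demographics, by.x = 'speaker_id', by.y = 'id')
merged   # two rows remain: s01 and s02, the speakers found in both

# Keep every row of the first dataframe, filling unmatched cells with NA
merged_all = merge(formants, demographics,
                   by.x = 'speaker_id', by.y = 'id', all.x = TRUE)
```

By default merge() keeps only rows that have a match in both dataframes, so it is worth checking the number of rows before and after merging.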

FOR LOOPS IN R

To create a for loop in R, the syntax is as below:

for(i in 1:10){
    print(i)
}

The counter variable is defined within the parentheses and the command to execute is within braces. In this example, the command print() is repeated ten times, once for each value of the counter variable i. Unlike in Python, the indentation is not obligatory, but it helps with readability, as does consistency. The output of this for loop looks like:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

It is common to need to perform an operation on each level of a categorical variable, in which case your for loop may look something like the code below. In this example, the for loop creates a subset for each level of a variable called “phone” and collects the mean values in a vector:

means = c()
for(i in levels(data$phone)){
    sub_phone = subset(data, phone == i)
    means[i] = mean(sub_phone$f1)
}
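To see what this loop produces, here is the same pattern run on a small made-up dataframe (the values are hypothetical):

```r
# Hypothetical data: five tokens of three vowels
data = data.frame(phone = factor(c('AE1', 'EH1', 'AE1', 'EH1', 'IH1')),
                  f1 = c(650, 540, 700, 560, 400))

means = c()
for(i in levels(data$phone)){
    sub_phone = subset(data, phone == i)   # rows for this vowel only
    means[i] = mean(sub_phone$f1)          # stored under the vowel's name
}
means   # named vector: AE1 = 675, EH1 = 550, IH1 = 400
```

Because the counter i is a character string here, each mean is stored under the name of its level, so means['AE1'] retrieves the value for that vowel.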

Note that before the for loop, you need to create an empty vector named “means” in which to store the values returned by the operation. You can also do means = NULL. Because such for loops are commonly used, it is very helpful to have an ID variable of some kind in your dataframe which identifies unique observations. If, on the other hand, you want to perform an operation on each row of your data, a useful way to define your counter is:

for(i in 1:length(data$variable)){
    ...
}
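A small runnable sketch of a row-by-row loop, using made-up values. It uses seq_len(nrow(data)) as the counter, which is an alternative way of counting from 1 to the number of rows and also behaves sensibly when the data has no rows.

```r
# Hypothetical data
data = data.frame(f1 = c(650, 540, 700),
                  f2 = c(1700, 1850, 1750))

data$diff = NA                      # column to hold the results
for(i in seq_len(nrow(data))){
    data$diff[i] = data$f2[i] - data$f1[i]
}
data$diff   # 1050 1310 1050
```

For a simple calculation like this one, data$f2 - data$f1 gives the same result without a loop; the loop form is shown because it generalizes to operations that need to look at neighbouring rows.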

OTHER LOOPS and STATEMENTS IN R

While loops in R take the following form:

while(i < 3){
    i = i + 1
}

In this case, you need to already have a counter variable (here called “i”) defined before the while loop starts.
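A complete version that can be pasted into the console, with the counter defined first and a print() added so the progress is visible:

```r
i = 0              # the counter must exist before the loop starts
while(i < 3){
    i = i + 1
    print(i)
}
# prints 1, 2, 3; once the loop ends, i is 3
```

If the counter is never changed inside the braces, the condition stays true and the loop never ends, so check that line first when a while loop appears to hang.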

In if else statements, R checks whether the condition in parentheses is true. If it is, R runs the code inside the first set of braces; if it is not, R runs the code inside the braces that follow else. For example, if we want to perform separate operations on pre-nasal versus pre-l versus all other segments, we could use a for loop with if else statements that looks something like this:

for(i in 1:(length(data$phone)-1)){
    if(data$phone[i+1]%in%c('M','N','NG')){
        ...
    }else if(data$phone[i+1]=='L'){
        ...
    }else{
        ...
    }
}

In this example, the counter in the for loop ensures that R checks each row of the data and first looks for cases where data$phone[i+1] matches a nasal consonant using %in% syntax. The [i+1] is a way of looking ahead to the following context (e.g., only matching pre-nasal vowels). The “else if” then checks for pre-l environments, and the final “else” will match anything left.
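Here is one way the elided parts might be filled in, as a sketch on made-up data: each row is labelled according to the segment that follows it. The context vector and its labels are hypothetical names chosen for the example.

```r
# Hypothetical sequence of segments, one per row
data = data.frame(phone = c('AE1', 'N', 'IH1', 'L', 'EH1', 'T'),
                  stringsAsFactors = FALSE)

context = rep(NA, nrow(data))   # one label per row, filled in by the loop
for(i in 1:(length(data$phone)-1)){
    if(data$phone[i+1] %in% c('M','N','NG')){
        context[i] = 'pre-nasal'
    }else if(data$phone[i+1] == 'L'){
        context[i] = 'pre-l'
    }else{
        context[i] = 'other'
    }
}
context   # "pre-nasal" "other" "pre-l" "other" "other" NA
```

The last row stays NA because the loop stops one row early: there is no following segment to look at. Note that this sketch labels every row, consonants included; restricting it to vowels would need an extra condition.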

One important thing to note is that else must be on the same line as the closing brace of the preceding block (as in “}else{” above); if else starts a new line, you will typically encounter an error. In general, if you encounter errors while writing for loops, it can be helpful to run the code contained within the loop, replacing counter variables with actual known values. Additionally, make liberal use of the print() function in order to check values of variables as you are writing to ensure that you’re getting what you expect.

If you are constructing elaborate for loops that are computationally intensive, it can be a better idea to instantiate your variables using vectors of the correct size (rather than using variable = c() or variable = NULL). Do this by creating a vector of NA values or 0s. For example, you can do either of the following:

variable = numeric(length=50)
variable <- rep(NA, length(data$phone))
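A minimal sketch of a loop that fills a vector created at its full size beforehand. The name squares and the size n are arbitrary choices for the example.

```r
n = 1000
squares = numeric(length = n)   # vector of n zeros, created once

for(i in 1:n){
    squares[i] = i^2            # fills an existing slot instead of growing the vector
}

length(squares)   # 1000
squares[10]       # 100
```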

APPLY FUNCTIONS IN R

Apply functions are powerful tools built into R that can perform the same calculations as a large loop, usually with much less code and often with less processing time. This link will take you to a decent explanation of how to use all the built-in apply functions and what kind of data can be processed with each one.

Of the apply functions, tapply tends to be the most useful for the kinds of datasets we work with as linguists studying sound, among other things. The function tapply performs another function (e.g., mean, standard deviation, etc.) on a variable of your dataframe, given the conditions you provide. A list of built-in functions in R can be found here: http://www.statmethods.net/management/functions.html

The syntax of tapply, as simply as possible, is this:

tapply(variable on which to perform function, grouping variable, function to perform)

Alternatively, you can think of it this way:

tapply(DV, IV, function)

As an example, two functions we’ll use are sum() and length(): the first adds things together, and the second returns the length of a vector.

In this example, I’ll use the dataset “velar”, which has realizations of (IN) or (ING) for roughly 60 speakers. The data frame has a variable for speakers (called “Speaker”) and for (IN/ING) realization (called “Realization”), coded as “in” or “ing”. Using tapply, I want to get the percentage of (IN) used by each speaker. This would take a long time by hand, and would be more complicated with a for or while loop.

First, I want to know how many (IN) realizations each speaker produced in total. So I use tapply:

i = tapply( (velar$Realization == "in") , velar$Speaker, sum)

This returns a vector, organized by Speaker number, that counts how many (IN)s were coded for each Speaker: the comparison is TRUE for every (IN), and sum() adds up the TRUEs for each speaker. Next, I need the total number of possible instances of (IN) for each speaker, which I will divide by to get the percentage:

j = tapply( velar$Realization, velar$Speaker, length)

“j” is a vector that contains the length of the Realization vector for each speaker (essentially, the total number of instances of (ING) or (IN)). Now divide “i” by “j” and multiply by 100 to get the percentage:

percent = 100 * i / j

And now “percent” is a vector that contains the percentage of (IN) use, organized by speaker number.
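The whole calculation can be tried on a small made-up version of the data, as in the sketch below. The seven tokens are hypothetical, and the names proportion and percent are chosen for the example.

```r
# Hypothetical data: 3 tokens from speaker 1, 4 tokens from speaker 2
velar = data.frame(Speaker = c(1, 1, 1, 2, 2, 2, 2),
                   Realization = c('in', 'ing', 'in', 'ing', 'ing', 'in', 'ing'))

i = tapply(velar$Realization == 'in', velar$Speaker, sum)   # (IN) count per speaker
j = tapply(velar$Realization, velar$Speaker, length)        # total tokens per speaker

proportion = i / j           # speaker 1: 2/3, speaker 2: 1/4
percent = 100 * proportion   # speaker 1: about 66.7, speaker 2: 25

# The proportion in one step: the mean of TRUE/FALSE values is the share of TRUEs
tapply(velar$Realization == 'in', velar$Speaker, mean)
```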

Download Cairo for Mac

Cairo is an R package that allows users to create PDFs using IPA symbols. Unfortunately, it does not work out of the box on a Mac, and it can be tricky to install. Before you can use Cairo, you will need to install ‘XQuartz’, a separate piece of software that provides X11 capabilities on macOS. Once this is installed, follow these steps:

Go to http://www.rforge.net/Cairo/files//. Under ‘Repository contents’, download the macOS version of the package that is appropriate for your computer. Note that you might have to troubleshoot by downloading several versions until you find the one that works with the software on your laptop.
Open ‘R Studio’ and select ‘Packages’ in the bottom right pane.
Install > Downloads > select the ‘Cairo.zip’ file you just downloaded.
If the ‘Install’ feature does not let you select a package downloaded onto your drive, make sure the drop-down is set to ‘Package Archive File (.tgz; .tar.gz)’.
Now you should have the option to select ‘Cairo’ in the list of libraries under ‘System Libraries’, or you can type ‘library(Cairo)’.

Now you should be able to use Cairo.

Good coding practices and tips for R and otherwise

1. USE TABS TO SHOW WHERE YOUR LOOPS START AND END.

Python requires indentation and R does not, but you should always do it. Code like this is a mess to read:

for(i in 1:10){
for(j in 1:10){
if(i ==2){
print("1")
}else
{
print("2")
}}}
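For comparison, here is a sketch of the same nested loop with one level of indentation per loop or condition and each closing brace on its own line:

```r
for(i in 1:10){
    for(j in 1:10){
        if(i == 2){
            print("1")
        }else{
            print("2")
        }
    }
}
```

With this layout it is easy to see which brace closes which block, and a missing brace or quotation mark tends to stand out.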

2. ALONG THE SAME LINES, USE COMMENTS EVERYWHERE!

No matter what you think, future you will not understand the code you just wrote. In 6 months you will have no idea what anything does, why you made a variable, what the loops do, or anything else. Comments can save you hours trying to figure out your own code down the line. As a corollary to that, commenting out code you don’t use is oftentimes better than deleting it wholesale.

3. THERE ARE 20 WAYS TO SOLVE THE PROBLEM YOU WANT TO CODE, AND EVERYBODY APPROACHES PROBLEMS DIFFERENTLY.

That said, there are probably 2-4 ways to solve your problem that are quick and efficient uses of your time, so if you find yourself coding something really repetitive, there’s probably an easier way.

4. NEVER WRITE A PROGRAM FROM SCRATCH UNLESS YOU REALLY REALLY HAVE TO.

Coding is about saving time, and you can usually borrow some code from another program rather than writing from the ground up.

5. IF YOU’RE DOING SOMETHING BASIC, SOMEBODY MIGHT HAVE ALREADY WRITTEN A PROGRAM TO DO IT.

Either here in the lab or on the internet, somebody has probably already wanted to do something very similar to what you’re doing. It’ll save you a bunch of time to search on the internet or ask a colleague rather than taking a day to write a program.

6. IF YOU’RE HAVING PROBLEMS, RUN THE PROGRAM PIECE BY PIECE AND PRINT YOUR VARIABLES.

Don’t run the whole program at once; run it bit by bit. If the first half works fine on its own, you can narrow your search down to the second half. It reduces the time you’ll spend searching (and searching, and searching, and searching…) your program for the error.

7. IT’S ALWAYS SOMETHING STUPID.

Errors that take you forever to find are always stupid mistakes. You forgot a bracket. You spelled “apply” wrong one time out of twenty. You used uppercase, not lowercase. Make sure to look for those things, and having somebody else look with you is a great way to fix those kinds of errors, since you’re oblivious to your own typos.

R script repository