1 An Introduction to R

R is a free software environment for statistical computing and graphics. It is extremely powerful and is now widely used for academic research as well as in the commercial sector. Unlike software such as Excel or SPSS, the user has to type commands to get it to execute tasks such as loading in a dataset or performing a calculation. The biggest advantage of this approach is that you can build up a document, or script, that provides a record of what you have done, which in turn makes it straightforward to repeat tasks. Graphics can be easily modified and tweaked by making slight changes to the script or by scrolling through past commands and making quick edits. Unfortunately, command-line computing can also be off-putting at first. It is easy to make mistakes that aren’t always easy to spot. Nevertheless, there are good reasons to stick with R. These include:

  • It’s broadly intuitive with a strong focus on publishable-quality graphics. It’s ‘intelligent’ and offers in-built good practice - it tends to stick to statistical conventions and present data in sensible ways.
  • It’s free, cross-platform, customisable and extendable with a whole swathe of libraries (‘add ons’) including those for discrete choice, multilevel and longitudinal regression, and mapping, spatial statistics, spatial regression and geostatistics.
  • It is well respected and used at the world’s largest technology companies (including Google, Microsoft and Facebook), the largest pharmaceutical companies (including Johnson & Johnson, Merck, and Pfizer), and at hundreds of other companies.
  • It offers a transferable skill that demonstrates experience of both statistics and computing to potential employers.

The intention with this worksheet is to provide a thorough introduction to R. It covers:

  1. The basic programming principles behind R.
  2. Loading in data from CSV files and subsetting it into smaller chunks.
  3. Calculating a number of descriptive statistics for data exploration and checking.
  4. Creating basic and more complex plots in order to visualise the distribution of values within a dataset.

R has a steep learning curve, but the benefits of using it are well worth the effort. Take your time and think through every piece of code you type in. The best way to learn R is to take the basic code provided in tutorials and experiment with changing parameters - such as the colour of points in a graph - to really get “under the hood” of the software. Take lots of notes as you go along and if you are getting really frustrated take a break!

R can be downloaded from https://www.r-project.org/ if it is not on your computer already. Although it is possible to conduct analysis in R directly, you may find it easier to run it via RStudio, which provides a user-friendly graphical interface. After downloading R, RStudio can be obtained for free from https://www.rstudio.com/.

To open R click on the start menu and open RStudio. You should see a screen resembling the image below (if it prompts you to update just ignore it for now).

It is recommended that you enter your commands into the scripting window of RStudio and use this area as your workspace. When you wish to run your commands, either press Ctrl and Enter together on your keyboard for each line, or select the line you wish to run and click Run at the top of the scripting window.

1.1 The Basics

This worksheet is an R Markdown document. The example code appears in boxes like the one below, which show the commands as you should enter them in RStudio.

At its absolute simplest R is a calculator. If you type the addition below in the command line window, it will give you an answer (after every line you need to hit enter to execute the code).

5+10
#> [1] 15
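The other arithmetic operators work in the same way. The lines below are a brief sketch; the particular numbers are arbitrary and are not taken from the worksheet's data.

```r
# Subtraction
10 - 4
#> [1] 6

# Division
10 / 4
#> [1] 2.5

# Raising to a power
2^3
#> [1] 8

# Brackets control the order of operations, as in ordinary arithmetic
(5 + 10) * 2
#> [1] 30
```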

However, it is often easier to assign numbers (or groups of them) a memorable name. These become objects in R and they are a really important concept. For example:

a<-5
b<-10

The <- symbol is used to assign a value to a name. In the example above we assigned the number 5 to the object a. To see what each object contains you can just type print(name of your object).

print(a)
#> [1] 5

Here the bit in the brackets is the object name. Objects can then be treated in the same way as the numbers they contain. For example:

a*b
#> [1] 50

Or even used to create new objects:

ab<- a*b
print(ab)
#> [1] 50
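As a small aside, the sketch below shows two things that are easy to miss: typing an object's name on its own prints it, just as print() does, and R stores a plain number such as 5 as the type "numeric" by default. The class() function and the name total are not used elsewhere in this worksheet; they are included here only for illustration.

```r
# Recreate the objects from above so this block runs on its own
a <- 5
b <- 10

# Typing the name alone prints the object, just like print(a)
a
#> [1] 5

# class() reports how R is storing the object
class(a)
#> [1] "numeric"

# Objects can be combined in any arithmetic expression
total <- a + b   # 'total' is an illustrative name
print(total)
#> [1] 15
```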

You can generate a list of objects that are currently active using the ls() command. R stores objects in your computer’s RAM so they can be processed quickly. Without saving (we will come onto this below) these objects will be lost if you close R (or it crashes).

To show the active R objects type:

ls()
#>  [1] "a"            "ab"           "b"            "Census.Data" 
#>  [5] "gwr.map"      "gwr.model"    "GWRbandwidth" "map.resids"  
#>  [9] "map1"         "map2"         "map3"         "map4"        
#> [13] "model"        "OA.Census"    "Output.Areas" "resids"      
#> [17] "results"

You may wish to delete an object. This can be done using rm() with the name of the object in the brackets. For example:

rm(ab)

Use the ls() command again to confirm that ab is no longer listed.

ls()
#>  [1] "a"            "b"            "Census.Data"  "gwr.map"     
#>  [5] "gwr.model"    "GWRbandwidth" "map.resids"   "map1"        
#>  [9] "map2"         "map3"         "map4"         "model"       
#> [13] "OA.Census"    "Output.Areas" "resids"       "results"
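If you only want to check for one particular object, scanning the full list by eye can be avoided. The following is a minimal sketch using two throwaway objects; the names temp_a and temp_b are made up for this example.

```r
# Create two throwaway objects (the names are illustrative only)
temp_a <- 1
temp_b <- 2

# Check whether temp_a is among the active objects
before <- "temp_a" %in% ls()
print(before)
#> [1] TRUE

# rm() accepts several names at once, separated by commas
rm(temp_a, temp_b)

# Check again after the removal
after <- "temp_a" %in% ls()
print(after)
#> [1] FALSE
```

Note that rm(list = ls()) removes every active object in one go, so it should be used with care.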

The real power of R comes when we begin to execute functions on objects. Until now our objects have been extremely simple single numbers. Now we will build up more complex objects. In the first instance we will use the c() function. “c” means concatenate and essentially groups things together.

DOB<- c(1993,1993,1994,1991)

Type print(DOB) to see the result.

We can now execute some statistical functions on this object:

mean(DOB)
#> [1] 1992.75

median(DOB)
#> [1] 1993

range(DOB)
#> [1] 1991 1994
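A few other base R operations are useful when exploring a vector. The sketch below goes slightly beyond the functions named in this worksheet: it counts the elements, picks out a single element by position, sorts the values, and applies arithmetic to every element at once. The subtraction of 1990 is an arbitrary choice made purely to show the behaviour.

```r
# Same vector as above, repeated so the block runs on its own
DOB <- c(1993, 1993, 1994, 1991)

# Number of elements in the vector
length(DOB)
#> [1] 4

# Square brackets select an element by position (counting from 1)
DOB[1]
#> [1] 1993

# Sort the values from smallest to largest
sort(DOB)
#> [1] 1991 1993 1993 1994

# Arithmetic is applied to each element in turn
DOB - 1990
#> [1] 3 3 4 1
```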

Most functions need one or more arguments to be passed to them in order to work. These arguments are typed within the brackets and typically comprise the name of the object that contains the data (in the examples above it’s DOB) followed by some parameters. The exact parameters required are listed in the function’s help file. To find the help file for a function, type ? followed by the function name, for example: ?mean

All help files have a “Usage” heading detailing the parameters required. In the case of the mean you can see it simply says mean(x, ...). In function help files x usually refers to the object the function requires and, in the case of the mean, the “...” refers to some optional arguments that we don’t need to worry about.
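To give a feel for what such optional arguments do, here is a minimal sketch using na.rm, one of the optional arguments documented in ?mean. The scores vector is invented for this example and contains a missing value (NA).

```r
# An illustrative vector with one missing value (not part of the worksheet data)
scores <- c(2, 4, 6, NA)

# By default a missing value makes the result missing too
mean(scores)
#> [1] NA

# na.rm = TRUE tells mean() to drop missing values before calculating
mean(scores, na.rm = TRUE)
#> [1] 4

# Arguments can also be named explicitly, which makes the code easier to read
mean(x = scores, na.rm = TRUE)
#> [1] 4
```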

When you are new to R the help files can seem pretty impenetrable (because they often are!). Until relatively recently these were all people had to go on, but in recent years R has really taken off and so there are plenty of places to find help and tips. Google is the best tool to use. When people are having problems they tend to post examples of their code online and then the R community will correct it. One of the best ways to solve a problem is to paste the corrected code into your R command line window and then gradually change it for your own data and purposes.

The structure of the DOB object - essentially a group of numbers - is known as a vector object in R. To build more complex objects that, for example, resemble a spreadsheet with multiple columns of data, it is possible to create a class of objects known as a data frame. This is probably the most commonly used class of object in R. We can create one here by combining two vectors.

singers <- c("Zayn", "Liam", "Harry", "Louis")
one_direction <- data.frame(singers, DOB)

If you type print(one_direction) you will see our data frame.

print(one_direction)
#>   singers  DOB
#> 1    Zayn 1993
#> 2    Liam 1993
#> 3   Harry 1994
#> 4   Louis 1991
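Once a data frame exists, its columns and rows can be inspected individually. The following sketch rebuilds the data frame from above and uses a few base R tools that this worksheet has not introduced yet: names(), nrow(), the $ operator for picking out a column, and square brackets for keeping only certain rows. The name born_1993 is made up for this example.

```r
# Rebuild the data frame from above so this block runs on its own
singers <- c("Zayn", "Liam", "Harry", "Louis")
DOB <- c(1993, 1993, 1994, 1991)
one_direction <- data.frame(singers, DOB)

# Column names
names(one_direction)
#> [1] "singers" "DOB"

# Number of rows
nrow(one_direction)
#> [1] 4

# The $ operator extracts a single column as a vector
one_direction$DOB
#> [1] 1993 1993 1994 1991

# Keep only the rows where DOB equals 1993 (note the comma before the bracket closes)
born_1993 <- one_direction[one_direction$DOB == 1993, ]
nrow(born_1993)
#> [1] 2
```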

1.1.1 Tips

  1. R is case sensitive so you need to make sure that you capitalise everything correctly.
  2. The spaces between the words don’t matter but the positions of the commas and brackets do. Remember, if you find the prompt, >, is replaced with a + it is because the command is incomplete. If necessary, hit the escape (esc) key and try again.
  3. It is important to come up with good names for your objects. In the case of the one_direction object I used an underscore to separate the words. It is good practice to keep object names as short as possible, so I could have gone for OneDirection or one.dir. You cannot start an object name with a number, so 1D won’t work.
  4. If you press the up arrow in the command line you will be able to edit the previous lines of code you entered.
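Tips 1 and 3 can be checked directly in R. The sketch below uses two base R functions that are not otherwise covered in this worksheet: exists(), which reports whether a name is currently defined, and make.names(), which shows how R would alter a text string to turn it into a valid object name. The object my_value is hypothetical.

```r
# Create an object with a lower-case name (illustrative only)
my_value <- 5

# The exact name is found
exists("my_value")
#> [1] TRUE

# The same letters with different capitalisation refer to a different,
# undefined name
exists("My_Value")
#> [1] FALSE

# "1D" is not a valid name, so make.names() adds a letter to the front
make.names("1D")
#> [1] "X1D"
```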