2 An Introduction to Data Manipulation in R
This practical is the first in a series of exercises which will introduce you to a range of techniques for handling, analysing and visualising spatial data in R. All of the data and software used are freely available as open data and open software.
“Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.” http://opendatahandbook.org/
Prior to working in R, we will first take you through obtaining the data from the Consumer Data Research Centre’s (CDRC) data portal which stores consumer-related data from a large number of sources. Consumer-related data are data generated by retailers and other service organisations as part of their business process. They can be used to monitor the needs, preferences and behaviours of customers. Census data in particular is useful to consumer insight, all leading retailers rely on accurate spatial data on population in order to guide the planning of their store locations and how their stock is distributed and marketed across the country.
The CDRC offers an open data service in which any member can register online and freely download the data which has been made available. The service can be accessed by visiting https://data.cdrc.ac.uk/
In this practical we will:
- Download a Census datapack from the CDRC Data website
- Load the data into R using RStudio
- View the raw data in R
- Subset data in R
- Merge data in R
2.1 Downloading data from the CDRC data website
On an internet browser go to https://data.cdrc.ac.uk/
In the top right of the screen you will see options to log in or register for an account. If you have not yet registered please sign up to an account now.
We are going to be interested in Census data for the borough of Camden. There are several routes to this dataset which include a search bar at the top of each page. However, we shall browse the website.
CDRC Data Statistics:
On the tab panel at the top of the page, click on Topics.
The data available on the CDRC data website can be broadly grouped into 11 key themes. Most of these pertain to the population and human activities. All of these themes are important to a wide range of industries, notably including retail. Census data can be found by clicking on the Demographics topic.
However, within this option there is still a very large number of datasets as the CDRC stores data on every district within the UK. In the search bar at the top of the page, enter ‘Camden’.
You now want to scroll down to CDRC 2011 Census Data Packs for Local Authority District: Camden (E09000007). It can also be found by clicking on the Census tab on the side of the screen.
The following page describes the content of the data and disclosure controls. The data is provided as a zipped folder of several different tables of Census data which encompass a wide variety of variables on the population. In addition, data is also provided at three different geographic scales of data units - Output Area, Lower Super Output Area and Middle Super Output Area.
To download the file you must click on Camden.zip under Data and Resources. If you have not already done so, here you will be asked to freely register as a new user.
It is recommended that you move the downloaded Camden.zip file to somewhere appropriate in your directory and then unzip the folder. To unzip on a windows computer simply right-click on the zipped file in windows explorer and click on “Extract All”.
The folder includes lots of useful data. The tables subfolder is where the Census data is stored in its various forms. However, each table is given a name which is not informative to us, so the datasets_description file is provided so we can lookup their names. In addition to this, a variables_description is provided so we can look up the variable name codes within each data table. GIS shapefiles hare also available from the shapefiles subfolder, these will be useful for mapping the data.
The Census data includes a large number of different datasets, all stored as Comma Separated Values (.csv) files. CSVs are a simple means of storing data so that it can be easily read and written on a computer. They are simply plain text documents where commas are used to separate the fields of data (you can observe this if you open a CSV in notepad).
For the forthcoming practicals we will be considering three variables, each from a different census dataset. These will be:
ColumnVariableCode | ColumnVariableDescription | DatasetId | DatasetTitle |
---|---|---|---|
KS201EW0020 | White: English/Welsh/Scottish/Northern Irish/British | KS201EW | Ethnic group |
KS403EW0012 | Occupancy rating (bedrooms) of -1 or less | KS403EW | Rooms, bedrooms and central heating |
KS501EW0019 | Highest level of qualification: Level 4 qualifications and above | KS501EW | Qualifications and students |
KS601EW0016 | Economically active: Employee: Full-time | KS601EW | Economic activity |
Level 4 qualifications refer to a Certificate of Higher Education Higher National Certificate (awarded by a degree-awarding body).
2.2 Loading data and data formatting in R
Our first step is to set the working directory. This is so R knows where to open and save files to. It is recommended that you set the working directory to an appropriate space in your computers work space. In this example it is the same folder as where the Census data pack has been stored.
To set the working directory, go to the Files table in the Files and Plots window in RStudio. If you click on this tab you can then navigate to the folder you wish to use. You can then click on the More button and then Set as Working Directory. You should then see some code similar to the below appear in the command line.
Alternatively, you can type in the address of the working directory manually using the setwd()
function as demonstrated below. This requires you to type in the full address of where your data are stored.
#Set the working directory. The bit between the "" needs to specify the path to the folder you wish to use (you will see my file path below).
setwd("../learningR/") # Note the single / (\\ will also work).
Our next steps are to load the data into R.
2.3 Loading data into R
One of R’s great strengths is its ability to load in data from almost any file format. Comma Separated Value (CSV) files are often a preferred choice for data due to their small file sizes and simplicity. We are going to open three different datasets from the Census database. Their codes and dataset names are written in the table in the previous page. We will be downloading Output Area level data, so only files with “oa11” included in their filenames.
We can read CSVs into R using the read.csv()
function as demonstrated below. This requires us to identify the file location within our workspace, and also assign an object name for our data in R.
# read.csv() loads a csv, remember to correctly input the file location within your working directory
ethnicity <- read.csv("worksheet_data/camden/KS201EW_oa11.csv")
rooms <- read.csv("worksheet_data/camden/KS403EW_oa11.csv")
qualifications <-read.csv("worksheet_data/camden/KS501EW_oa11.csv")
employment <-read.csv("worksheet_data/camden/KS601EW_oa11.csv")
2.4 Viewing data
With the data now loaded into RStudio, they can be observed in the objects window. Alternatively, you can open them with the View function as demonstrated below.
# to view the top 1000 cases of a data frame
View(employment)
All functions need a series of arguments to be passed to them in order to work. These arguments are typed within the brackets and typically comprise the name of the object (in the examples above its the DOB) that contains the data followed by some parameters. The exact parameters required are listed in the functions’ help files. To find the help file for the function type ?
followed by the function name, for example - ?View
There are two problems with the data. Firstly, the column headers are still codes and are therefore uninformative. Secondly, the data is split between three different data objects.
First let’s reduce the data. The Key Statistics tables in the CDRC Census data packs contain both counts and percentages. We will be working with the percentages as the populations of Output Areas are not identical across our sample.
2.5 Observing column names
To observe the column names for each dataset we can use a simple names()
function. It is also possible to work out their order in the columns from observing the results of this function.
# view column names of a dataframe
names(employment)
#> [1] "GeographyCode" "KS601EW0001" "KS601EW0002" "KS601EW0003"
#> [5] "KS601EW0004" "KS601EW0005" "KS601EW0006" "KS601EW0007"
#> [9] "KS601EW0008" "KS601EW0009" "KS601EW0010" "KS601EW0011"
#> [13] "KS601EW0012" "KS601EW0013" "KS601EW0014" "KS601EW0015"
#> [17] "KS601EW0016" "KS601EW0017" "KS601EW0018" "KS601EW0019"
#> [21] "KS601EW0020" "KS601EW0021" "KS601EW0022" "KS601EW0023"
#> [25] "KS601EW0024" "KS601EW0025" "KS601EW0026" "KS601EW0027"
#> [29] "KS601EW0028" "KS601EW0029"
From using the variables_description csv from our datapack, we know the Economically active: Unemployed percentage variable is recorded as KS601EW0017. This is the 19th column in the employment dataset.
2.6 Selecting columns
Next we will create new data objects which only include the columns we require. The new data objects will be given the same name as the original data, therefore overriding the bigger file in R. Using the variable_description csv to lookp the codes, we have isolated only the columns we are interested in. Remember we are downloading percentages, not raw counts.
# selecting specific columns only
# note this action overwrites the labels you made for the original data,
# so if you make a mistake you will need to reload the data into R
ethnicity <- ethnicity[, c(1, 21)]
rooms <- rooms[, c(1, 13)]
employment <- employment[, c(1, 20)]
qualifications <- qualifications[, c(1, 20)]
2.7 Renaming column headers
Next we want to change the names of the codes to ease our interpretation. We can do this using names()
.
If we wanted to change an individual column name we could follow the approach detailed below. In this example, we tell R that we are interested in setting the name Unemployed to the 2nd column header in the data.
# to change an individual column name
names(employment)[2] <- "Unemployed"
However, we want to name both column headers in all of our data. To do this we can enter the following code. The c()
function allows us to concatenate multiple values within one command.
# to change both column names
names(ethnicity)<- c("OA", "White_British")
names(rooms)<- c("OA", "Low_Occupancy")
names(employment)<- c("OA", "Unemployed")
names(qualifications)<- c("OA", "Qualification")
2.8 Joining data in R
We next want to combine the data into a single dataset. Joining two data frames together requires a common field, or column, between them. In this case it is the OA field. In this field each OA has a unique ID (or OA name), this IDs can be used to identify each OA between each of the datasets.
In R the merge()
function joins two datasets together and creates a new object. As we are seeking to join four datasets we need to undertake multiple steps as follows.
#1 Merge Ethnicity and Rooms to create a new object called "merged_data_1"
merged_data_1 <- merge(ethnicity, rooms, by="OA")
#2 Merge the "merged_data_1" object with Employment to create a new merged data object
merged_data_2 <- merge(merged_data_1, employment, by="OA")
#3 Merge the "merged_data_2" object with Qualifications to create a new data object
census_data <- merge(merged_data_2, qualifications, by="OA")
#4 Remove the "merged_data" objects as we won't need them anymore
rm(merged_data_1, merged_data_2)
Our newly formed census_data object contains all four variables.
2.9 Exporting Data
You can now save this file to your workspace folder. Remember R is case sensitive so take note of when object names are capitalised.
# Writes the data to a csv named "practical_data" in your file directory
write.csv(census_data, "worksheet_data/camden/practical_data.csv", row.names=F)