This document provides worked answers for all of the exercises in the Introduction to R with Tidyverse course.
We start by doing some simple calculations.
31 * 78
[1] 2418
697 / 41
[1] 17
We next look at how to assign data to named variables and then use those variables in calculations.
We make assignments using arrows and they can point to the right or the left depending on the ordering of our data and variable name.
x <- 39
y <- 22
22 -> y # This is equivalent to the line above
We can then use these in calculations instead of re-entering the data.
x - y
[1] 17
We can also save the results directly into a new variable.
z <- x - y
z
[1] 17
We want to calculate the square root of 2345, and perform a log2 transformation on the result.
This could be done in two steps, by creating an intermediate variable.
sqrt_value <- sqrt(2345)
log2(sqrt_value)
[1] 5.597686
Or we can nest the functions.
log2(sqrt(2345))
[1] 5.597686
We can also store text. The difference with text is that we need to indicate to R that this isn’t something it should try to understand. We do this by putting it into quotes.
If we forget the quotes then R will look for a variable called simon and will produce an error when it can’t find it.
# do not copy this code - it will produce an error!
my_name <- simon
# Error: object 'simon' not found
It works nicely when we include the quotes.
my_name <- "simon"
We can use the nchar() function to find out how many characters my_name has in it.
nchar(my_name)
[1] 5
We can also use the substr() function to get the first letter of my_name.
substr(my_name, start = 1, stop = 1)
[1] "s"
We’re going to manually make some vectors to work with. For the first one there is no pattern to the numbers, so we’re going to make it completely manually with the c() function.
some_numbers <- c(2, 5, 8, 12, 16)
For the second one we’re making an integer series, so we can use the colon notation to enter this more quickly.
number_range <- 5:9
Now we can do some maths using the two vectors.
some_numbers - number_range
[1] -3 -1 1 4 7
Because the two vectors are the same size, the equivalent positions are matched together. Thus the final answer is:
(2-5), (5-6), (8-7), (12-8), (16-9)
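This element-wise pairing only works so neatly because the two vectors have the same length. When the lengths differ, R recycles the shorter vector rather than stopping, which can silently hide mistakes. A quick sketch (the vectors here are made up for illustration):

```r
long <- c(10, 20, 30, 40)
short <- c(1, 2)

# The shorter vector is recycled: (10-1), (20-2), (30-1), (40-2)
long - short
# [1]  9 18 29 38

# If the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning.
```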
We’re going to use some functions to create vectors.
First we’re going to make a numerical sequence with the seq() function. We will use seq() to make a vector of 100 values starting at 2 and increasing by 3 each time, and store it in a new variable called number_series.
number_series <- seq(from = 2, by = 3, length.out = 100)
number_series
[1] 2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53
[19] 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101 104 107
[37] 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152 155 158 161
[55] 164 167 170 173 176 179 182 185 188 191 194 197 200 203 206 209 212 215
[73] 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269
[91] 272 275 278 281 284 287 290 293 296 299
Now we multiply number_series by 1000 and overwrite the original variable.
number_series <- number_series * 1000
number_series
[1] 2000 5000 8000 11000 14000 17000 20000 23000 26000 29000
[11] 32000 35000 38000 41000 44000 47000 50000 53000 56000 59000
[21] 62000 65000 68000 71000 74000 77000 80000 83000 86000 89000
[31] 92000 95000 98000 101000 104000 107000 110000 113000 116000 119000
[41] 122000 125000 128000 131000 134000 137000 140000 143000 146000 149000
[51] 152000 155000 158000 161000 164000 167000 170000 173000 176000 179000
[61] 182000 185000 188000 191000 194000 197000 200000 203000 206000 209000
[71] 212000 215000 218000 221000 224000 227000 230000 233000 236000 239000
[81] 242000 245000 248000 251000 254000 257000 260000 263000 266000 269000
[91] 272000 275000 278000 281000 284000 287000 290000 293000 296000 299000
We will use rep() to make a vector of 100 values containing 25 each of WT, KO1, KO2 and KO3. We can create a vector containing a single WT, KO1, KO2 and KO3 and check that that works first.
c("WT", "KO1", "KO2", "KO3")
[1] "WT" "KO1" "KO2" "KO3"
Then we can include that within the rep() function. We’ll get different results depending on whether we use the times or each option.
rep(c("WT", "KO1", "KO2", "KO3"), times = 25)
[1] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[13] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[25] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[37] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[49] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[61] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[73] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[85] "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3" "WT" "KO1" "KO2" "KO3"
[97] "WT" "KO1" "KO2" "KO3"
rep(c("WT", "KO1", "KO2", "KO3"), each = 25)
[1] "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT"
[13] "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT" "WT"
[25] "WT" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1"
[37] "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1" "KO1"
[49] "KO1" "KO1" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2"
[61] "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2" "KO2"
[73] "KO2" "KO2" "KO2" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3"
[85] "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3" "KO3"
[97] "KO3" "KO3" "KO3" "KO3"
Since R is a language built around data manipulation and statistics, we can use some of the built-in statistical functions.
We can use rnorm to generate a sampled set of values from a normal distribution.
normal_numbers <- rnorm(20)
Note that if you run this multiple times you’ll get slightly different results.
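If you need the sampling to be reproducible (for example, so a report shows the same numbers every time it is rendered), you can fix the random number generator's state with set.seed() before calling rnorm(). The seed value used here (123) is arbitrary:

```r
set.seed(123)
first_draw <- rnorm(20)

set.seed(123)
second_draw <- rnorm(20)

# With the same seed, the two draws are identical
identical(first_draw, second_draw)
# [1] TRUE
```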
We can now use the t.test function to test whether this vector of numbers has a mean which is significantly different from zero.
t.test(normal_numbers)
One Sample t-test
data: normal_numbers
t = -0.21539, df = 19, p-value = 0.8318
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.6123510 0.4980798
sample estimates:
mean of x
-0.05713561
Not surprisingly, it isn’t significantly different.
If we do the same thing again but this time use a distribution with a mean of 1 we should see a difference in the statistical results we get.
t.test(rnorm(20, mean=1))
One Sample t-test
data: rnorm(20, mean = 1)
t = 4.6335, df = 19, p-value = 0.0001812
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.5901259 1.5624972
sample estimates:
mean of x
1.076312
This time the result is significant.
To median centre the values in number_series we simply calculate the median value and then subtract this from each of the values in the vector.
number_series - median(number_series)
[1] -148500 -145500 -142500 -139500 -136500 -133500 -130500 -127500 -124500
[10] -121500 -118500 -115500 -112500 -109500 -106500 -103500 -100500 -97500
[19] -94500 -91500 -88500 -85500 -82500 -79500 -76500 -73500 -70500
[28] -67500 -64500 -61500 -58500 -55500 -52500 -49500 -46500 -43500
[37] -40500 -37500 -34500 -31500 -28500 -25500 -22500 -19500 -16500
[46] -13500 -10500 -7500 -4500 -1500 1500 4500 7500 10500
[55] 13500 16500 19500 22500 25500 28500 31500 34500 37500
[64] 40500 43500 46500 49500 52500 55500 58500 61500 64500
[73] 67500 70500 73500 76500 79500 82500 85500 88500 91500
[82] 94500 97500 100500 103500 106500 109500 112500 115500 118500
[91] 121500 124500 127500 130500 133500 136500 139500 142500 145500
[100] 148500
To calculate the mean and standard deviation of these median centred values, we can pass them to the mean and sd functions.
median_centred <- number_series - median(number_series)
mean(median_centred)
[1] 0
sd(median_centred)
[1] 87034.48
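Note that the standard deviation of the centred values is the same as that of the original number_series: subtracting a constant shifts every value by the same amount, so only the mean moves and the spread is untouched. A quick check (the vectors are recreated here so the snippet stands alone):

```r
# Recreate the vectors from the earlier exercises
number_series <- seq(from = 2, by = 3, length.out = 100) * 1000
median_centred <- number_series - median(number_series)

# Centring changes the mean but not the standard deviation
sd(number_series) == sd(median_centred)
# [1] TRUE
```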
If we are sampling from two distributions with only a 1% difference in their means, how many observations do we need to have before we can detect them as significantly changing?
Let’s try a few different thresholds to see.
samples <- 100
data1 <- rnorm(samples, mean = 10, sd = 2)
data2 <- rnorm(samples, mean = 10.1, sd = 2)
t.test(data1, data2)
Welch Two Sample t-test
data: data1 and data2
t = 0.281, df = 197.65, p-value = 0.779
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5131564 0.6836997
sample estimates:
mean of x mean of y
10.15606 10.07079
samples <- 500
data1 <- rnorm(samples, mean = 10, sd = 2)
data2 <- rnorm(samples, mean = 10.1, sd = 2)
t.test(data1, data2)
Welch Two Sample t-test
data: data1 and data2
t = -0.72221, df = 993.57, p-value = 0.4703
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3476367 0.1605922
sample estimates:
mean of x mean of y
10.04706 10.14058
samples <- 1000
data1 <- rnorm(samples, mean = 10, sd = 2)
data2 <- rnorm(samples, mean = 10.1, sd = 2)
t.test(data1, data2)
Welch Two Sample t-test
data: data1 and data2
t = -2.5475, df = 1997.2, p-value = 0.01092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.39747003 -0.05169393
sample estimates:
mean of x mean of y
9.971158 10.195740
samples <- 5000
data1 <- rnorm(samples, mean = 10, sd = 2)
data2 <- rnorm(samples, mean = 10.1, sd = 2)
t.test(data1, data2)
Welch Two Sample t-test
data: data1 and data2
t = -4.3498, df = 9997.9, p-value = 1.376e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.25087714 -0.09500668
sample estimates:
mean of x mean of y
9.962487 10.135429
It’s only really when we get up close to 5000 samples that we can reliably detect such a small difference. The answers will be different every time since rnorm involves a random component.
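Rather than searching for the threshold by simulation, base R's power.t.test() function can estimate the required sample size analytically. This is a sketch using the same parameters as the simulations above; the 80% power target is a common but arbitrary choice, not something fixed by the exercise:

```r
# Sample size per group needed to detect a 0.1 difference in means
# (sd = 2) at the 0.05 significance level with 80% power
sample_size <- power.t.test(delta = 0.1, sd = 2, sig.level = 0.05, power = 0.8)
sample_size$n
```

This comes out at a little over 6000 observations per group, which is consistent with the simulations above only becoming reliable once the sample size reaches the thousands.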
We’re going to read some data from a file straight into R. To do this we’re going to use the tidyverse read_ functions. We therefore need to load tidyverse into our script.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We’ll start off by reading in a small file.
small <- read_tsv("small_file.txt")
Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Sample, Category
dbl (1): Length
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
small
Note that the only relevant name from now on is small, which is the name we saved the data under. The original file name is irrelevant after the data is loaded.
We can see that Sample and Category have the ‘character’ type because they are text. Length has the ‘double’ type because it is a number.
We want to find the median of the log2 transformed lengths.
To start with we need to extract the length column using the $ notation.
small$Length
[1] 45 82 81 56 96 85 65 96 60 62 80 63 50 64 43 98 78 53 100
[20] 79 84 68 99 65 55 98 56 83 81 69 50 72 54 56 87 84 80 68
[39] 95 93
Now we can log2 transform this.
log2(small$Length)
[1] 5.491853 6.357552 6.339850 5.807355 6.584963 6.409391 6.022368 6.584963
[9] 5.906891 5.954196 6.321928 5.977280 5.643856 6.000000 5.426265 6.614710
[17] 6.285402 5.727920 6.643856 6.303781 6.392317 6.087463 6.629357 6.022368
[25] 5.781360 6.614710 5.807355 6.375039 6.339850 6.108524 5.643856 6.169925
[33] 5.754888 5.807355 6.442943 6.392317 6.321928 6.087463 6.569856 6.539159
Finally we can find the median of this.
median(log2(small$Length))
[1] 6.227664
The second file we’re going to read is a CSV file of variant data. We can still use read_delim to read it in.
child <- read_delim("Child_Variants.csv")
Rows: 25822 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): CHR, dbSNP, REF, ALT, GENE, ENST
dbl (5): POS, QUAL, MutantReads, COVERAGE, MutantReadPercent
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
child
We can see that each row is a set of different parameters for a genetic variant.
We want to calculate the mean and standard deviation (sd) of the MutantReadPercent column.
We can again use the $ notation to extract the relevant column into a vector. We can use head to just look at the first few values.
head(child$MutantReadPercent, n = 20)
[1] 90 42 28 46 100 100 75 88 100 100 90 92 97 95 80 89 89 81 33
[20] 90
We can then pass this to the relevant functions to calculate the mean and sd.
mean(child$MutantReadPercent)
[1] 59.88219
sd(child$MutantReadPercent)
[1] 24.11675
In this section we are going to use two tidyverse functions to cut down our data. We will use select to pick only certain columns of information to keep, and filter to apply functional selections to the rows.
Extract all of the data for Category A.
filter(small, Category == "A")
Filter for all samples with a length of more than 80.
filter(small, Length > 80)
Remove the Sample column. We do a negative column selection by using a minus before the name.
select(small, -Sample)
We could have instead selected the columns we wanted to keep.
select(small, Length, Category)
Select data from the mitochondrion (called MT in this data).
filter(child, CHR == "MT")
Select variants with a MutantReadPercent of greater than (or equal to) 70.
filter(child, MutantReadPercent >= 70)
Select variants with a quality of 200.
filter(child, QUAL == 200)
Select all variants of the IGFN1 gene.
filter(child, GENE == "IGFN1")
Remove the ENST and dbSNP columns.
select(child, -ENST, -dbSNP)
Here we are going to do some more complicated multi-step filtering. For this we will be using the %>% or |> pipe operator to link together different steps in the selection and filtering.
We want variants with a quality of 200, coverage of more than 50 and MutantReadPercent of more than 70.
child %>%
  filter(QUAL == 200) %>%
  filter(COVERAGE > 50) %>%
  filter(MutantReadPercent > 70)
Same as above but using the base R pipe.
child |>
  filter(QUAL == 200) |>
  filter(COVERAGE > 50) |>
  filter(MutantReadPercent > 70)
We want to remove all variants on the X, Y and MT chromosomes. At the moment we will have to do this in three separate filtering steps, but there is actually a quicker way to do it in one step if we get a bit more advanced.
child |>
  filter(CHR != "X") |>
  filter(CHR != "Y") |>
  filter(CHR != "MT")
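The quicker one-step approach alluded to above uses the %in% operator, which tests each value against a whole set at once; negating the result keeps everything outside the set. A sketch on a plain vector (made up for illustration); inside filter() the equivalent condition would be !(CHR %in% c("X", "Y", "MT")):

```r
chromosomes <- c("1", "2", "X", "7", "MT", "Y", "12")

# TRUE wherever the value appears anywhere in the set
chromosomes %in% c("X", "Y", "MT")
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE

# Negate the test to keep everything else
chromosomes[!(chromosomes %in% c("X", "Y", "MT"))]
# [1] "1"  "2"  "7"  "12"
```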
We want to see the chromosome and position only for variants with a valid dbSNP id. We can get the valid dbSNP variants by excluding the variants with a dot as their ID.
child |>
  filter(dbSNP != ".") |>
  select(CHR, POS)
Here we’re going to do a more complicated filtering where we will transform the data within the filter statement. In this example we’re going to find deletions. In a deletion the REF column will have more than one character in it, so that’s what we’re going to select for.
We can use the nchar() function as before, but this time we’re going to use it in a filter condition.
child |>
  filter(nchar(REF) > 1)
As a second example we’re going to use another transformation along with a conventional selection to find genes whose name starts with a Q.
We will use the substr() function to find the genes which start with Q.
child |>
  filter(substr(GENE, 1, 1) == "Q")
To break this down:
gene_names <- c("AK5", "NEXN", "QSER1", "GIPC2", "QPRT")
substr(gene_names, 1, 1)
[1] "A" "N" "Q" "G" "Q"
substr(gene_names, 1, 1) == "Q"
[1] FALSE FALSE TRUE FALSE TRUE
This output is a logical vector. The tests we write inside filter() produce exactly this kind of logical vector, and it determines which rows of the dataset are kept.
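The same logical vector can also be used directly with square brackets to subset a plain vector, which is essentially what filter() is doing with the rows of a tibble (gene_names is defined again here so the snippet stands alone):

```r
gene_names <- c("AK5", "NEXN", "QSER1", "GIPC2", "QPRT")

# Keep only the elements where the logical test is TRUE
gene_names[substr(gene_names, 1, 1) == "Q"]
# [1] "QSER1" "QPRT"
```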
Here we’re going to draw our first ggplot plot. We’re going to use a new dataset so we need to load that first.
brain <- read_delim("brain_bodyweight.txt")
brain
We’ll start out by doing a simple scatterplot (using geom_point()) of the brain and bodyweight values.
brain |>
  ggplot(aes(x = log.brain, y = log.body)) +
  geom_point()
We can change the colour and size of the points by adding some options to the geom_point() call. These are fixed values rather than aesthetic mappings so we don’t need to put them in a call to aes().
brain %>%
  ggplot(aes(x = log.brain, y = log.body)) +
  geom_point(colour = "blue", size = 3)
Before we modified the plot by adding in a fixed colour parameter. Here we’re going to add a third aesthetic mapping where we use the colour of the points to represent the Category that the animal falls into.
brain %>%
  ggplot(aes(x = log.brain, y = log.body, colour = Category)) +
  geom_point(size = 3)
Now we can see the three groups coloured differently, and a legend explaining the colours appears on the right.
We can remove the extinct species from the plot by filtering the dataset before we pipe it into ggplot.
brain |>
  filter(Category != "Extinct") |>
  ggplot(aes(x = log.brain, y = log.body, colour = Category)) +
  geom_point(size = 3)
We’re going to get into more complex plots here, with more configuration of the static and mapped aesthetics, and the inclusion of more geometry layers.
We want to see the lengths of all of the samples in category A in the small data set.
small %>%
  filter(Category == "A") %>%
  ggplot(aes(x = Sample, y = Length)) +
  geom_col()
We’re going to plot out the distribution of the MutantReadPercent values from child in a couple of different ways. In each case the plot itself is going to summarise the data for us.
We’ll start with a histogram.
child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_histogram()
We can pass options to change the colouring of the bars and the resolution of the histogram.
child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_histogram(binwidth = 5, fill = "seagreen", colour = "black")
We can also use geom_density() to give a continuous readout of the frequency of values rather than using the binning approach of geom_histogram().
child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_density(fill = "seagreen", colour = "black")
We can also construct a barplot which will summarise our data for us. In the previous barplot example we explicitly set the heights of the bars using an aesthetic mapping. Here we’re going to just pass a set of values to barplot and get it to count them and plot out the frequency for each unique value.
We’re going to do this for the REF bases, but only where we have single letter REFs.
child %>%
  filter(nchar(REF) == 1) %>%
  ggplot(aes(REF)) +
  geom_bar()
If we wanted to make this a bit more colourful we could colour the bars by the REF (so they’ll also be different colours). This would normally create a legend to explain the colours, but because the colours just repeat the axis labels we can use the show.legend parameter to suppress it.
child %>%
  filter(nchar(REF) == 1) %>%
  ggplot(aes(REF, fill = REF)) +
  geom_bar(show.legend = FALSE)
In a previous exercise we drew a plot of the log brainweight vs the log bodyweight. In the scatterplot we drew we couldn’t actually see which animal was represented by which dot, so we’re going to fix that now.
We can add the names to the animals by adding a geom_text layer to the plot. This will use the same x and y aesthetics as the points (so the labels go in the same place), but it will define a new aesthetic mapping of label with the names of the species.
We can add the new aesthetic mapping either in the original ggplot call, or in the geom_text call.
brain %>%
  ggplot(aes(x = log.brain, y = log.body, color = Category, label = Species)) +
  geom_point() +
  geom_text()
There are a few improvements we could make:
- The labels are coloured by Category, the same as the points, which makes them hard to read. We can set a fixed colour in geom_text() to fix this.
- The labels are too large. We can pass a size parameter to geom_text() for this.
- The labels sit directly on top of the points. We can nudge them to one side with hjust.
- We can add a title and more informative axis labels with ggtitle(), xlab() and ylab().
brain |>
  ggplot(aes(x = log.brain, y = log.body, color = Category, label = Species)) +
  geom_point(size = 3) +
  geom_text(colour = "black", size = 2, hjust = 1.2) +
  ggtitle("Brainweight vs Bodyweight") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)")
It’s still not perfect. There are other options but we’ll stick with this for now.
We can see what appears to be a relationship between the weight of the brain and the body, but is it real? We can do a correlation using the cor.test function to check.
The function takes two vectors as arguments, so we need to extract the two relevant columns out of the brain data and pass them to the function.
cor.test(brain$log.body, brain$log.brain)
Pearson's product-moment correlation
data: brain$log.body and brain$log.brain
t = 6.0666, df = 25, p-value = 2.44e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5541838 0.8905446
sample estimates:
cor
0.7716832
We can see that the association is significant with p=2.44e-06. We can add this information to the graph.
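Rather than typing the p-value into a title by hand, it can be extracted from the object that cor.test() returns, which stores it in a p.value component. A minimal sketch on simulated data (the vectors here are made up for illustration, since they correlate by construction):

```r
set.seed(42)             # fixed seed so the result is reproducible
x <- rnorm(30)
y <- x + rnorm(30)       # y correlates with x by construction

result <- cor.test(x, y)

# The p-value lives inside the returned object
result$p.value
paste0("Correlation p=", signif(result$p.value, 3))
```

The same idea works with the real data: cor.test(brain$log.body, brain$log.brain)$p.value gives the number that gets pasted into the plot title.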
We can also calculate a line of best fit for these two variables. We will do this with the lm() function, which builds a linear model.
Building models is beyond the scope of this course, but it differs from cor.test() in that you pass the whole data tibble and then say which parameters you want to predict which other parameter.
The model we will build for the brain dataset is log.body ~ log.brain, which will predict log bodyweight from log brainweight.
lm(data = brain, formula = log.body ~ log.brain)
Call:
lm(formula = log.body ~ log.brain, data = brain)
Coefficients:
(Intercept) log.brain
-2.283 1.215
This tells us that the model line which it constructs has a slope of 1.215 and an intercept of -2.283. We can use the geom_abline() geometry to add this as a new layer to the existing plot.
brain %>%
  ggplot(aes(x = log.brain, y = log.body, color = Category, label = Species)) +
  geom_point(size = 3) +
  geom_text(color = "black", size = 3, hjust = 1.2) +
  ggtitle("Brainweight vs Bodyweight (corr p=2.44e-06)") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)") +
  geom_abline(slope = 1.215, intercept = -2.283)
We can see that the extinct species have their own little outgroup away from everything else. They have much smaller brains than their bodyweight would predict.
brain |>
  ggplot(aes(x = log.brain, y = log.body, color = Category, label = Species)) +
  geom_point(size = 3) +
  geom_text(colour = "black", size = 2, hjust = 1.2) +
  ggtitle("Brainweight vs Bodyweight (corr p=2.44e-06)") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)")