These are the answers to the first part of the package development course. The second part involves actually developing a package so you can’t show that in a notebook.
Write the following function, gc_content(seq)
Takes in a vector of DNA sequences and returns a vector of %GC values (what percentage of the bases are G or C). You should verify that the argument is really a character vector and throw an error if not. You should verify that the bases in the strings only consist of GAT or C and throw a warning if other bases are found. The function should be able to cope with upper or lower case text.
Run the function with data which validates that each of the requirements above is fulfilled.
We’ll load the assertthat library to make our checks easier.
library(assertthat)
library(tidyverse)
gc_content <- function(seq) {
assert_that(is.character(seq))
if (any(str_detect(seq,"[^GATC]"))) {
warning("Non GATC characters found in sequences")
}
seq <- toupper(seq)
str_replace_all(seq,"[^GC]","") -> just_gc
return(100*(nchar(just_gc)/nchar(seq)))
}
This should work
gc_content(c("GAGCTG", "GGTAGAT", "gattaggac","gttgatgat"))
## Warning in gc_content(c("GAGCTG", "GGTAGAT", "gattaggac", "gttgatgat")): Non
## GATC characters found in sequences
## [1] 66.66667 42.85714 44.44444 33.33333
This should fail
gc_content(c(2,5,12))
## Error: seq is not a character vector
This should give a warning
gc_content(c("GAGCTG", "GGNNAGAT", "gattaggac","gttgatgat"))
## Warning in gc_content(c("GAGCTG", "GGNNAGAT", "gattaggac", "gttgatgat")): Non
## GATC characters found in sequences
## [1] 66.66667 37.50000 44.44444 33.33333
Write the following function read_fastq(file)
This function should read a fastq file and put it into a tibble. Fastq files are split in to 4 line records where the lines are:
@
symbol. Sequence ID is anything after that. IDs should be unique+
generally ignored.An example of a fastq file is this:
@1HWUSI-EAS460:44:661VRAAXX:2:1:15253:1153
GCCNGGCTATGCAAGCAGGCTGCAGTGTGGATATAGTCGT
+1HWUSI-EAS460:44:661VRAAXX:2:1:15253:1153
???#;ABAAAHHHHGHFGDHEG@GG@GDGGB>DDDGBDD=
@2HWUSI-EAS460:44:661VRAAXX:2:1:17398:1153
CAGNGAATCCTTGAGGCACCTTCTCTTATAAAAACA
+2HWUSI-EAS460:44:661VRAAXX:2:1:17398:1153
BBB#BFFFEFHHHHHDHHHHHHHHHHHHHHHHHHHH
The function should read a fastq file and put it into a tibble with one row per fastq entry where the columns are:
@
)You should verify that the fastq location provided is a file that you can read and has an extension of .fq. You should check that all of the IDs in the file are unique and that the length of the bases matches the length of the qualities. If any of these is not true then an error should be produced.
read_fastq <- function(file) {
assert_that(is.readable(file))
assert_that(has_extension(file,"fq"))
scan(file, character()) -> file.lines
file.lines[c(T,F,F,F)] -> ids
file.lines[c(F,T,F,F)] -> sequences
file.lines[c(F,F,F,T)] -> qualities
if (!all(startsWith(ids,"@"))) {
stop("Some ID lines didn't start with @")
}
str_sub(ids,2) -> ids
if (!all(nchar(sequences)==nchar(qualities))) {
stop("Some sequences were a different length to the qualities")
}
if (any(duplicated(ids))) {
stop("Some IDs are duplicated")
}
tibble(ID = ids, Bases=sequences, Qualities=qualities, GC=gc_content(sequences)) %>%
return()
}
Now test this with the example files
read_fastq("fastq_examples/good.fq")
## Warning in gc_content(sequences): Non GATC characters found in sequences