Babraham Bioinformatics - Sherman: bisulfite-treated Read FastQ Simulator

Sherman - bisulfite-treated Read FastQ Simulator

Function	A tool to simulate FastQ files for high-throughput sequencing experiments
Language	Perl
Mission Statement	"I'm a sophisticated bisulfite library simulator, sent back in time to change the future for one lucky scientist"
Code Maturity	Stable
Code Released	Yes, under GNU GPL v3 or later.
Initial Contact	Felix Krueger

Sherman can simulate ungapped high-throughput datasets for bisulfite sequencing (BS-Seq) or standard experiments. It allows the user to introduce various 'contaminants' into the sequences, such as basecall errors, SNPs, adapter fragments etc., in order to evaluate the influence of common problems observed in many Next-Gen Sequencing experiments.

Generate any number of sequences of any length
Generate either completely random sequences or use genomic sequences (genome can be specified)
Generate single-end or paired-end data
Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
Generate directional or non-directional libraries
Generate sequences in base-space or SOLiD color-space format
Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
Sequences can have constant Phred qualities throughout the read, or quality scores that follow an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base-space and color-space data)
Introduce a variable number of random SNPs into each read
Introduce a fixed amount of adapter sequence at the 3' end of all sequences
Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
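
As an illustration of the context-specific conversion rates listed above, here is a minimal Python sketch. It is not Sherman's Perl code: the function name `bisulfite_convert` and its parameters (`cg_rate`, `ch_rate`, `next_base`) are hypothetical, it models only the forward strand of uppercase input, and treating an unknown following base as CH context is an assumption made for this example.

```python
import random


def bisulfite_convert(seq, cg_rate=100.0, ch_rate=100.0, next_base="N", rng=None):
    """Convert cytosines to thymines with separate rates for CG and CH context.

    cg_rate and ch_rate are conversion percentages between 0 and 100
    (floats allowed). next_base is the genomic base that follows the read
    (hypothetical parameter); it is used to classify a C at the final
    read position instead of assuming CH context.
    """
    rng = rng or random.Random()
    converted = []
    for i, base in enumerate(seq):
        if base != "C":
            converted.append(base)
            continue
        # Look one base ahead; fall back to the base beyond the read end.
        following = seq[i + 1] if i + 1 < len(seq) else next_base
        # Assumption: anything other than G (including N) counts as CH.
        rate = cg_rate if following == "G" else ch_rate
        # random() lies in [0, 1), so a rate of 0 never converts
        # and a rate of 100 always converts.
        converted.append("T" if rng.random() * 100.0 < rate else "C")
    return "".join(converted)
```

With `cg_rate=0` and `ch_rate=100`, every cytosine followed by a G is retained and all other cytosines are converted, which mimics a fully CG-methylated genome.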

A more detailed Sherman User Manual is available on GitHub.

Changelog

16-10-18: Version 0.1.8 released (see the Release Notes hosted on GitHub)
11-08-14: Version 0.1.7 released
- Fixed a 1-off length issue that might have occurred for some sequences with variable length adapter contamination
09-09-13: Version 0.1.6 released
- Fixed several bugs with the length of the quality string that were inadvertently introduced in previous versions. Sequence and quality strings should now have the same length again, and the genomic coordinates of single-end reads are being shown correctly
24-07-13: Version 0.1.5 released
- Sequences are no longer 1bp longer than specified in '--CR 0' mode
- Quality scores are no longer 1bp longer than the sequences
12-07-13: Version 0.1.4 released
- During context-specific cytosine conversion, Sherman previously assumed for simplicity that a C at the last position of a read was in CH context. This caused a spurious blip in the M-bias plots of simulated data at the end of read 1 and at the start of read 2. Sherman now determines the sequence context of the last position in a read correctly
18-12-12: Version 0.1.3 released
- Changed the third line of base-space FastQ reads to be a "+" only. This saves disk space and prevents crashes in other programs such as Cutadapt
05-09-12: Version 0.1.2 released
- Reads simulated from existing genomes will have their genomic coordinates printed into the read ID line in addition to the read count to keep IDs unique
09-01-12: Version 0.1.1 released
- The bisulfite conversion rate can now be any float number between 0 and 100% (instead of integers only)
- Improved handling of input genomes containing DNA ambiguity characters or \r line endings
- Fixed a bug in the generation of non-directional paired-end libraries. This feature is now working as intended.
15-07-11: Version 0.1 released
- Initial release
- All basic functions working
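
Several changelog entries above concern the FastQ record layout: a bare "+" on the third line (0.1.3) and quality strings that match the sequence length (0.1.5, 0.1.6). The following Python sketch shows what a record satisfying both constraints looks like with Sanger (Phred+33) encoding. The helper names `phred33` and `fastq_record` and the read ID are hypothetical and are not taken from Sherman.

```python
def phred33(qualities):
    """Encode integer Phred scores as a Sanger (Phred+33) quality string."""
    return "".join(chr(q + 33) for q in qualities)


def fastq_record(read_id, seq, qualities):
    """Build a four-line FastQ record with a bare '+' as the third line."""
    # Sequence and quality strings must have the same length.
    if len(seq) != len(qualities):
        raise ValueError("sequence and quality lengths differ")
    return f"@{read_id}\n{seq}\n+\n{phred33(qualities)}\n"
```

For example, `fastq_record("read_1", "ACGT", [40, 40, 40, 40])` yields a record whose quality line is `IIII`, since a Phred score of 40 maps to ASCII code 73.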
