Sherman - bisulfite-treated Read FastQ Simulator
Function | A tool to simulate FastQ files for high-throughput sequencing experiments |
---|---|
Language | Perl |
Mission Statement | "I'm a sophisticated bisulfite library simulator, sent back in time to change the future for one lucky scientist" |
Code Maturity | Stable |
Code Released | Yes, under GNU GPL v3 or later. |
Initial Contact | Felix Krueger |
Sherman can simulate ungapped high-throughput datasets for bisulfite sequencing (BS-Seq) or standard experiments. It allows the user to introduce various 'contaminants' into the sequences, such as basecall errors, SNPs, adapter fragments etc., in order to evaluate the influence of common problems observed in many Next-Gen Sequencing experiments.
- Generate any number of sequences of any length
- Generate either completely random sequences or use genomic sequences (genome can be specified)
- Generate single-end or paired-end data
- Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
- Generate directional or non-directional libraries
- Generate sequences in base-space or SOLiD color-space format
- Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
- Sequences can have constant Phred qualities throughout the read or quality scores that follow an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base-space and color-space data)
- Introduce a variable number of random SNPs into each read
- Introduce a fixed amount of adapter sequence at the 3' end of all sequences
- Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
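The context-specific conversion listed above can be illustrated with a short Python sketch. This is not Sherman's own code (Sherman is written in Perl); the function name `bisulfite_convert` and the `next_base` parameter are hypothetical, and the sketch assumes an upper-case read from the forward strand only. Each C is converted to T with a probability given by the rate for its context, where a C followed by G counts as CG and any other C counts as CH.

```python
import random


def bisulfite_convert(read, cg_rate, ch_rate, next_base="N", rng=None):
    """Convert C to T at context-specific rates (illustrative sketch only).

    cg_rate / ch_rate: conversion rates in percent (0-100) for cytosines
    in CG and CH context.
    next_base: hypothetical parameter holding the genomic base that follows
    the read, so that a C at the last position can be classified.
    """
    rng = rng or random.Random()
    converted = []
    for i, base in enumerate(read):
        if base != "C":
            converted.append(base)
            continue
        # Look one base downstream; fall back to next_base at the read end.
        following = read[i + 1] if i + 1 < len(read) else next_base
        rate = cg_rate if following == "G" else ch_rate
        # rng.random() is in [0, 1), so 0 never converts and 100 always does.
        converted.append("T" if rng.random() * 100 < rate else "C")
    return "".join(converted)
```

With `cg_rate=0` and `ch_rate=100`, `bisulfite_convert("ACGTCCA", 0, 100)` returns `"ACGTTTA"`: the C in CG context is kept and both CH cytosines are converted.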
A more detailed Sherman User Manual is available on GitHub.
Changelog
- 16-10-18: Version 0.1.8 released (the Release Notes are hosted on GitHub)
- 11-08-14: Version 0.1.7 released
  - Fixed an off-by-one length issue that could occur for some sequences with variable-length adapter contamination
- 09-09-13: Version 0.1.6 released
  - Fixed several bugs with the length of the quality string that were inadvertently introduced in previous versions. Sequence and quality strings should now have the same length again, and the genomic coordinates of single-end reads are now shown correctly
- 24-07-13: Version 0.1.5 released
  - Sequences are no longer 1 bp longer than specified in '--CR 0' mode
  - Quality scores are no longer 1 bp longer than the sequences
- 12-07-13: Version 0.1.4 released
  - During context-specific cytosine conversion, Sherman previously assumed for simplicity that a C at the last position of a read was in CH context. This caused an artefactual blip in the M-bias plots of simulated data at the end of read 1 and at the start of read 2. Sherman now determines the sequence context of the last position in a read correctly
- 18-12-12: Version 0.1.3 released
  - Changed the third line of base-space FastQ reads to be a "+" only. This saves disk space and prevents other programs, such as Cutadapt, from crashing
- 05-09-12: Version 0.1.2 released
  - Reads simulated from existing genomes will have their genomic coordinates printed into the read ID line in addition to the read count to keep IDs unique
- 09-01-12: Version 0.1.1 released
  - The bisulfite conversion rate can now be any floating-point number between 0 and 100% (instead of integers only)
  - Improved handling of input genomes containing DNA ambiguity characters or \r line endings
  - Fixed a bug in the generation of non-directional paired-end libraries. This feature now works as intended.
- 15-07-11: Version 0.1 released
  - Initial release
  - All basic functions working
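Several of the changelog entries above concern the FastQ record layout: a third line consisting of "+" only, and sequence and quality strings of equal length. The following Python sketch shows a record of that shape with Phred+33 qualities following an exponential decay. It is an illustration under stated assumptions, not Sherman's implementation: the decay formula, the default values and all function names are hypothetical.

```python
import math
import random


def decaying_qualities(length, start_q=40, decay=0.02, min_q=2):
    """Phred scores that decay exponentially along the read (assumed formula)."""
    return [max(min_q, round(start_q * math.exp(-decay * i)))
            for i in range(length)]


def phred33(quals):
    """Encode Phred scores in Sanger (Phred+33) format."""
    return "".join(chr(q + 33) for q in quals)


def add_basecall_errors(seq, quals, rng):
    """Replace a base with probability 10^(-Q/10), the Phred error probability."""
    out = []
    for base, q in zip(seq, quals):
        if rng.random() < 10 ** (-q / 10):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def fastq_record(read_id, seq, quals):
    """Format one four-line FastQ record with a bare '+' as the third line."""
    qual_str = phred33(quals)
    if len(seq) != len(qual_str):
        raise ValueError("sequence and quality strings differ in length")
    return f"@{read_id}\n{seq}\n+\n{qual_str}\n"
```

For example, `fastq_record("r1", "ACGT", [40] * 4)` yields the four lines `@r1`, `ACGT`, `+` and `IIII`. As the quality drops towards the 3' end, `add_basecall_errors` becomes increasingly likely to substitute a base, which mirrors the behaviour described in the feature list.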