QC and evaluation of long read data
Overview
Teaching: 30 min
Exercises: 60-90 minQuestions
How to determine basic statistics on read sets
What do these statistics mean: are they good or bad?
What are read technology specific errors?
Objectives
Learn how to get and interpret read set statistics
Evaluate read technology specific errors based on read mappings
Introduction
Quality control of input data is an important task in any data analysis and is very much true for de novo assemblies. Biases, high error reads and other unexpected properties will have an impact on the resulting contigs by producing fragmented assemblies and/or chimeric contigs. In this session we will have a closer look at our input data before we continue with the assembly process.
Read statistics
For basic the quality control of the read sets we use assembly-stats:
gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats
usage: stats [options] <list of fasta/q files>
Reports sequence length statistics from fasta and/or fastq files
options:
-l <int>
Minimum length cutoff for each sequence.
Sequences shorter than the cutoff will be ignored [1]
-s
Print 'grep friendly' output
-t
Print tab-delimited output
-u
Print tab-delimited output with no header line
-v
Print version and exit
This tool will report to the terminal information like average sequence length and N50.
Read statistics
The PacBio reads are in this file: data/raw_data/pacbio_reads.fasta. Get the read statistics on this data set and try to interpret the results: what does each of the entries mean?
Solution
gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats data/raw_data/pacbio_reads.fasta stats for data/raw_data/pacbio_reads.fasta sum = 48280339, n = 3862, ave = 12501.38, largest = 70567 N50 = 19571, n = 840 N60 = 16667, n = 1107 N70 = 13410, n = 1427 N80 = 10361, n = 1833 N90 = 6819, n = 2399 N100 = 74, n = 3862 N_count = 0 Gaps = 0
The Nanopore reads are in this file: data/raw_data/nanopore_reads.fastq. Also check the statistics of the Nanopore data.
Solution
gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats data/raw_data/nanopore_reads.fastq stats for data/raw_data/nanopore_reads.fastq sum = 20742020, n = 1449, ave = 14314.71, largest = 73689 N50 = 22393, n = 324 N60 = 19386, n = 424 N70 = 15673, n = 544 N80 = 12226, n = 694 N90 = 7987, n = 900 N100 = 384, n = 1449 N_count = 0 Gaps = 0
And finally the Illumina reads: data/raw_data/illumina_R1.fastq and data/raw_data/illumina_R2.fastq
Solution
./tools/assembly-stats-master/build/assembly-stats data/raw_data/illumina_R1.fastq data/raw_data/illumina_R2.fastq stats for data/raw_data/illumina_R1.fastq sum = 21846250, n = 87385, ave = 250.00, largest = 250 N50 = 250, n = 43693 N60 = 250, n = 52431 N70 = 250, n = 61170 N80 = 250, n = 69908 N90 = 250, n = 78647 N100 = 250, n = 87385 N_count = 0 Gaps = 0 ------------------------------------------------------------------------------- stats for data/raw_data/illumina_R2.fastq sum = 21846250, n = 87385, ave = 250.00, largest = 250 N50 = 250, n = 43693 N60 = 250, n = 52431 N70 = 250, n = 61170 N80 = 250, n = 69908 N90 = 250, n = 78647 N100 = 250, n = 87385 N_count = 0 Gaps = 0
Discuss the difference between the three platforms.
Aligning the reads to the reference genome
A further visual inspection of the data can be done by mapping the reads to the reference genome. For each platform many tools exist to help you with this. We will focus on the newest and/or most commonly used tools. The alignments will give you ideas on SNP rates, biases in the data, heterozygosity and error rates.
The reference genome
The reads are from A. Thaliana cultivar, which is a well-known and often sequenced plant. We can expect that the reads have high similarity when compare to this genome. However, some issues have to be taken into account when mapping reads to a genome. Discuss what effect will be of the following issues when looking at the mapped reads:
- The genome of the organism is not known. You will have to map the reads to the genome of a relative.
- Your reference genome is highly repetitive.
- Your reference genome is not nearly complete.
- You have a polyploid organism.
- You are faced with issues 2-5
For mapping the reads we will use minimap2:
./tools/minimap2-2.11_x64-linux/minimap2
And a 1MB region of our cultivar (assembled earlier):
./data/references/reference1MB.fasta
Mapping reads
Use minimap2 to map the three read sets to our region of 1MB. You can use the Illumina and PacBio subsampled data to speed things up:
./data/raw_data/illumina_R1.subsample.fastq ./data/raw_data/illumina_R2.subsample.fastq
./data/pacbio_reads.subsample.fasta
The nanopore data has not been subsampled. Run minimap2 without arguments to see which options you can use. For subsequent steps you need SAM output.
Solution
mkdir results ./tools/minimap2-2.11_x64-linux/minimap2 -a -x sr ./data/references/reference1MB.fasta ./data/raw_data/illumina_R1.subsample.fastq ./data/raw_data/illumina_R2.subsample.fastq > ./results/illumina.sam ./tools/minimap2-2.11_x64-linux/minimap2 -x map-pb -a ./data/references/reference1MB.fasta ./data/pacbio_reads.subsample.fasta > results/pacbio_reads.subsample.sam ./tools/minimap2-2.11_x64-linux/minimap2 -a -x map-ont ./data/references/reference1MB.fasta ./data/raw_data/nanopore_reads.fastq > results/nanopore.sam
Visualizing the results
Tablet is a nice and compact SAM/BAM viewer. It is written in Java and it runs on most systems.
./tools/Tablet/tablet
Open Tablet and load the reference plus a SAM file. For each of the three results files, visually check:
- The number of mismatches, insertions and deletions (Tablet supports different Color schemes to help with this).
- Coverage across the region
Discuss the differences between the three results. For example:
- What type of errors do you see?
- How many of them?
- What will be the likely impact on the assembly?
TAIR10 Reference genome
Our cultivar is not the reference genome A. thaliana known as TAIR10. Map the three read sets against this genome and visualize the results in Tablet:
./data/references/TAIR10_reference1MB.fasta
Discuss:
- Most obvious differences between the mapping results.
- The reference sequences are from two different cultivars. Are the differences between mapping results what you expect?
- Do you see any differences in mapping information for each of the sequencing platforms?
- After assembly, with what be key differences between the assemblied sequences and the TAIR10 genome?
Key Points
PacBio and Nanopore have read lengths of many kilobases
In practice quality of input DNA, library prep. and many other factors will determine read length distribution
PacBio and Nanopore have different types of errors
Used applications
./tools/assembly-stats-master/build/assembly-stats
./tools/minimap2-2.11_x64-linux/minimap2
./tools/Tablet/tablet