QC and evaluation of long read data

Overview

Teaching: 30 min
Exercises: 60-90 min
Questions
  • How to determine basic statistics on read sets

  • What do these statistics mean: are they good or bad?

  • What are read technology specific errors?

Objectives
  • Learn how to get and interpret read set statistics

  • Evaluate read technology specific errors based on read mappings

Introduction

Quality control of input data is an important task in any data analysis and is very much true for de novo assemblies. Biases, high error reads and other unexpected properties will have an impact on the resulting contigs by producing fragmented assemblies and/or chimeric contigs. In this session we will have a closer look at our input data before we continue with the assembly process.

Read statistics

For basic the quality control of the read sets we use assembly-stats:

gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats 
usage: stats [options] <list of fasta/q files>

Reports sequence length statistics from fasta and/or fastq files

options:
-l <int>
	Minimum length cutoff for each sequence.
	Sequences shorter than the cutoff will be ignored [1]
-s
	Print 'grep friendly' output
-t
	Print tab-delimited output
-u
	Print tab-delimited output with no header line
-v
	Print version and exit

This tool will report to the terminal information like average sequence length and N50.

Read statistics

The PacBio reads are in this file: data/raw_data/pacbio_reads.fasta. Get the read statistics on this data set and try to interpret the results: what does each of the entries mean?

Solution

gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats data/raw_data/pacbio_reads.fasta 
stats for data/raw_data/pacbio_reads.fasta
sum = 48280339, n = 3862, ave = 12501.38, largest = 70567
N50 = 19571, n = 840
N60 = 16667, n = 1107
N70 = 13410, n = 1427
N80 = 10361, n = 1833
N90 = 6819, n = 2399
N100 = 74, n = 3862
N_count = 0
Gaps = 0

The Nanopore reads are in this file: data/raw_data/nanopore_reads.fastq. Also check the statistics of the Nanopore data.

Solution

gt-workshop@gtworkshop-VirtualBox:~$ ./tools/assembly-stats-master/build/assembly-stats data/raw_data/nanopore_reads.fastq 
stats for data/raw_data/nanopore_reads.fastq
sum = 20742020, n = 1449, ave = 14314.71, largest = 73689
N50 = 22393, n = 324
N60 = 19386, n = 424
N70 = 15673, n = 544
N80 = 12226, n = 694
N90 = 7987, n = 900
N100 = 384, n = 1449
N_count = 0
Gaps = 0

And finally the Illumina reads: data/raw_data/illumina_R1.fastq and data/raw_data/illumina_R2.fastq

Solution

./tools/assembly-stats-master/build/assembly-stats data/raw_data/illumina_R1.fastq data/raw_data/illumina_R2.fastq 
stats for data/raw_data/illumina_R1.fastq
sum = 21846250, n = 87385, ave = 250.00, largest = 250
N50 = 250, n = 43693
N60 = 250, n = 52431
N70 = 250, n = 61170
N80 = 250, n = 69908
N90 = 250, n = 78647
N100 = 250, n = 87385
N_count = 0
Gaps = 0
-------------------------------------------------------------------------------
stats for data/raw_data/illumina_R2.fastq
sum = 21846250, n = 87385, ave = 250.00, largest = 250
N50 = 250, n = 43693
N60 = 250, n = 52431
N70 = 250, n = 61170
N80 = 250, n = 69908
N90 = 250, n = 78647
N100 = 250, n = 87385
N_count = 0
Gaps = 0

Discuss the difference between the three platforms.

Aligning the reads to the reference genome

A further visual inspection of the data can be done by mapping the reads to the reference genome. For each platform many tools exist to help you with this. We will focus on the newest and/or most commonly used tools. The alignments will give you ideas on SNP rates, biases in the data, heterozygosity and error rates.

The reference genome

The reads are from A. Thaliana cultivar, which is a well-known and often sequenced plant. We can expect that the reads have high similarity when compare to this genome. However, some issues have to be taken into account when mapping reads to a genome. Discuss what effect will be of the following issues when looking at the mapped reads:

  1. The genome of the organism is not known. You will have to map the reads to the genome of a relative.
  2. Your reference genome is highly repetitive.
  3. Your reference genome is not nearly complete.
  4. You have a polyploid organism.
  5. You are faced with issues 2-5

For mapping the reads we will use minimap2:

./tools/minimap2-2.11_x64-linux/minimap2

And a 1MB region of our cultivar (assembled earlier):

./data/references/reference1MB.fasta

Mapping reads

Use minimap2 to map the three read sets to our region of 1MB. You can use the Illumina and PacBio subsampled data to speed things up:

./data/raw_data/illumina_R1.subsample.fastq
./data/raw_data/illumina_R2.subsample.fastq
./data/pacbio_reads.subsample.fasta

The nanopore data has not been subsampled. Run minimap2 without arguments to see which options you can use. For subsequent steps you need SAM output.

Solution

mkdir results
./tools/minimap2-2.11_x64-linux/minimap2 -a -x sr ./data/references/reference1MB.fasta ./data/raw_data/illumina_R1.subsample.fastq ./data/raw_data/illumina_R2.subsample.fastq > ./results/illumina.sam
./tools/minimap2-2.11_x64-linux/minimap2 -x map-pb -a ./data/references/reference1MB.fasta ./data/pacbio_reads.subsample.fasta > results/pacbio_reads.subsample.sam
./tools/minimap2-2.11_x64-linux/minimap2 -a -x map-ont ./data/references/reference1MB.fasta ./data/raw_data/nanopore_reads.fastq > results/nanopore.sam

Visualizing the results

Tablet is a nice and compact SAM/BAM viewer. It is written in Java and it runs on most systems.

./tools/Tablet/tablet

Open Tablet and load the reference plus a SAM file. For each of the three results files, visually check:

  1. The number of mismatches, insertions and deletions (Tablet supports different Color schemes to help with this).
  2. Coverage across the region

Discuss the differences between the three results. For example:

  1. What type of errors do you see?
  2. How many of them?
  3. What will be the likely impact on the assembly?

TAIR10 Reference genome

Our cultivar is not the reference genome A. thaliana known as TAIR10. Map the three read sets against this genome and visualize the results in Tablet:

./data/references/TAIR10_reference1MB.fasta

Discuss:

  1. Most obvious differences between the mapping results.
  2. The reference sequences are from two different cultivars. Are the differences between mapping results what you expect?
  3. Do you see any differences in mapping information for each of the sequencing platforms?
  4. After assembly, with what be key differences between the assemblied sequences and the TAIR10 genome?

Key Points

  • PacBio and Nanopore have read lengths of many kilobases

  • In practice quality of input DNA, library prep. and many other factors will determine read length distribution

  • PacBio and Nanopore have different types of errors

Used applications

  • ./tools/assembly-stats-master/build/assembly-stats

  • ./tools/minimap2-2.11_x64-linux/minimap2

  • ./tools/Tablet/tablet