Assembly of long reads (and Illumina reads)
Overview
Teaching: 30 min
Exercises: 120 min
Questions
How do you assemble NGS reads to create a de novo assembly?
What type of information is given by the assembler?
Objectives
To create de novo assemblies based on PacBio, Nanopore and Illumina data
Learn to interpret assembly results and logs
In this section we focus on creating assemblies using Platanus (Illumina) and Canu (PacBio / Nanopore).
Long read assemblies
We will start with Canu assemblies based on the PacBio and Nanopore data.
./tools/canu-1.7.1/Linux-amd64/bin/canu
Canu can use a configuration file (‘specs file’). For our assemblies we will use these settings:
useGrid=False
corOutCoverage=30 # default is 40
corMinCoverage=4 # default is 4
minOverlapLength=1000 # default is 500
minReadLength=1000
ovlMerDistinct=0.99 # change from auto mode to 0.99 in order to decrease runtime.
maxThreads=1
maxMemory=2.5G
merylMemory=2.5G
merylThreads=1
cormhapThreads=1
cormhapMemory=2.5G
corMemory=2.5G
corThreads=1
obtovlMemory=2.5G
obtovlThreads=1
utgovlMemory=2.5G
utgovlThreads=1
oeaMemory=2.5G
oeaThreads=1
Configuration file
Add the configuration lines above to a text file named canu.spec and save it in the results directory, since the Canu command in this section reads results/canu.spec. The text editor nano is installed in the VM.
You can change the ‘…Memory’ and ‘…Threads’ settings to optimize Canu run time. With useGrid=True on a SLURM or SGE cluster, Canu will automatically distribute its jobs across the cluster.
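If you prefer not to type the settings in nano, the file can also be written from the command line. This is a minimal sketch: it writes the same settings as listed above to results/canu.spec, with the trailing comments left out for brevity, and assumes you are in the directory that holds tools/, data/ and results/.

```shell
# Write the settings from this section to results/canu.spec.
mkdir -p results
cat > results/canu.spec <<'EOF'
useGrid=False
corOutCoverage=30
corMinCoverage=4
minOverlapLength=1000
minReadLength=1000
ovlMerDistinct=0.99
maxThreads=1
maxMemory=2.5G
merylMemory=2.5G
merylThreads=1
cormhapThreads=1
cormhapMemory=2.5G
corMemory=2.5G
corThreads=1
obtovlMemory=2.5G
obtovlThreads=1
utgovlMemory=2.5G
utgovlThreads=1
oeaMemory=2.5G
oeaThreads=1
EOF

# Quick check: print the number of settings that were written.
grep -c '=' results/canu.spec
```

The quoted 'EOF' keeps the shell from expanding anything inside the block, so the file contains the lines exactly as typed.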
Run Canu on PacBio data
You can now run Canu using the following command:
./tools/canu-1.7.1/Linux-amd64/bin/canu -s results/canu.spec -pacbio-raw ./data/raw_data/pacbio_reads.fasta -p canu_pacbio -d results/canu_pacbio genomeSize=1M
Try to find out in the manual what each of the following settings means and how it affects the assembly process:
- corOutCoverage
- corMinCoverage
- minOverlapLength
- minReadLength
- ovlMerDistinct
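To see which values these five settings have in your own spec file while you look them up, you can filter the file with grep. This is only a sketch; the function name show_settings is made up for this lesson and is not part of Canu.

```shell
# show_settings FILE: print the five settings from the list above as set in FILE.
show_settings() {
    grep -E '^(corOutCoverage|corMinCoverage|minOverlapLength|minReadLength|ovlMerDistinct)=' "$1"
}

# Only run when the spec file from this lesson is present.
if [ -f results/canu.spec ]; then
    show_settings results/canu.spec
fi
```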
The Canu log file
During the assembly process, Canu generates a report file. In this case:
./results/canu_pacbio/canu_pacbio.report
While Canu is running, have a look at this file and try to find out which steps Canu takes to assemble the reads and how each step works. What do you see? Find the expected coverage after each step and try to explain what is happening.
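The report grows as Canu moves from one stage to the next, so a quick way to follow along is to print only the section headers and the coverage lines. The sketch below assumes that section headers start a line with "[" and that coverage is reported with the words "times coverage", as in the report shown in this lesson; report_progress is a hypothetical helper name.

```shell
# report_progress FILE: list the section headers and coverage lines of a Canu report.
report_progress() {
    grep -E '^\[|times coverage' "$1"
}

report=results/canu_pacbio/canu_pacbio.report
# The report only exists once Canu has finished its first stage.
if [ -f "$report" ]; then
    report_progress "$report"
fi
```

Run it again every few minutes to see which stages have been added.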
Log file
[CORRECTION/READS] -- -- In gatekeeper store './canu_pacbio.gkpStore': -- Found 3666 reads. -- Found 48155442 bases (48.15 times coverage). -- -- Read length histogram (one '*' equals 3.65 reads): -- 0 999 0 -- 1000 1999 256 ********************************************************************** -- 2000 2999 242 ****************************************************************** -- 3000 3999 217 *********************************************************** -- 4000 4999 214 ********************************************************** -- 5000 5999 190 *************************************************** -- 6000 6999 179 ************************************************ -- 7000 7999 164 ******************************************** -- 8000 8999 163 ******************************************** -- 9000 9999 163 ******************************************** -- 10000 10999 129 *********************************** -- 11000 11999 129 *********************************** -- 12000 12999 137 ************************************* -- 13000 13999 115 ******************************* -- 14000 14999 86 *********************** -- 15000 15999 108 ***************************** -- 16000 16999 98 ************************** -- 17000 17999 95 ************************* -- 18000 18999 92 ************************* -- 19000 19999 82 ********************** -- 20000 20999 65 ***************** -- 21000 21999 75 ******************** -- 22000 22999 65 ***************** -- 23000 23999 66 ****************** -- 24000 24999 54 ************** -- 25000 25999 50 ************* -- 26000 26999 24 ****** -- 27000 27999 47 ************ -- 28000 28999 27 ******* -- 29000 29999 30 ******** -- 30000 30999 41 *********** -- 31000 31999 28 ******* -- 32000 32999 27 ******* -- 33000 33999 29 ******* -- 34000 34999 17 **** -- 35000 35999 16 **** -- 36000 36999 15 **** -- 37000 37999 20 ***** -- 38000 38999 9 ** -- 39000 39999 18 **** -- 40000 40999 9 ** -- 41000 41999 10 ** -- 42000 42999 10 ** -- 43000 43999 12 *** -- 44000 
44999 6 * -- 45000 45999 6 * -- 46000 46999 3 -- 47000 47999 5 * -- 48000 48999 6 * -- 49000 49999 3 -- 50000 50999 2 -- 51000 51999 3 -- 52000 52999 0 -- 53000 53999 0 -- 54000 54999 2 -- 55000 55999 3 -- 56000 56999 1 -- 57000 57999 0 -- 58000 58999 0 -- 59000 59999 2 -- 60000 60999 0 -- 61000 61999 0 -- 62000 62999 0 -- 63000 63999 0 -- 64000 64999 0 -- 65000 65999 0 -- 66000 66999 0 -- 67000 67999 0 -- 68000 68999 0 -- 69000 69999 0 -- 70000 70999 1 [CORRECTION/MERS] -- -- 16-mers Fraction -- Occurrences NumMers Unique Total -- 1- 1 31076870 *******************************************************************--> 0.8726 0.6461 -- 2- 2 2664417 ********************************************************************** 0.9474 0.7569 -- 3- 4 884274 *********************** 0.9652 0.7964 -- 5- 7 435461 *********** 0.9767 0.8338 -- 8- 11 394234 ********** 0.9880 0.8921 -- 12- 16 129015 *** 0.9969 0.9592 -- 17- 22 18684 0.9993 0.9845 -- 23- 29 5477 0.9997 0.9902 -- 30- 37 2438 0.9998 0.9928 -- 38- 46 1339 0.9999 0.9944 -- 47- 56 697 0.9999 0.9954 -- 57- 67 537 1.0000 0.9962 -- 68- 79 299 1.0000 0.9968 -- 80- 92 131 1.0000 0.9972 -- 93- 106 62 1.0000 0.9975 -- 107- 121 53 1.0000 0.9976 -- 122- 137 68 1.0000 0.9977 -- 138- 154 47 1.0000 0.9979 -- 155- 172 29 1.0000 0.9980 -- 173- 191 22 1.0000 0.9981 -- 192- 211 11 1.0000 0.9982 -- 212- 232 11 1.0000 0.9983 -- 233- 254 15 1.0000 0.9983 -- 255- 277 19 1.0000 0.9984 -- 278- 301 7 1.0000 0.9985 -- 302- 326 10 1.0000 0.9985 -- 327- 352 2 1.0000 0.9986 -- 353- 379 2 1.0000 0.9986 -- 380- 407 7 1.0000 0.9986 -- 408- 436 1 1.0000 0.9987 -- 437- 466 4 1.0000 0.9987 -- 467- 497 2 1.0000 0.9987 -- 498- 529 2 1.0000 0.9988 -- 530- 562 2 1.0000 0.9988 -- 563- 596 5 1.0000 0.9988 -- 597- 631 4 1.0000 0.9989 -- 632- 667 3 1.0000 0.9989 -- 668- 704 2 1.0000 0.9990 -- 705- 742 9 1.0000 0.9990 -- 743- 781 5 1.0000 0.9991 -- 782- 821 1 1.0000 0.9992 -- -- 14537 (max occurrences) -- 17023582 (total mers, non-unique) -- 4537419 (distinct mers, 
non-unique) -- 31076870 (unique mers) [CORRECTION/LAYOUT] -- original original -- raw reads raw reads -- category w/overlaps w/o/overlaps -- -------------------- ------------- ------------- -- Number of Reads 3472 194 -- Number of Bases 47006928 254582 -- Coverage 47.007 0.255 -- Median 10895 0 -- Mean 13538 1312 -- N50 19897 6440 -- Minimum 1010 0 -- Maximum 70567 19425 -- -- --------corrected--------- ----------rescued---------- -- evidence expected expected -- category reads raw corrected raw corrected -- -------------------- ------------- ------------- ------------- ------------- ------------- -- Number of Reads 3371 1208 1208 22 22 -- Number of Bases 46798838 30348444 30004655 64841 48498 -- Coverage 46.799 30.348 30.005 0.065 0.048 -- Median 11315 22893 22621 2446 2123 -- Mean 13882 25122 24838 2947 2204 -- N50 19949 25690 25393 3696 2778 -- Minimum 1030 15224 15213 1233 1038 -- Maximum 70567 59605 59594 7116 5103 -- -- --------uncorrected-------- -- expected -- category raw corrected -- -------------------- ------------- ------------- -- Number of Reads 2436 2436 -- Number of Bases 16848225 14328886 -- Coverage 16.848 14.329 -- Median 6393 5474 -- Mean 6916 5882 -- N50 9854 10191 -- Minimum 0 0 -- Maximum 70567 15204 -- -- Maximum Memory 1160509640 [TRIMMING/READS] -- -- In gatekeeper store './canu_pacbio.gkpStore': -- Found 1221 reads. -- Found 28662623 bases (28.66 times coverage). 
-- -- Read length histogram (one '*' equals 1.38 reads): -- 0 999 1 -- 1000 1999 10 ******* -- 2000 2999 5 *** -- 3000 3999 2 * -- 4000 4999 1 -- 5000 5999 0 -- 6000 6999 1 -- 7000 7999 2 * -- 8000 8999 1 -- 9000 9999 5 *** -- 10000 10999 3 ** -- 11000 11999 6 **** -- 12000 12999 7 ***** -- 13000 13999 12 ******** -- 14000 14999 28 ******************** -- 15000 15999 92 ****************************************************************** -- 16000 16999 89 **************************************************************** -- 17000 17999 97 ********************************************************************** -- 18000 18999 80 ********************************************************* -- 19000 19999 69 ************************************************* -- 20000 20999 67 ************************************************ -- 21000 21999 61 ******************************************** -- 22000 22999 53 ************************************** -- 23000 23999 58 ***************************************** -- 24000 24999 45 ******************************** -- 25000 25999 46 ********************************* -- 26000 26999 34 ************************ -- 27000 27999 42 ****************************** -- 28000 28999 18 ************ -- 29000 29999 39 **************************** -- 30000 30999 25 ****************** -- 31000 31999 26 ****************** -- 32000 32999 27 ******************* -- 33000 33999 24 ***************** -- 34000 34999 17 ************ -- 35000 35999 11 ******* -- 36000 36999 17 ************ -- 37000 37999 10 ******* -- 38000 38999 9 ****** -- 39000 39999 13 ********* -- 40000 40999 12 ******** -- 41000 41999 9 ****** -- 42000 42999 7 ***** -- 43000 43999 8 ***** -- 44000 44999 5 *** -- 45000 45999 4 ** -- 46000 46999 6 **** -- 47000 47999 3 ** -- 48000 48999 4 ** -- 49000 49999 2 * -- 50000 50999 0 -- 51000 51999 1 -- 52000 52999 1 -- 53000 53999 3 ** -- 54000 54999 2 * -- 55000 55999 0 -- 56000 56999 0 -- 57000 57999 0 -- 58000 58999 1 [TRIMMING/MERS] -- -- 22-mers 
Fraction -- Occurrences NumMers Unique Total -- 1- 1 1512450 *******************************************************************--> 0.5383 0.0528 -- 2- 2 153671 *************************** 0.5930 0.0635 -- 3- 4 87685 *************** 0.6131 0.0695 -- 5- 7 45032 ******** 0.6312 0.0773 -- 8- 11 36522 ****** 0.6439 0.0858 -- 12- 16 91770 **************** 0.6574 0.1000 -- 17- 22 205924 ************************************* 0.6950 0.1562 -- 23- 29 389583 ********************************************************************** 0.7763 0.3225 -- 30- 37 231995 ***************************************** 0.9150 0.6889 -- 38- 46 39451 ******* 0.9840 0.9161 -- 47- 56 4546 0.9946 0.9593 -- 57- 67 4285 0.9962 0.9675 -- 68- 79 2489 0.9976 0.9763 -- 80- 92 1470 0.9985 0.9827 -- 93- 106 865 0.9990 0.9867 -- 107- 121 633 0.9993 0.9897 -- 122- 137 521 0.9995 0.9921 -- 138- 154 294 0.9997 0.9945 -- 155- 172 143 0.9998 0.9959 -- 173- 191 64 0.9998 0.9967 -- 192- 211 129 0.9999 0.9971 -- 212- 232 145 0.9999 0.9980 -- 233- 254 20 1.0000 0.9991 -- 255- 277 24 1.0000 0.9993 -- 278- 301 12 1.0000 0.9995 -- 302- 326 17 1.0000 0.9996 -- 327- 352 8 1.0000 0.9998 -- 353- 379 4 1.0000 0.9999 -- 380- 407 0 0.0000 0.0000 -- 408- 436 0 0.0000 0.0000 -- 437- 466 0 0.0000 0.0000 -- 467- 497 0 0.0000 0.0000 -- 498- 529 0 0.0000 0.0000 -- 530- 562 1 1.0000 1.0000 -- 563- 596 0 0.0000 0.0000 -- 597- 631 1 1.0000 1.0000 -- 632- 667 0 0.0000 0.0000 -- 668- 704 0 0.0000 0.0000 -- 705- 742 1 1.0000 1.0000 -- 743- 781 0 0.0000 0.0000 -- 782- 821 0 0.0000 0.0000 -- -- 721 (max occurrences) -- 27124532 (total mers, non-unique) -- 1297305 (distinct mers, non-unique) -- 1512450 (unique mers) [TRIMMING/TRIMMING] -- PARAMETERS: -- ---------- -- 1000 (reads trimmed below this many bases are deleted) -- 0.0450 (use overlaps at or below this fraction error) -- 1 (break region if overlap is less than this long, for 'largest covered' algorithm) -- 1 (break region if overlap coverage is less than this many read, for 'largest 
covered' algorithm) -- -- INPUT READS: -- ----------- -- 3666 reads 28662623 bases (reads processed) -- 0 reads 0 bases (reads not processed, previously deleted) -- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed) -- -- OUTPUT READS: -- ------------ -- 1152 reads 26749345 bases (trimmed reads output) -- 67 reads 1478377 bases (reads with no change, kept as is) -- 2447 reads 2136 bases (reads with no overlaps, deleted) -- 0 reads 0 bases (reads with short trimmed length, deleted) -- -- TRIMMING DETAILS: -- ---------------- -- 889 reads 253984 bases (bases trimmed from the 5' end of a read) -- 988 reads 178781 bases (bases trimmed from the 3' end of a read) [TRIMMING/SPLITTING] -- PARAMETERS: -- ---------- -- 1000 (reads trimmed below this many bases are deleted) -- 0.0450 (use overlaps at or below this fraction error) -- INPUT READS: -- ----------- -- 1219 reads 28660487 bases (reads processed) -- 2447 reads 2136 bases (reads not processed, previously deleted) -- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed) -- -- PROCESSED: -- -------- -- 0 reads 0 bases (no overlaps) -- 0 reads 0 bases (no coverage after adjusting for trimming done already) -- 0 reads 0 bases (processed for chimera) -- 0 reads 0 bases (processed for spur) -- 1219 reads 28660487 bases (processed for subreads) -- -- READS WITH SIGNALS: -- ------------------ -- 0 reads 0 signals (number of 5' spur signal) -- 0 reads 0 signals (number of 3' spur signal) -- 0 reads 0 signals (number of chimera signal) -- 0 reads 0 signals (number of subread signal) -- -- SIGNALS: -- ------- -- 0 reads 0 bases (size of 5' spur signal) -- 0 reads 0 bases (size of 3' spur signal) -- 0 reads 0 bases (size of chimera signal) -- 0 reads 0 bases (size of subread signal) -- -- TRIMMING: -- -------- -- 0 reads 0 bases (trimmed from the 5' end of the read) -- 0 reads 0 bases (trimmed from the 3' end of the read) [UNITIGGING/READS] -- -- In gatekeeper store 
'./canu_pacbio.gkpStore': -- Found 1219 reads. -- Found 28227722 bases (28.22 times coverage). -- -- Read length histogram (one '*' equals 1.4 reads): -- 0 999 0 -- 1000 1999 9 ****** -- 2000 2999 5 *** -- 3000 3999 2 * -- 4000 4999 2 * -- 5000 5999 0 -- 6000 6999 2 * -- 7000 7999 2 * -- 8000 8999 4 ** -- 9000 9999 7 ***** -- 10000 10999 4 ** -- 11000 11999 10 ******* -- 12000 12999 9 ****** -- 13000 13999 14 ********** -- 14000 14999 34 ************************ -- 15000 15999 88 ************************************************************** -- 16000 16999 98 ********************************************************************** -- 17000 17999 93 ****************************************************************** -- 18000 18999 87 ************************************************************** -- 19000 19999 63 ********************************************* -- 20000 20999 63 ********************************************* -- 21000 21999 63 ********************************************* -- 22000 22999 50 *********************************** -- 23000 23999 55 *************************************** -- 24000 24999 45 ******************************** -- 25000 25999 43 ****************************** -- 26000 26999 35 ************************* -- 27000 27999 35 ************************* -- 28000 28999 23 **************** -- 29000 29999 37 ************************** -- 30000 30999 24 ***************** -- 31000 31999 31 ********************** -- 32000 32999 21 *************** -- 33000 33999 23 **************** -- 34000 34999 16 *********** -- 35000 35999 11 ******* -- 36000 36999 13 ********* -- 37000 37999 10 ******* -- 38000 38999 8 ***** -- 39000 39999 15 ********** -- 40000 40999 11 ******* -- 41000 41999 9 ****** -- 42000 42999 9 ****** -- 43000 43999 6 **** -- 44000 44999 6 **** -- 45000 45999 3 ** -- 46000 46999 5 *** -- 47000 47999 4 ** -- 48000 48999 3 ** -- 49000 49999 2 * -- 50000 50999 0 -- 51000 51999 1 -- 52000 52999 1 -- 53000 53999 3 ** -- 54000 54999 1 -- 55000 
55999 0 -- 56000 56999 0 -- 57000 57999 0 -- 58000 58999 1 [UNITIGGING/MERS] -- -- 22-mers Fraction -- Occurrences NumMers Unique Total -- 1- 1 1327787 *******************************************************************--> 0.5074 0.0471 -- 2- 2 148857 ************************** 0.5643 0.0576 -- 3- 4 85815 *************** 0.5856 0.0636 -- 5- 7 44407 ******* 0.6044 0.0713 -- 8- 11 38099 ****** 0.6180 0.0800 -- 12- 16 94811 ***************** 0.6336 0.0954 -- 17- 22 208775 ************************************* 0.6742 0.1527 -- 23- 29 390024 ********************************************************************** 0.7635 0.3254 -- 30- 37 227244 **************************************** 0.9123 0.6968 -- 38- 46 35686 ****** 0.9835 0.9179 -- 47- 56 4782 0.9943 0.9596 -- 57- 67 4115 0.9961 0.9685 -- 68- 79 2393 0.9976 0.9770 -- 80- 92 1333 0.9985 0.9833 -- 93- 106 909 0.9989 0.9869 -- 107- 121 560 0.9993 0.9901 -- 122- 137 531 0.9995 0.9923 -- 138- 154 266 0.9997 0.9947 -- 155- 172 128 0.9998 0.9960 -- 173- 191 69 0.9998 0.9967 -- 192- 211 129 0.9999 0.9972 -- 212- 232 131 0.9999 0.9982 -- 233- 254 29 1.0000 0.9991 -- 255- 277 19 1.0000 0.9994 -- 278- 301 8 1.0000 0.9995 -- 302- 326 16 1.0000 0.9997 -- 327- 352 8 1.0000 0.9998 -- 353- 379 4 1.0000 0.9999 -- 380- 407 0 0.0000 0.0000 -- 408- 436 0 0.0000 0.0000 -- 437- 466 0 0.0000 0.0000 -- 467- 497 0 0.0000 0.0000 -- 498- 529 0 0.0000 0.0000 -- 530- 562 1 1.0000 1.0000 -- 563- 596 0 0.0000 0.0000 -- 597- 631 1 1.0000 1.0000 -- 632- 667 0 0.0000 0.0000 -- 668- 704 1 1.0000 1.0000 -- 705- 742 0 0.0000 0.0000 -- 743- 781 0 0.0000 0.0000 -- 782- 821 0 0.0000 0.0000 -- -- 693 (max occurrences) -- 26874336 (total mers, non-unique) -- 1289151 (distinct mers, non-unique) -- 1327787 (unique mers) [UNITIGGING/OVERLAPS] -- category reads % read length feature size or coverage analysis -- ---------------- ------- ------- ---------------------- ------------------------ -------------------- -- middle-missing 0 0.00 0.00 +- 0.00 0.00 +- 0.00 
(bad trimming) -- middle-hump 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming) -- no-5-prime 1 0.08 27774.00 +- 0.00 0.00 +- 0.00 (bad trimming) -- no-3-prime 1 0.08 36615.00 +- 0.00 0.00 +- 0.00 (bad trimming) -- -- low-coverage 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (easy to assemble, potential for lower quality consensus) -- unique 1173 96.23 23010.17 +- 8557.97 28.03 +- 6.19 (easy to assemble, perfect, yay) -- repeat-cont 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (potential for consensus errors, no impact on assembly) -- repeat-dove 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (hard to assemble, likely won't assemble correctly or even at all) -- -- span-repeat 20 1.64 25678.20 +- 9078.95 1071.80 +- 1997.86 (read spans a large repeat, usually easy to assemble) -- uniq-repeat-cont 21 1.72 25507.00 +- 7598.47 (should be uniquely placed, low potential for consensus errors, no impact on assembly) -- uniq-repeat-dove 3 0.25 41063.00 +- 8735.61 (will end contigs, potential to misassemble) -- uniq-anchor 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (repeat read, with unique section, probable bad read) [UNITIGGING/ADJUSTMENT] -- No report available. [UNITIGGING/CONTIGS] -- Found, in version 1, after unitig construction: -- contigs: 2 sequences, total length 1042744 bp (including 0 repeats of total length 0 bp). -- bubbles: 0 sequences, total length 0 bp. -- unassembled: 101 sequences, total length 2168096 bp. -- -- Contig sizes based on genome size -- -- NG (bp) LG (contigs) sum (bp) -- ---------- ------------ ---------- -- 10 919684 1 919684 -- 20 919684 1 919684 -- 30 919684 1 919684 -- 40 919684 1 919684 -- 50 919684 1 919684 -- 60 919684 1 919684 -- 70 919684 1 919684 -- 80 919684 1 919684 -- 90 919684 1 919684 -- 100 123060 2 1042744 -- [UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 2 sequences, total length 1042738 bp (including 0 repeats of total length 0 bp). -- bubbles: 0 sequences, total length 0 bp. -- unassembled: 101 sequences, total length 2168096 bp. 
-- -- Contig sizes based on genome size -- -- NG (bp) LG (contigs) sum (bp) -- ---------- ------------ ---------- -- 10 919604 1 919604 -- 20 919604 1 919604 -- 30 919604 1 919604 -- 40 919604 1 919604 -- 50 919604 1 919604 -- 60 919604 1 919604 -- 70 919604 1 919604 -- 80 919604 1 919604 -- 90 919604 1 919604 -- 100 123134 2 1042738 --
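The coverage values in this report are the number of bases divided by the genome size given on the command line (genomeSize=1M, i.e. 1,000,000 bp). The sketch below repeats that arithmetic with base counts copied from the log above; the helper name coverage is ours, not Canu's. Some sections of the log appear to truncate instead of round, so the last digit can differ slightly.

```shell
# coverage BASES GENOME_SIZE: bases divided by genome size, printed with three decimals.
coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.3f\n", bases / genome }'
}

genome=1000000   # genomeSize=1M from the Canu command

coverage 48155442 "$genome"   # raw reads, [CORRECTION/READS]
coverage 30004655 "$genome"   # expected corrected reads, [CORRECTION/LAYOUT]
coverage 28662623 "$genome"   # reads found in [TRIMMING/READS]
coverage 28227722 "$genome"   # reads found in [UNITIGGING/READS]
```

The drop from roughly 48 times to roughly 30 times coverage during correction is consistent with corOutCoverage=30 in the spec file.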
Run Canu on the Nanopore data
Change the command so that it uses the Nanopore data for the assembly. Follow Canu's progress in the report file. What do you see? Find the expected coverage after each step and try to explain what is happening.
Solution
./tools/canu-1.7.1/Linux-amd64/bin/canu -s results/canu.spec -nanopore-raw ./data/raw_data/nanopore_reads.fastq -p canu_nanopore -d results/canu_nanopore genomeSize=1M
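Before comparing with the report, it can help to count the reads and bases in the input yourself. This is a rough sketch that assumes an uncompressed FASTQ file with exactly four lines per record; fastq_stats is a hypothetical helper name. Canu skips reads shorter than minReadLength, so the numbers in the first section of the report may be somewhat lower than what this prints.

```shell
# fastq_stats FILE GENOME_SIZE: reads, bases and coverage of an uncompressed FASTQ file.
# Assumes four lines per record (sequence on line 2 of each record, not wrapped).
fastq_stats() {
    awk -v genome="$2" '
        NR % 4 == 2 { reads++; bases += length($0) }
        END { printf "%d reads, %d bases, %.2f times coverage\n", reads, bases, bases / genome }
    ' "$1"
}

fastq=./data/raw_data/nanopore_reads.fastq
# Only run when the lesson data is present.
if [ -f "$fastq" ]; then
    fastq_stats "$fastq" 1000000   # genomeSize=1M
fi
```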
Log file
[CORRECTION/READS] -- -- In gatekeeper store './canu_nanopore.gkpStore': -- Found 1412 reads. -- Found 20712153 bases (20.71 times coverage). -- -- Read length histogram (one '*' equals 1.82 reads): -- 0 999 0 -- 1000 1999 128 ********************************************************************** -- 2000 2999 73 *************************************** -- 3000 3999 73 *************************************** -- 4000 4999 60 ******************************** -- 5000 5999 70 ************************************** -- 6000 6999 56 ****************************** -- 7000 7999 54 ***************************** -- 8000 8999 47 ************************* -- 9000 9999 49 ************************** -- 10000 10999 55 ****************************** -- 11000 11999 38 ******************** -- 12000 12999 53 **************************** -- 13000 13999 50 *************************** -- 14000 14999 39 ********************* -- 15000 15999 41 ********************** -- 16000 16999 29 *************** -- 17000 17999 41 ********************** -- 18000 18999 26 ************** -- 19000 19999 36 ******************* -- 20000 20999 35 ******************* -- 21000 21999 17 ********* -- 22000 22999 43 *********************** -- 23000 23999 27 ************** -- 24000 24999 27 ************** -- 25000 25999 17 ********* -- 26000 26999 18 ********* -- 27000 27999 21 *********** -- 28000 28999 18 ********* -- 29000 29999 10 ***** -- 30000 30999 12 ****** -- 31000 31999 18 ********* -- 32000 32999 13 ******* -- 33000 33999 12 ****** -- 34000 34999 10 ***** -- 35000 35999 9 **** -- 36000 36999 6 *** -- 37000 37999 7 *** -- 38000 38999 10 ***** -- 39000 39999 11 ****** -- 40000 40999 4 ** -- 41000 41999 12 ****** -- 42000 42999 5 ** -- 43000 43999 4 ** -- 44000 44999 0 -- 45000 45999 5 ** -- 46000 46999 4 ** -- 47000 47999 2 * -- 48000 48999 0 -- 49000 49999 3 * -- 50000 50999 1 -- 51000 51999 2 * -- 52000 52999 0 -- 53000 53999 2 * -- 54000 54999 2 * -- 55000 55999 0 -- 56000 56999 3 * -- 57000 57999 0 -- 
58000 58999 1 -- 59000 59999 0 -- 60000 60999 1 -- 61000 61999 0 -- 62000 62999 0 -- 63000 63999 1 -- 64000 64999 0 -- 65000 65999 0 -- 66000 66999 0 -- 67000 67999 0 -- 68000 68999 0 -- 69000 69999 0 -- 70000 70999 0 -- 71000 71999 0 -- 72000 72999 0 -- 73000 73999 1 [CORRECTION/MERS] -- -- 16-mers Fraction -- Occurrences NumMers Unique Total -- 1- 1 12378672 *******************************************************************--> 0.8604 0.5983 -- 2- 2 878546 ********************************************************************** 0.9214 0.6832 -- 3- 4 494419 *************************************** 0.9424 0.7269 -- 5- 7 374801 ***************************** 0.9663 0.8006 -- 8- 11 205755 **************** 0.9873 0.9005 -- 12- 16 47074 *** 0.9975 0.9719 -- 17- 22 5756 0.9996 0.9924 -- 23- 29 1472 0.9999 0.9964 -- 30- 37 519 0.9999 0.9980 -- 38- 46 197 1.0000 0.9987 -- 47- 56 72 1.0000 0.9991 -- 57- 67 48 1.0000 0.9992 -- 68- 79 26 1.0000 0.9994 -- 80- 92 18 1.0000 0.9995 -- 93- 106 12 1.0000 0.9995 -- 107- 121 7 1.0000 0.9996 -- 122- 137 4 1.0000 0.9996 -- 138- 154 5 1.0000 0.9997 -- 155- 172 2 1.0000 0.9997 -- 173- 191 3 1.0000 0.9997 -- 192- 211 3 1.0000 0.9997 -- 212- 232 2 1.0000 0.9998 -- 233- 254 2 1.0000 0.9998 -- 255- 277 1 1.0000 0.9998 -- 278- 301 2 1.0000 0.9998 -- 302- 326 3 1.0000 0.9999 -- 327- 352 2 1.0000 0.9999 -- 353- 379 0 0.0000 0.0000 -- 380- 407 0 0.0000 0.0000 -- 408- 436 0 0.0000 0.0000 -- 437- 466 0 0.0000 0.0000 -- 467- 497 0 0.0000 0.0000 -- 498- 529 0 0.0000 0.0000 -- 530- 562 0 0.0000 0.0000 -- 563- 596 0 0.0000 0.0000 -- 597- 631 0 0.0000 0.0000 -- 632- 667 0 0.0000 0.0000 -- 668- 704 0 0.0000 0.0000 -- 705- 742 0 0.0000 0.0000 -- 743- 781 0 0.0000 0.0000 -- 782- 821 0 0.0000 0.0000 -- -- 838 (max occurrences) -- 8312301 (total mers, non-unique) -- 2008753 (distinct mers, non-unique) -- 12378672 (unique mers) [CORRECTION/LAYOUT] -- original original -- raw reads raw reads -- category w/overlaps w/o/overlaps -- -------------------- 
------------- ------------- -- Number of Reads 1395 17 -- Number of Bases 20432047 1293 -- Coverage 20.432 0.001 -- Median 12018 0 -- Mean 14646 76 -- N50 22421 19091 -- Minimum 1000 0 -- Maximum 73689 1293 -- -- --------corrected--------- ----------rescued---------- -- evidence expected expected -- category reads raw corrected raw corrected -- -------------------- ------------- ------------- ------------- ------------- ------------- -- Number of Reads 1395 1357 1357 0 0 -- Number of Bases 20421472 20150737 19623445 0 0 -- Coverage 20.421 20.151 19.623 0.000 0.000 -- Median 12018 12224 11911 0 0 -- Mean 14639 14849 14460 0 0 -- N50 22421 22618 22208 0 0 -- Minimum 1000 1000 6 0 0 -- Maximum 73689 73689 73274 0 0 -- -- --------uncorrected-------- -- expected -- category raw corrected -- -------------------- ------------- ------------- -- Number of Reads 55 55 -- Number of Bases 282603 0 -- Coverage 0.283 0.000 -- Median 1783 0 -- Mean 5138 0 -- N50 18030 0 -- Minimum 0 0 -- Maximum 38417 0 -- -- Maximum Memory 1211898854 [TRIMMING/READS] -- -- In gatekeeper store './canu_nanopore.gkpStore': -- Found 1315 reads. -- Found 19878314 bases (19.87 times coverage). 
-- -- Read length histogram (one '*' equals 1.35 reads): -- 0 999 3 ** -- 1000 1999 95 ********************************************************************** -- 2000 2999 64 *********************************************** -- 3000 3999 62 ********************************************* -- 4000 4999 57 ****************************************** -- 5000 5999 62 ********************************************* -- 6000 6999 53 *************************************** -- 7000 7999 49 ************************************ -- 8000 8999 54 *************************************** -- 9000 9999 47 ********************************** -- 10000 10999 48 *********************************** -- 11000 11999 36 ************************** -- 12000 12999 54 *************************************** -- 13000 13999 40 ***************************** -- 14000 14999 43 ******************************* -- 15000 15999 33 ************************ -- 16000 16999 37 *************************** -- 17000 17999 33 ************************ -- 18000 18999 34 ************************* -- 19000 19999 28 ******************** -- 20000 20999 30 ********************** -- 21000 21999 25 ****************** -- 22000 22999 30 ********************** -- 23000 23999 34 ************************* -- 24000 24999 24 ***************** -- 25000 25999 22 **************** -- 26000 26999 20 ************** -- 27000 27999 14 ********** -- 28000 28999 20 ************** -- 29000 29999 14 ********** -- 30000 30999 7 ***** -- 31000 31999 17 ************ -- 32000 32999 14 ********** -- 33000 33999 6 **** -- 34000 34999 12 ******** -- 35000 35999 12 ******** -- 36000 36999 8 ***** -- 37000 37999 3 ** -- 38000 38999 10 ******* -- 39000 39999 7 ***** -- 40000 40999 8 ***** -- 41000 41999 5 *** -- 42000 42999 8 ***** -- 43000 43999 3 ** -- 44000 44999 7 ***** -- 45000 45999 0 -- 46000 46999 5 *** -- 47000 47999 1 -- 48000 48999 2 * -- 49000 49999 1 -- 50000 50999 2 * -- 51000 51999 0 -- 52000 52999 2 * -- 53000 53999 2 * -- 54000 54999 0 -- 
55000 55999 1 -- 56000 56999 2 * -- 57000 57999 1 -- 58000 58999 0 -- 59000 59999 1 -- 60000 60999 0 -- 61000 61999 1 -- 62000 62999 0 -- 63000 63999 0 -- 64000 64999 1 -- 65000 65999 0 -- 66000 66999 0 -- 67000 67999 0 -- 68000 68999 0 -- 69000 69999 0 -- 70000 70999 1 [TRIMMING/MERS] -- -- 22-mers Fraction -- Occurrences NumMers Unique Total -- 1- 1 3562896 *******************************************************************--> 0.6506 0.1795 -- 2- 2 499386 ********************************************************************** 0.7417 0.2298 -- 3- 4 355037 ************************************************* 0.7817 0.2629 -- 5- 7 251333 *********************************** 0.8250 0.3157 -- 8- 11 229832 ******************************** 0.8638 0.3898 -- 12- 16 255357 *********************************** 0.9040 0.5056 -- 17- 22 227442 ******************************* 0.9496 0.6941 -- 23- 29 84470 *********** 0.9868 0.9002 -- 30- 37 8130 * 0.9986 0.9848 -- 38- 46 1408 0.9995 0.9932 -- 47- 56 716 0.9998 0.9960 -- 57- 67 267 0.9999 0.9977 -- 68- 79 189 0.9999 0.9985 -- 80- 92 89 1.0000 0.9992 -- 93- 106 45 1.0000 0.9995 -- 107- 121 16 1.0000 0.9997 -- 122- 137 10 1.0000 0.9998 -- 138- 154 7 1.0000 0.9999 -- 155- 172 3 1.0000 0.9999 -- 173- 191 1 1.0000 1.0000 -- 192- 211 0 0.0000 0.0000 -- 212- 232 0 0.0000 0.0000 -- 233- 254 0 0.0000 0.0000 -- 255- 277 0 0.0000 0.0000 -- 278- 301 0 0.0000 0.0000 -- 302- 326 0 0.0000 0.0000 -- 327- 352 0 0.0000 0.0000 -- 353- 379 0 0.0000 0.0000 -- 380- 407 0 0.0000 0.0000 -- 408- 436 2 1.0000 1.0000 -- 437- 466 0 0.0000 0.0000 -- 467- 497 0 0.0000 0.0000 -- 498- 529 0 0.0000 0.0000 -- 530- 562 0 0.0000 0.0000 -- 563- 596 0 0.0000 0.0000 -- 597- 631 0 0.0000 0.0000 -- 632- 667 0 0.0000 0.0000 -- 668- 704 0 0.0000 0.0000 -- 705- 742 0 0.0000 0.0000 -- 743- 781 0 0.0000 0.0000 -- 782- 821 0 0.0000 0.0000 -- -- 410 (max occurrences) -- 16287803 (total mers, non-unique) -- 1913740 (distinct mers, non-unique) -- 3562896 (unique mers) 
[TRIMMING/TRIMMING] -- PARAMETERS: -- ---------- -- 1000 (reads trimmed below this many bases are deleted) -- 0.1440 (use overlaps at or below this fraction error) -- 1 (break region if overlap is less than this long, for 'largest covered' algorithm) -- 1 (break region if overlap coverage is less than this many read, for 'largest covered' algorithm) -- -- INPUT READS: -- ----------- -- 1412 reads 19878314 bases (reads processed) -- 0 reads 0 bases (reads not processed, previously deleted) -- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed) -- -- OUTPUT READS: -- ------------ -- 1082 reads 16625139 bases (trimmed reads output) -- 228 reads 2963515 bases (reads with no change, kept as is) -- 102 reads 5692 bases (reads with no overlaps, deleted) -- 0 reads 0 bases (reads with short trimmed length, deleted) -- -- TRIMMING DETAILS: -- ---------------- -- 626 reads 125849 bases (bases trimmed from the 5' end of a read) -- 868 reads 158119 bases (bases trimmed from the 3' end of a read) [TRIMMING/SPLITTING] -- PARAMETERS: -- ---------- -- 1000 (reads trimmed below this many bases are deleted) -- 0.1440 (use overlaps at or below this fraction error) -- INPUT READS: -- ----------- -- 1310 reads 19872622 bases (reads processed) -- 102 reads 5692 bases (reads not processed, previously deleted) -- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed) -- -- PROCESSED: -- -------- -- 0 reads 0 bases (no overlaps) -- 0 reads 0 bases (no coverage after adjusting for trimming done already) -- 0 reads 0 bases (processed for chimera) -- 0 reads 0 bases (processed for spur) -- 1310 reads 19872622 bases (processed for subreads) -- -- READS WITH SIGNALS: -- ------------------ -- 0 reads 0 signals (number of 5' spur signal) -- 0 reads 0 signals (number of 3' spur signal) -- 0 reads 0 signals (number of chimera signal) -- 0 reads 0 signals (number of subread signal) -- -- SIGNALS: -- ------- -- 0 reads 0 bases (size of 5' spur 
signal)
-- 0 reads 0 bases (size of 3' spur signal)
-- 0 reads 0 bases (size of chimera signal)
-- 0 reads 0 bases (size of subread signal)
--
-- TRIMMING:
-- --------
-- 0 reads 0 bases (trimmed from the 5' end of the read)
-- 0 reads 0 bases (trimmed from the 3' end of the read)
[UNITIGGING/READS]
--
-- In gatekeeper store './canu_nanopore.gkpStore':
-- Found 1310 reads.
-- Found 19588654 bases (19.58 times coverage).
--
-- Read length histogram (one '*' equals 1.38 reads):
-- 0 999 0
-- 1000 1999 97 **********************************************************************
-- 2000 2999 68 *************************************************
-- 3000 3999 65 **********************************************
-- 4000 4999 60 *******************************************
-- 5000 5999 54 **************************************
-- 6000 6999 52 *************************************
-- 7000 7999 51 ************************************
-- 8000 8999 50 ************************************
-- 9000 9999 53 **************************************
-- 10000 10999 52 *************************************
-- 11000 11999 36 *************************
-- 12000 12999 55 ***************************************
-- 13000 13999 36 *************************
-- 14000 14999 42 ******************************
-- 15000 15999 35 *************************
-- 16000 16999 35 *************************
-- 17000 17999 34 ************************
-- 18000 18999 30 *********************
-- 19000 19999 28 ********************
-- 20000 20999 28 ********************
-- 21000 21999 25 ******************
-- 22000 22999 34 ************************
-- 23000 23999 29 ********************
-- 24000 24999 25 ******************
-- 25000 25999 21 ***************
-- 26000 26999 20 **************
-- 27000 27999 16 ***********
-- 28000 28999 17 ************
-- 29000 29999 15 **********
-- 30000 30999 8 *****
-- 31000 31999 15 **********
-- 32000 32999 14 **********
-- 33000 33999 8 *****
-- 34000 34999 10 *******
-- 35000 35999 12 ********
-- 36000 36999 9 ******
-- 37000 37999 2 *
-- 38000 38999 11 *******
-- 39000 39999 6 ****
-- 40000 40999 7 *****
-- 41000 41999 7 *****
-- 42000 42999 8 *****
-- 43000 43999 2 *
-- 44000 44999 7 *****
-- 45000 45999 0
-- 46000 46999 4 **
-- 47000 47999 3 **
-- 48000 48999 2 *
-- 49000 49999 1
-- 50000 50999 2 *
-- 51000 51999 1
-- 52000 52999 1
-- 53000 53999 1
-- 54000 54999 0
-- 55000 55999 1
-- 56000 56999 1
-- 57000 57999 1
-- 58000 58999 0
-- 59000 59999 0
-- 60000 60999 0
-- 61000 61999 1
-- 62000 62999 0
-- 63000 63999 0
-- 64000 64999 1
-- 65000 65999 0
-- 66000 66999 0
-- 67000 67999 0
-- 68000 68999 0
-- 69000 69999 0
-- 70000 70999 1
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 3408259 *******************************************************************--> 0.6423 0.1742
-- 2- 2 491958 ********************************************************************** 0.7350 0.2245
-- 3- 4 351719 ************************************************** 0.7758 0.2577
-- 5- 7 249826 *********************************** 0.8202 0.3110
-- 8- 11 230059 ******************************** 0.8601 0.3859
-- 12- 16 257311 ************************************ 0.9017 0.5038
-- 17- 22 225199 ******************************** 0.9490 0.6959
-- 23- 29 81678 *********** 0.9868 0.9023
-- 30- 37 7640 * 0.9986 0.9854
-- 38- 46 1361 0.9995 0.9935
-- 47- 56 689 0.9998 0.9962
-- 57- 67 241 0.9999 0.9979
-- 68- 79 190 0.9999 0.9985
-- 80- 92 80 1.0000 0.9992
-- 93- 106 50 1.0000 0.9996
-- 107- 121 14 1.0000 0.9998
-- 122- 137 8 1.0000 0.9999
-- 138- 154 4 1.0000 0.9999
-- 155- 172 0 0.0000 0.0000
-- 173- 191 1 1.0000 1.0000
-- 192- 211 0 0.0000 0.0000
-- 212- 232 0 0.0000 0.0000
-- 233- 254 0 0.0000 0.0000
-- 255- 277 0 0.0000 0.0000
-- 278- 301 0 0.0000 0.0000
-- 302- 326 0 0.0000 0.0000
-- 327- 352 0 0.0000 0.0000
-- 353- 379 0 0.0000 0.0000
-- 380- 407 2 1.0000 1.0000
-- 408- 436 0 0.0000 0.0000
-- 437- 466 0 0.0000 0.0000
-- 467- 497 0 0.0000 0.0000
-- 498- 529 0 0.0000 0.0000
-- 530- 562 0 0.0000 0.0000
-- 563- 596 0 0.0000 0.0000
-- 597- 631 0 0.0000 0.0000
-- 632- 667 0 0.0000 0.0000
-- 668- 704 0 0.0000 0.0000
-- 705- 742 0 0.0000 0.0000
-- 743- 781 0 0.0000 0.0000
-- 782- 821 0 0.0000 0.0000
--
-- 406 (max occurrences)
-- 16152885 (total mers, non-unique)
-- 1898030 (distinct mers, non-unique)
-- 3408259 (unique mers)
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming)
-- middle-hump 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming)
-- no-5-prime 2 0.15 27410.00 +- 3170.67 0.00 +- 0.00 (bad trimming)
-- no-3-prime 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming)
--
-- low-coverage 1 0.08 1133.00 +- 0.00 3.00 +- 0.00 (easy to assemble, potential for lower quality consensus)
-- unique 1174 89.62 14075.36 +- 10983.06 19.55 +- 4.41 (easy to assemble, perfect, yay)
-- repeat-cont 1 0.08 1440.00 +- 0.00 34.82 +- 0.50 (potential for consensus errors, no impact on assembly)
-- repeat-dove 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 110 8.40 23153.85 +- 13327.07 3478.01 +- 3049.31 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 18 1.37 17150.72 +- 10267.11 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 4 0.31 37788.00 +- 6565.50 (will end contigs, potential to misassemble)
-- uniq-anchor 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 1 sequences, total length 985432 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 192 sequences, total length 2190220 bp.
--
-- Contig sizes based on genome size
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 985432 1 985432
-- 20 985432 1 985432
-- 30 985432 1 985432
-- 40 985432 1 985432
-- 50 985432 1 985432
-- 60 985432 1 985432
-- 70 985432 1 985432
-- 80 985432 1 985432
-- 90 985432 1 985432
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 1 sequences, total length 1001134 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 192 sequences, total length 2190220 bp.
--
-- Contig sizes based on genome size
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1001134 1 1001134
-- 20 1001134 1 1001134
-- 30 1001134 1 1001134
-- 40 1001134 1 1001134
-- 50 1001134 1 1001134
-- 60 1001134 1 1001134
-- 70 1001134 1 1001134
-- 80 1001134 1 1001134
-- 90 1001134 1 1001134
-- 100 1001134 1 1001134
--
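Two numbers in this report can be checked by hand. The coverage in [UNITIGGING/READS] is the number of bases divided by the genome size, and the NG values in the contig tables are measured against the genome size rather than the assembly size. The sketch below assumes that the Nanopore run used genomeSize=1M (as in the PacBio command) and that Canu follows the usual NG definition; the function names are illustrative and are not part of Canu.

```python
def coverage(total_bases, genome_size):
    """Expected coverage: sequenced bases divided by the genome size."""
    return total_bases / genome_size


def ng_length(contig_lengths, genome_size, percent):
    """Length of the contig at which the running sum of contigs (longest
    first) reaches `percent` % of the genome size, or None if it never does."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= genome_size * percent:
            return length
    return None


GENOME_SIZE = 1_000_000  # assumption: genomeSize=1M, as in the PacBio command

# Bases found in the gatekeeper store of the Nanopore run.
print(f"coverage: {coverage(19_588_654, GENOME_SIZE):.2f}x")

# Contig length after unitig construction (version 1 in the report).
print("NG90 :", ng_length([985_432], GENOME_SIZE, 90))
print("NG100:", ng_length([985_432], GENOME_SIZE, 100))

# Contig length after consensus generation (version 2 in the report).
print("NG100:", ng_length([1_001_134], GENOME_SIZE, 100))
```

Under these assumptions the 985432 bp contig reaches NG90 but not NG100, which would explain why the first table stops at 90, while the 1001134 bp consensus contig is longer than the genome size and gets an NG100 row. The coverage computes to roughly 19.59; the report prints 19.58, possibly because the value is truncated rather than rounded.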
Short read assembly
To assemble the Illumina data we will use the Platanus assembler:
./tools/platanus/platanus assemble -k 21 -m 4 -t 4 -f ./data/raw_data/illumina_R1.fastq ./data/raw_data/illumina_R2.fastq -o results/illumina_assembly
The application might not be executable. If that is the case, run:
chmod +x ~/tools/platanus/platanus
Platanus assembler
While the Platanus assembler is running, investigate the following items:
- What does the -k option stand for?
- How do k-mers relate to heterozygosity, ploidy or repeat content?
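To make the second question more concrete, here is a minimal sketch of k-mer counting on two made-up haplotypes that differ at a single position. The sequences, the value of k and the function name are assumptions chosen for illustration; real assemblers and k-mer counters also treat a k-mer and its reverse complement as the same k-mer, which this toy example ignores.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (substring of length k) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical haplotypes of a diploid genome; they differ only at index 8.
hap_a = "ACGTTGCAAGGCTTAC"
hap_b = "ACGTTGCATGGCTTAC"

k = 5
counts = count_kmers([hap_a, hap_b], k)

# k-mers seen in both haplotypes versus k-mers seen in only one of them.
shared = sum(1 for c in counts.values() if c == 2)
hap_specific = sum(1 for c in counts.values() if c == 1)

print(f"k={k}: {shared} shared k-mers, {hap_specific} haplotype-specific k-mers")
```

A single heterozygous position changes every k-mer that overlaps it, so each haplotype contributes k k-mers that occur at only half the coverage of the rest. A larger k therefore spreads one variant over more k-mers, while repeats show up the other way round, as k-mers with unusually high counts.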
Basic statistics on the assemblies
Get statistics
Use the assembly-stats program from the previous session to get the statistics on all three assemblies.
PacBio assembly
./tools/assembly-stats-master/build/assembly-stats ./results/canu_pacbio/canu_pacbio.contigs.fasta
stats for ./results/canu_pacbio/canu_pacbio.contigs.fasta
sum = 1042738, n = 2, ave = 521369.00, largest = 919604
N50 = 919604, n = 1
N60 = 919604, n = 1
N70 = 919604, n = 1
N80 = 919604, n = 1
N90 = 123134, n = 2
N100 = 123134, n = 2
N_count = 0
Gaps = 0
Nanopore assembly
./tools/assembly-stats-master/build/assembly-stats ./results/canu_nanopore/canu_nanopore.contigs.fasta
stats for ./results/canu_nanopore/canu_nanopore.contigs.fasta
sum = 1001134, n = 1, ave = 1001134.00, largest = 1001134
N50 = 1001134, n = 1
N60 = 1001134, n = 1
N70 = 1001134, n = 1
N80 = 1001134, n = 1
N90 = 1001134, n = 1
N100 = 1001134, n = 1
N_count = 0
Gaps = 0
Illumina assembly
./tools/assembly-stats-master/build/assembly-stats ./results/illumina_assembly_contig.fa
stats for ./results/illumina_assembly_contig.fa
sum = 1024211, n = 382, ave = 2681.18, largest = 146639
N50 = 55012, n = 6
N60 = 42739, n = 8
N70 = 37838, n = 10
N80 = 27857, n = 13
N90 = 3330, n = 28
N100 = 91, n = 382
N_count = 0
Gaps = 0
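To see where the N50 to N100 values come from, the sketch below recomputes them for the PacBio assembly. It assumes the usual definition: sort the contigs from longest to shortest and add them up until the given percentage of the total assembly length is reached. The helper nx is illustrative and is not part of assembly-stats; the two contig lengths are taken from the PacBio output (largest = 919604, and the remaining 123134 bp contig).

```python
def nx(lengths, percent):
    """Return (Nx, n): the contig length at which the running sum of contigs,
    sorted longest first, reaches `percent` % of the assembly size, and the
    number of contigs needed to get there."""
    total = sum(lengths)
    running = 0
    for n, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 100 >= total * percent:
            return length, n
    raise ValueError("no contigs given, or percent above 100")


# Contig lengths of the PacBio assembly (sum = 1042738, n = 2).
pacbio = [919_604, 123_134]

for percent in (50, 60, 70, 80, 90, 100):
    length, n = nx(pacbio, percent)
    print(f"N{percent} = {length}, n = {n}")
```

The first contig alone covers about 88 % of the assembly, so N50 to N80 all equal 919604 and the second contig is only needed from N90 onwards. Unlike the NG values in the Canu report, these statistics are relative to the assembly size, not to the expected genome size.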
Discuss the results and compare the three assemblies. Do the results match your expectations? Which assembly do you prefer, and why?
Additional challenge
Experiment with some of Canu's settings and check how they affect the assembly process and the end result. For example, change the minimum read length (minReadLength) or the error-correction settings (corOutCoverage, corMinCoverage).
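One way to organise such experiments is to derive a separate spec file and output directory per run, so that results do not overwrite each other. The sketch below only prints the spec contents and the matching Canu command; it does not write files or start Canu. The run names, the spec file paths and the chosen minReadLength values are hypothetical, and the command line simply reuses the options from the PacBio command earlier in this lesson.

```python
import shlex

CANU = "./tools/canu-1.7.1/Linux-amd64/bin/canu"
READS = "./data/raw_data/pacbio_reads.fasta"


def override_spec(spec_text, **overrides):
    """Return the spec text with the given key=value settings replaced,
    appending any setting that was not present yet."""
    remaining = dict(overrides)
    lines = []
    for line in spec_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


def canu_command(spec_path, prefix):
    """Build the Canu command from this lesson for a given spec and prefix."""
    return [CANU, "-s", spec_path, "-pacbio-raw", READS,
            "-p", prefix, "-d", f"results/{prefix}", "genomeSize=1M"]


# Two of the settings from canu.spec; the full file would be used in practice.
base_spec = "corOutCoverage=30\nminReadLength=1000\n"

for min_len in (1000, 2000, 5000):           # hypothetical values to try
    prefix = f"canu_pacbio_minlen{min_len}"  # hypothetical run name
    spec_path = f"results/{prefix}.spec"     # hypothetical spec location
    print(f"# contents of {spec_path}")
    print(override_spec(base_spec, minReadLength=min_len), end="")
    print(shlex.join(canu_command(spec_path, prefix)))
    print()
```

Changing one setting per run keeps the comparison interpretable: any difference in the report or in the assembly statistics can then be attributed to that setting.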
Key Points
Each platform and each data set requires hands-on work with the assembler.
Used applications
./tools/canu-1.7.1/Linux-amd64/bin/canu
./tools/platanus/platanus