Assembling a genome from short reads (e.g. Illumina) using SPAdes#

Acknowledgements

This walkthrough was adapted from Conor Meehan's Pathogen genomics course under a CC BY-NC-SA 4.0 license. The walkthrough was modified to include CLIMB-BIG-DATA specific intructions and additional explanation of certain steps.

De novo genome assembly is a process used to reconstruct the complete genome sequence of an organism without the need for a reference genome. In the case of bacterial genomes, de novo assembly involves piecing together short DNA sequences, called reads, obtained from high-throughput sequencing technologies. In this walkthrough, you will learn how to use SPAdes to assemble short reads into a genome. If you would like to read more about genome assembly, there is more information here.

Other tools associated with SPAdes can be used to assemble metagenomes sequenced using short read technologies (metaSPAdes) or viral genomes (e.g. rnaviralSPAdes or coronaSPAdes). This is not covered here but is outlined in the SPAdes github and associated metaSPAdes and coronaSPAdes papers.

We will:

Manage our software environment using conda
Download some sequenced reads into your CLIMB-BIG-DATA session
Trim the reads to remove low quality bases and reads with trimmomatic
Assemble the reads into a genome using SPAdes

This demonstration uses one of the samples from Hikichi et al which are the DRR187559_1.fastqsanger.bz2 and DRR187559_2.fastqsanger.bz2 files from this zenodo record

BEGIN#

Launch terminal#

Select File -> New -> Terminal, or click the Terminal icon on the launcher pane. You'll get a new terminal tab in the activity bar, and find yourself in a bash shell.

Terminal

This will open a new terminal session

Terminal

Create a folder to store the sequenced read data and the genome assembly#

Create a folder called spades-demo followed by your name in the shared-team directory, (i.e. spades-demo-rupert) and move into it. This is where we will store the data and the genome assembly.

mkdir ~/shared-team/spades-demo-rupert
cd ~/shared-team/spades-demo-rupert

Download the fastq raw data into this folder. These are somewhat large files and will take about one minute.

wget https://zenodo.org/record/4534098/files/DRR187559_1.fastqsanger.bz2
wget https://zenodo.org/record/4534098/files/DRR187559_2.fastqsanger.bz2

Info

Remember you can copy-paste commands from this walkthrough into the terminal.

Trimmomatic and SPAdes prefer gzipped files (files that end in .gz) but our downloaded file is a bzip2 file (ends in .bz2). We need to convert these types before we proceed. We will also rename them into more conventional naming scheme for illumina reads (that is, they end in _1.fastq.gz and _2.fastq.gz)

bzcat DRR187559_1.fastqsanger.bz2 | gzip -c > DRR187559_1.fastq.gz
bzcat DRR187559_2.fastqsanger.bz2 | gzip -c > DRR187559_2.fastq.gz

Most sequencers produce .gz files so this is not necessary if your file is already in that format.

Use ls to check that the files are there and have the correct names

ls -ahl

Which should give you something like this:

jovyan:~/shared-team/spades-demo-rupert$ ls -ahl 
total 272M
drwxr-xr-x 2 jovyan users   4 Jul 20 17:17 .
drwx------ 7 jovyan  1000   9 Jul 20 17:13 ..
-rw-r--r-- 1 jovyan users 73M Jul 20 17:17 DRR187559_1.fastq.gz
-rw-r--r-- 1 jovyan users 61M Jul 13 21:00 DRR187559_1.fastqsanger.bz2
-rw-r--r-- 1 jovyan users 76M Jul 20 17:17 DRR187559_2.fastq.gz
-rw-r--r-- 1 jovyan users 63M Jul 13 21:00 DRR187559_2.fastqsanger.bz2

Install SPAdes and trimmomatic using Conda#

It is recommended to always install packages for a project in their own environments so here will we create an enironment and install SPAdes in one step.

conda create --solver libmamba -y -n assembly -c bioconda spades trimmomatic
conda activate assembly

Warning

Remember you can not install packages in the base Conda environment. You must create a new environment and activate it first.

Use trimmomatic to filter out bad reads and trim the ends as needed#

The order of commands for paired end reads is:

trimmomatic PE <forward reads input name> 
               <reverse reads input name> 
               <trimmed forward reads out name> 
               <removed forward reads out name> 
               <trimmed reverse reads out name> 
               <removed reverse reads out name> 
               <trimming options>

We will trim the reads in the follow ways:

Remove all reads shorter than 30bp MINLEN:30
Remove bases from the start of a read if they are below a quality score of 20, LEADING:20
Remove bases from the END of a read if they are below a quality score of 20, TRAILING:20
Use a sliding window of 4 bases across the read, removing any set of 4 bases where the average score drops below 20, SLIDINGWINDOW:4:20

trimmomatic PE DRR187559_1.fastq.gz DRR187559_2.fastq.gz DRR187559_trimmed_1.fastq.gz DRR187559_removed_1.fastq.gz DRR187559_trimmed_2.fastq.gz DRR187559_removed_2.fastq.gz  MINLEN:30 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20

trimmomatic running

At this stage you should use Fastqc to look at the read quality and see if it is ok, but we will skip this step as it is not always needed.

Assemble the genome using SPAdes#

The -1 and -2 flags are the forward and reverse trimmed reads that came from trimmomatic, respectively.
-ois the anme of a directory to store the output in
-t is the number of threads to use. This should be the number of cores you have available.
-m is the amount of memory that SPAdes can use. This should be the amount of memory you have available.

spades.py -1 DRR187559_trimmed_1.fastq.gz -2 DRR187559_trimmed_2.fastq.gz -o DRR187559_spades -t 6 -m 8

Examining the output#

SPAdes will mention all the output files in the log. All the output will be in the folder we specified with the -o flag.

===== Terminate finished. 

 * Corrected reads are in /shared/team/spades-demo-rupert/DRR187559_spades/corrected/
 * Assembled contigs are in /shared/team/spades-demo-rupert/DRR187559_spades/contigs.fasta
 * Assembled scaffolds are in /shared/team/spades-demo-rupert/DRR187559_spades/scaffolds.fasta
 * Paths in the assembly graph corresponding to the contigs are in /shared/team/spades-demo-rupert/DRR187559_spades/contigs.paths
 * Paths in the assembly graph corresponding to the scaffolds are in /shared/team/spades-demo-rupert/DRR187559_spades/scaffolds.paths
 * Assembly graph is in /shared/team/spades-demo-rupert/DRR187559_spades/assembly_graph.fastg
 * Assembly graph in GFA format is in /shared/team/spades-demo-rupert/DRR187559_spades/assembly_graph_with_scaffolds.gfa

The important output files:

scaffolds.fasta: The final assembly you should use
assembly_graph.fastg: Assembly graph
spades.log: Full log file for bug reporting

Use the head command to have a look at the final output assembly file:

head DRR187559_spades/scaffolds.fasta

Which should give you something like this:

>NODE_1_length_419723_cov_11.898846
TACAAAGGAGAAAAAAAGAAGACAACCAAGCCCAATAATGGACTGGCCGCCTAATAATAA
AAACTCTAAAAGTTGTAATTTAAAATAGTTCTTTAAATTATATACCCACCACATTTGGTG
GAGAACCAAAAATTAGCCGAAAAACATCATTTCTGAAGTTATCGGCTAAAGTTATAAATT
ATATTTATTTGTACATGAACAAATAATTTACATTAATTTGTCATTTCTTCTTTTTCCCAA
TCGATTTTATATCTTTCTGAAGAACGATCTGTCCATTTATCTTTAGTATTGGTACCTTTC
CAATTTGTTGAAGTCCAATGCAATTGGTAGTCATCACGAACTCGTTCGTATATTACATCT
ATATTTGTTTGTTGTTTGGATGCTTTTCTATCCATAGTAATAACTGTAGCGAAGTCTGGT
GAAAACCCTGAAGATAATAGAGAACTTGCTTTGTTAGGATCAAGGAAGTTCTCTGCTGCT
TTCATAGAACCATTTCTAGTTTTCATGAAAAGTTGATTGCCATATACCGGGTTCCAAGAA

Deactivate your Conda environment when you are finished

conda deactivate

Post assembly steps#

Once assembly is finished you can do a more in depth check of the quality and completeness using the BUSCO and Bandage. You can get basic metrics such as N50 using the tool Quast which can be downloaded via Conda