Use cases

Choosing an assembly strategy

SqueezeMeta can be run in four different assembly modes, depending on the type of multi-metagenome support. These modes are:

Sequential mode: All samples are treated individually and analysed sequentially.
Coassembly mode: Reads from all samples are pooled and a single assembly is performed. Reads from individual samples are then mapped to the coassembly to obtain gene abundances in each sample. Binning methods are then used to obtain genome bins.
Merged mode: If many large samples are available, coassembly could crash because of memory requirements. This mode achieves a comparable result with a procedure inspired by the one used by Benjamin Tully for analysing TARA Oceans data. Briefly, samples are assembled individually and the resulting contigs are merged into a single coassembly. The analysis then proceeds as in the coassembly mode. This is not the recommended procedure (use coassembly if possible), since the chance of creating chimeric contigs is higher, but it is a viable alternative on smaller computers where standard coassembly is not feasible.
Seqmerge mode: This is intended to work with more samples than the merged mode. Instead of merging all individual assemblies in a single step, which can be very computationally demanding, seqmerge works sequentially. First, it assembles all samples individually, as in merged mode. It then merges the two most similar assemblies, with similarity measured as Amino Acid Identity values using the wonderful CompareM software by Donovan Parks. After this first merge, it re-evaluates similarity and merges again, proceeding this way until all metagenomes have been merged into one. Therefore, n metagenomes require n-1 merging steps.
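
The greedy order of merges in seqmerge mode can be illustrated with a small Python sketch. This is not the SqueezeMeta implementation: each assembly is modelled as a set of toy labels, a Jaccard index stands in for the Amino Acid Identity computed by CompareM, and the names seqmerge_order and jaccard are hypothetical. It only shows why n assemblies need n-1 merging steps.

```python
from itertools import combinations


def jaccard(a, b):
    """Toy similarity between two assemblies modelled as sets of labels.

    Stand-in for the Amino Acid Identity used by the real pipeline.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seqmerge_order(assemblies, similarity=jaccard):
    """Greedily merge the two most similar assemblies until one remains.

    `assemblies` maps a name to a set of toy labels. Returns the list of
    merges performed, in order.
    """
    pool = dict(assemblies)
    merges = []
    while len(pool) > 1:
        # Pick the most similar pair among the assemblies still in the pool.
        left, right = max(
            combinations(sorted(pool), 2),
            key=lambda pair: similarity(pool[pair[0]], pool[pair[1]]),
        )
        # Toy "merge": the union of both label sets.
        pool[f"{left}+{right}"] = pool.pop(left) | pool.pop(right)
        merges.append((left, right))
    return merges


toy = {
    "S1": {"a", "b", "c"},
    "S2": {"a", "b", "d"},
    "S3": {"x", "y"},
    "S4": {"x", "y", "z"},
}
steps = seqmerge_order(toy)
for left, right in steps:
    print(f"merge {left} with {right}")
print(len(steps), "merging steps for", len(toy), "assemblies")
```

Because similarity is re-evaluated after every merge, the cost grows quickly with the number of samples, which is consistent with the slowness described for large datasets.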

Note

Note that the merged and seqmerge modes work well as a substitute for coassembly when running small datasets on computers with low memory (e.g. 16 GB), but are very slow for analyzing large datasets (>10 samples), even on workstations with plenty of resources. Still, setting -contiglen to 1000 or higher can make seqmerge a viable strategy even in those cases. Otherwise, we recommend using either the sequential or the coassembly mode.

Regarding the choice of assembler, MEGAHIT and SPAdes work better with short Illumina reads, while Canu and Flye support long reads from PacBio or ONT MinION. MEGAHIT (the default in SqueezeMeta) is more resource-efficient than SPAdes, consuming less memory, but SPAdes supports more analysis modes and produces slightly better assembly statistics. SqueezeMeta can call SPAdes in three different ways. The option -a spades is meant for metagenomic datasets, and will automatically add the flags --meta -k 21,33,55,77,99,127 to the spades.py call. Likewise, -a rnaspades will add the flags --rna -k 21,33,55,77,99,127. Finally, the option -a spades_base will add no additional flags to the spades.py call. This can be used in conjunction with --assembly_options when one wants to fully customize the call to SPAdes, e.g. for assembling single cell genomes.
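
The three SPAdes-based options can be summarised as a lookup table. The Python sketch below only restates the flags listed above. The helper spades_extra_flags is hypothetical and is not part of SqueezeMeta.

```python
# Extra flags that, per the text above, are added to the spades.py call
# for each SPAdes-based value of -a.
KMERS = "21,33,55,77,99,127"
SPADES_EXTRA_FLAGS = {
    "spades": ["--meta", "-k", KMERS],
    "rnaspades": ["--rna", "-k", KMERS],
    "spades_base": [],  # nothing added: the user customizes the call
}


def spades_extra_flags(assembler):
    """Return the flags added for a given value of -a (hypothetical helper)."""
    try:
        return list(SPADES_EXTRA_FLAGS[assembler])
    except KeyError:
        raise ValueError(f"not a SPAdes-based option: {assembler!r}") from None


for option in SPADES_EXTRA_FLAGS:
    flags = " ".join(spades_extra_flags(option)) or "(no extra flags)"
    print(f"-a {option}: {flags}")
```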

Analyzing user-supplied assemblies or bins

A user-supplied assembly can be passed to SqueezeMeta with the flag -extassembly <your_assembly.fasta>. The contigs in that fasta file will be analyzed by the SqueezeMeta pipeline starting from step 2. With this, you will be able to annotate your assembly, estimate its abundance in your metagenomes/metatranscriptomes, and perform binning on it.

Additionally, a set of pre-existing genomes and bins can be passed to SqueezeMeta with the flag -extbins <path_to_dir_with_bins>. This will work similarly to -extassembly, but SqueezeMeta will treat each fasta file in the input directory as an individual bin.

Analyzing metatranscriptomes

SqueezeMeta can be used for de-novo metatranscriptomic assembly, annotation and quantification. Usage is similar to the analysis of metagenomes, though we also recommend providing the --nobins flag to skip binning.

Regarding the choice of assembler, we have obtained good results with rnaSPAdes (-a rnaspades) although your mileage may vary.

If you have a pre-existing reference assembly or collection of genomes/bins you can use the -extassembly or -extbins flags and skip de-novo assembly, and instead just map the metatranscriptomic reads back to the reference to quantify gene expression.

Combined analysis of metagenomes and metatranscriptomes

SqueezeMeta allows the combined analysis of metagenomes and metatranscriptomes in the same run. The recommended way of doing this is to perform de-novo assembly and binning using only the metagenomes, and then map the metatranscriptomic reads back to the assembly to estimate the expression of each contig/gene.

This can be achieved by adding the noassembly and nobinning tags to the metatranscriptomic samples in your samples file. See The samples file for details.

An example would be:

Sample1_DNA Sample1_metagenom_R1.fastq.gz    pair1
Sample1_DNA Sample1_metagenom_R2.fastq.gz    pair2
Sample1_RNA Sample1_metatrans_R1.fastq.gz    pair1   noassembly      nobinning
Sample1_RNA Sample1_metatrans_R2.fastq.gz    pair2   noassembly      nobinning
Sample2_DNA Sample2_metagenom_R1.fastq.gz    pair1
Sample2_DNA Sample2_metagenom_R2.fastq.gz    pair2
Sample2_RNA Sample2_metatrans_R1.fastq.gz    pair1   noassembly      nobinning
Sample2_RNA Sample2_metatrans_R2.fastq.gz    pair2   noassembly      nobinning
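
A misspelled tag in the samples file is easy to miss, so it can be worth checking the file before launching a run. The Python sketch below is a hypothetical helper, not part of SqueezeMeta. It assumes whitespace-separated columns (sample, file, pair, optional tags) and only knows the pair values and tags shown in this section. The real samples file may accept more.

```python
KNOWN_PAIRS = {"pair1", "pair2"}
KNOWN_TAGS = {"noassembly", "nobinning"}  # only the tags named in this section


def check_samples_file(text):
    """Return a list of (line_number, message) problems found in `text`."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # assumption: no whitespace inside file names
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            problems.append((number, "expected: sample, file, pair[, tags]"))
            continue
        pair, tags = fields[2], fields[3:]
        if pair not in KNOWN_PAIRS:
            problems.append((number, f"unknown pair value {pair!r}"))
        for tag in tags:
            if tag not in KNOWN_TAGS:
                problems.append((number, f"unknown tag {tag!r}"))
    return problems


example = """\
Sample1_DNA Sample1_metagenom_R1.fastq.gz pair1
Sample1_RNA Sample1_metatrans_R1.fastq.gz pair1 noassebly nobinning
Sample1_RNA Sample1_metatrans_R2.fastq.gz pair2 noassembly nobinning
"""
for number, message in check_samples_file(example):
    print(f"line {number}: {message}")
```

Here the second line of the toy input carries a deliberately misspelled tag, which the check reports.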

If you have a pre-existing reference assembly or collection of genomes/bins, you can use the -extassembly or -extbins flags and skip de-novo assembly (but if you intend to perform binning, the nobinning tag should still be added to the metatranscriptomes in the samples file).

Alternative analysis modes

In addition to the main SqueezeMeta pipeline, we provide extra scripts that enable the analysis of individual reads and the annotation of sequences:

1) sqm_reads.pl: This script performs taxonomic and functional assignments on individual reads rather than contigs. This can be useful when the assembly quality is low, or when looking for low abundance functions that might not have enough coverage to be assembled.

2) sqm_longreads.pl: This script performs taxonomic and functional assignments on individual reads rather than contigs, assuming that more than one ORF can be found in the same read (e.g. as happens in PacBio or MinION reads).

3) sqm_hmm_reads.pl: This script provides a wrapper for the Short-Pair software, which allows screening the reads for particular functions using an ultra-sensitive HMM algorithm.

4) sqm_mapper.pl: This script maps reads to a given reference using one of the included sequence aligners (Bowtie2, BWA), and provides estimates of the abundance of the contigs and ORFs in the reference. Alternatively, it can be used to filter out the reads mapping to a given reference.

5) sqm_annot.pl: This script performs functional and taxonomic annotation for a set of genes, for instance those encoded in a genome (or sets of contigs).

Working with Oxford Nanopore MinION and PacBio reads

Since version 0.3.0, SqueezeMeta has been able to work seamlessly with single-end reads. In order to obtain better mappings of MinION and PacBio reads against the assembly, we advise using minimap2 for read counting, by including the -map minimap2-ont (MinION) or -map minimap2-pb (PacBio) flags when calling SqueezeMeta. We also include the Canu and Flye assemblers, which are specially tailored to work with long, noisy reads. They can be selected by including the -a canu or -a flye flag when calling SqueezeMeta. As a shortcut, the --minion flag will use both Canu and minimap2 for Oxford Nanopore MinION reads. As an alternative to assembly, we also provide the sqm_longreads.pl script, which will predict and annotate ORFs within individual long reads.

Working in a low-memory environment

In our experience, assembly and DIAMOND alignment against the nr database are the most memory-hungry parts of the pipeline. By default, SqueezeMeta will set up the right parameters for DIAMOND and the Canu assembler based on the available memory in the system. DIAMOND memory usage can be manually controlled via the -b parameter (DIAMOND will consume ~5*b GB of memory according to the documentation, but to be safe we set -b to free_ram/8). Assembly memory usage is trickier, as memory requirements increase with the number of reads in a sample. We have managed to run SqueezeMeta with as many as 42M 2x100 Illumina HiSeq pairs on a virtual machine with only 16 GB of memory. Conceivably, larger samples could be split and assembled in chunks using the merged mode. We include the shortcut flag --lowmem, which will set the DIAMOND block size to 3 and Canu memory usage to 15 GB. This is enough to make SqueezeMeta run on 16 GB of memory, and allows the in situ analysis of Oxford Nanopore MinION reads. Under such computational limitations, we have been able to coassemble and analyze 10 MinION metagenomes (taken from SRA project SRP163045) in less than 4 hours.

Tips for working in a computing cluster

SqueezeMeta will work fine inside a computing cluster, but there are some extra things that must be taken into account. Here is a list in progress based on frequent issues that have been reported.

Run test_install.pl to make sure that everything is properly configured
If using the conda environment, make sure that it is properly activated by your batch script
If an administrator has set up SqueezeMeta for you (and you have no write privileges in the installation directory), make sure they have run make_databases.pl, download_databases.pl or configure_nodb.pl according to the installation instructions. Once again, test_install.pl should tell you whether things seem to be ok
Make sure to request enough memory. See the previous section for a rough guide on what is “enough”. If you get a crash during the assembly or during the annotation step, it is likely because you ran out of memory
Make sure to manually set the -b parameter so that it matches the amount of memory that you requested divided by 8. Otherwise, SqueezeMeta will assume that it can use all the free memory in the node in which it is running. This is fine if you got a full node for yourself, but will lead to crashes otherwise
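
The rule of thumb for -b can be written down as a small calculation. The Python sketch below is illustrative only: the function names are hypothetical, the division by 8 and the ~5*b GB estimate are the figures quoted in this guide, and rounding to one decimal is an assumption.

```python
def diamond_block_size(requested_mem_gb):
    """Value for -b following the rule of thumb: requested memory (GB) / 8."""
    if requested_mem_gb <= 0:
        raise ValueError("requested memory must be positive")
    return round(requested_mem_gb / 8, 1)


def estimated_diamond_mem_gb(block_size):
    """Rough DIAMOND memory use (~5 * b GB), as quoted in this guide."""
    return 5 * block_size


for mem in (16, 64, 128):
    b = diamond_block_size(mem)
    print(f"{mem} GB requested: -b {b}, DIAMOND needs ~{estimated_diamond_mem_gb(b)} GB")
```

Dividing by 8 instead of 5 leaves headroom for the rest of the pipeline, which matters when the batch scheduler kills jobs that exceed their memory request.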

Downstream analysis of SqueezeMeta results

SqueezeMeta comes with a variety of options to explore the results and generate different plots. These are fully described in the documentation and in the wiki. Briefly, the three main ways to analyze the output of SqueezeMeta are the following:

1) Integration with R: We provide the SQMtools R package, which makes it easy to load a whole SqueezeMeta project and expose the results in R. The package includes functions to select particular taxa or functions and generate plots, as well as bindings for other popular microbiome analysis packages such as microeco and phyloseq. Additionally, the package exposes all the data generated by SqueezeMeta in R so it can be used with other third-party R packages or for custom analysis scripts. See examples here. SQMtools can also be used on Mac or Windows, meaning that you can run SqueezeMeta on your Linux server and then move the results to your own computer and analyze them there. See advice for this below.

2) Integration with the anvi’o analysis pipeline: We provide a compatibility layer for loading SqueezeMeta results into the anvi’o analysis and visualization platform (http://merenlab.org/software/anvio/). This includes a built-in query language for selecting the contigs to be visualized in the anvi’o interactive interface. See examples here.

We also include utility scripts for generating iTOL- and Pavian-compatible outputs.

Analyzing SqueezeMeta results in your desktop computer

Many users run SqueezeMeta remotely (e.g. in a computing cluster). However, it is easier to explore the results interactively from your own computer. Since version 1.6.2, we provide an easy way to achieve this.

1) In the system in which you ran SqueezeMeta, run the utility script sqm2zip.py with

sqm2zip.py /path/to/my_project /output/dir

Here, /path/to/my_project is the path to the output of SqueezeMeta, and /output/dir is an arbitrary output directory.

2) This will generate a file in /output/dir named my_project.zip, which contains the essential files needed to load your project into SQMtools. Transfer this file to your desktop computer.

3) Assuming R is present in your desktop computer, you can install SQMtools with:

if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager")}
BiocManager::install("SQMtools")

This will work seamlessly on Windows and Mac computers. On Linux, you may first need to install the libcurl development library.

4) You can load the project directly from the zip file (no need for decompressing) with

library(SQMtools)
SQM = loadSQM("/path/to/my_project.zip")