The SQMtools R package
This package provides an easy way to expose the different results of SqueezeMeta (orfs, contigs, bins, taxonomy, functions…) into R, as well as a set of utility functions to filter and display SqueezeMeta results.
Note
The documentation below aims to describe the philosophy, usage, and internals of the SQMtools package, initially described here. For detailed (albeit maybe a bit outdated) usage examples see also the wiki entry.
Design philosophy
SQMtools aims to simplify the analysis of complex metagenomic datasets by representing them as a single object inside the R environment for statistical analysis. Once a project has been loaded into R with SQMtools the user can:
Make basic plots representing its taxonomical and functional composition
Subset it to select only certain taxa, functions or bins/MAGs
Access the individual components of the object (e.g. taxonomy tables, contig sequences, bin qualities, etc) in order to perform custom analyses or feed the data to other R packages. In particular, we provide bindings for microeco and phyloseq, but all the data can also be accessed directly.
The main idea behind SQMtools is to speed up the exploration of metagenomic results by facilitating data inspection and filtering. A standard workflow in SQMtools usually involves:
Inpecting taxonomic/functional patterns in the whole project (e.g. plotting taxonomic distribution across samples)
Locating certain taxa/functions of interest (based on the results of the preliminary inspection, or on prior knowledge about the study system) and generating a subset containing only those taxa/functions
Inspecting taxonomic/functional patterns in the subset. For example, making a subset containing some functions of interest and then making a taxonomic plot of that subset will inform us of the relative abundance and taxonomic distribution of those functions of interest in our samples
Basic workflow of the SQMtools package. The basic unit used in the package is the SQM object. This object can contain a full SqueezeMeta project or a subset of genes, contigs or bins. The data in the SQM object can be accessed directly (e.g. for using it with other R packages such as vegan for ordination analyses or DESeq2 for differential abundance analysis) but we also provide some utility functions for exploring the most abundant functions or taxa in a SQM object. Alternatively, aggregate tables can be loaded into a SQMlite objects, which supports plot and export functionality. SQMlite objects can not be subsetted, but can be combined.
Loading data into SQMtools
The loadSQM function can be used the output of one or more SqueezeMeta.pl runs into a single object. It also works directly with compressed zip projects generated with sqm2zip.py.
library(SQMtools)
project = loadSQM("/path/to/project/")
The code above will generate a single SQM object containing all the information relative to the project. More than one project can be loaded at the same time (e.g. loadSQM( c("/path/to/project1", "/path/to/project2") ). In this case, the call to loadSQM would return a single SQMbunch object with the combined information from all the input projects, which would otherwise behave similarly.
Alternatively, the loadSQMlite function can be used to load aggregated taxonomic and functional tables generated with sqm2tables.py or sqmreads2tables.py. The resulting SQMlite object is much more lightweight, but carries not information on individual ORFs, contigs or bins and does not support subsetting.
The SQM object structure
A SQM object contains all the relevant information from a SqueezeMeta project, organized in nested an R lists (figure below). For example, a matrix with the taxonomic composition of the different samples at the phylum level in percentages can be obtained with project$taxa$phylum$percent while a matrix with the average copy number per genome of the different PFAMs across samples can be obtained with project$functions$PFAM$copy_number.
Structure of the SQM R object. If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy. Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names). The SQMlite object will have a similar structure, but will lack the SQM$orfs, SQM$contigs and SQM$bins section. Additionally, if the results come from a sqm_reads.pl or sqm_longreads.pl run, the SQMlite object will also be missing TPM, bases and copy numbers for the different functional classification methods.
The SQMbunch object structure
A SQMbunch object contains all the relevant information for several SqueezeMeta projects. It has a similar structure to a SQM object, but it lacks the orfs, contigs and bins sections. Instead, it has a separate list named projects, which contains individual SQM objects for each of the projects that were loaded.
SQMbunch objects are meant to work with individual projects as if they were a single one, for the purpose of subsetting and visualization of aggregated taxonomic and functional profiles. This differs from working with a single co-assembly (or merged/seqmerge project) in that assemblies from different projects are not de-replicated, and instead each project contains its own set of potentially redundant contigs, ORFs and bins. For this reason, SQMbunch objects do not support subsetting based on orfs, contigs and bin names, or plotting bin abundance profiles (although it supports filtering bins based on quality and/or taxonomy, and can contain aggregated taxonomy counts based on bin taxonomies).
The SQMlite object structure
A SQMlite object contains aggregated taxonomic and functional tables. It has a similar structure to a SQM object, but it lacks the orfs, contigs and bins sections. They do not support any kind of subsetting, although the sqmreads2tables.py script can be used beforehand with a custom query to generate a set of tables covering only certain taxa or functions.
Creating subsets of your data
SQM and SQMbunch objects can be subsetted to select only certain features of interest. This can be achieved with the following functions:
- Functions working for SQM and SQMbunch objects:
subsetTax: select data from the requested taxon
subsetFun: select data from the requested function/s
subsetBins: select data from the requested bins (filtering by quality and taxonomy; direct selection by bin name will work for SQM objects only)
- Functions working for SQM objects only:
subsetSamples: select the requested samples
subsetContigs: select arbitrary contigs
subsetORFs: select arbitrary ORFs
For example, the code
project.poly = subsetTax(project, "genus", "Polynucleobacter")
would return a new SQM or SQMbunch object containing only the information from contigs that belonged to the Polynucleobacter genus, the ORFs contained in them, and the bins/MAGs that contain those contigs.
Subsetting involves the following steps:
Determining which ORFs, contigs and bins are to be included in the subsetted object
Recalculating aggregated taxonomic and functional tables based on the selected ORFs/contigs/bins
If requested, renormalize certain functional abundance metrics to make them relative to the data included in the subset
Recalculate bin abundance metrics based on the selected contigs. If requested, also recalculate bin completeness/contamination
Note
SQMlite objects can only be subsetted by sample. This means that results coming from sqm_reads.pl and sqm_longreads.pl can not be subsetted by taxa or function within SQMtools. However, a similar effect can by first filtering the results with the the --query parameter of the sqmreads2tables.py script, and then loading the resulting tables into SQMtools with loadSQMlite.
Data renormalization on subsetting
When generating a subset, the TPM and copy number of functions can be rescaled so it becomes relative to the reads included in the subset, instead of to the reads included in the original object.
For example, upon loading a project into SQMtools, the copy number of each function represents the average number of copies of that function per genome in the whole community. If we then run subsetTax to select the contigs belonging to a taxa of interest, the copy numbers in the subsetted object will have a different interpretation depending on whether we rescale or not.
If no rescaling is performed, the copy numbers in the subset will represent the average number of copies of each function in the full metagenome that were coming from the taxa of interest
If rescaling is performed, the copy numbers in the subset will represent the average number of copies of each function in the taxa of interest
For further clarification, compare the following two assertions:
“For each genome in my samples (regardless of its taxonomy) cyanobacteria contributed on average 0.2 toxin production genes”
“In my samples, each cyanobacterial genome had on average 2 toxin production genes”
In addition to this, when generating a subset the completeness and contamination of bins can be recalculated according to only the contigs present in the subset.
Finally, the taxonomy aggregate tables can be recalculated based on the abundance/taxonomies of orfs, contigs or bins (using SqueezeMeta’s or GTDB-Tk annotations if available), respectively.
The different subset functions have different default behaviour. As a rule of thumb, functions that expect to retrieve whole genomes after subsetting (subsetTax, subsetBins) will perform TPM and copy number rescaling, while functions that retreive arbitrary parts of a genome (subsetContigs, subsetORFs) will recalculate bin abundance and statistics.
The default behaviour of each subset function is listed in the table below, but it can be controlled manually through the tax_source, rescale_tpm, rescale_copy_number and recalculate_bin_stats arguments.
Method |
tax_source |
rescale_tpm |
rescale_copy_number |
recalculate_bin_stats |
|---|---|---|---|---|
subsetSamples |
N/A |
N/A |
N/A |
N/A |
subsetTax |
contigs |
TRUE |
TRUE |
FALSE |
subsetFun |
orfs |
FALSE |
FALSE |
FALSE |
subsetBins |
bins |
TRUE |
TRUE |
N/A |
subsetContigs |
contigs |
FALSE |
FALSE |
TRUE |
subsetORFs |
orfs |
FALSE |
FALSE |
TRUE |
Note
Completeness and contamination statistics are initially calculated using CheckM2, but upon subsetting they are recalculated using a re-implementation of the CheckM1 algorithm over root marker genes. This can give an idea on how adding/removing certain contigs affects the completeness of a bin, but should be considered as less reliable than manually running CheckM2 again.
Combining SQM and SQMlite objects
SQMtools also offers ways to combine several SQM and SQMlite objects into a single object:
combineSQM: this function combines several SQM objects into a single object containing all their information
If all the input objects come from the same projects (e.g. because they are different subsets of the same original dataset) this function will also return a SQM object
If the input objects come from different projects (i.e. they were generated from different SqueezeMeta.pl runs on different samples) then this function will return a SQMbunch object
combineSQMlite: this functions combines several SQM and/or SQMlite objects into a single SQMlite object
Creating plots and exporting data
SQMtools offers the utility functions for creating plots and exporting data. The following work with SQM, SQMbunch and SQMlite objects:
The following functions work for SQM and SQMbunch objects only:
The following functions work for SQM objects only:
The following functions work with arbitrary tables/matrices:
Interacting with other R packages
As described in The SQM object structure, all the relevant information is exposed as R data.frames, data.tables, matrices or vectors, and can be used directly as inputs for other R packages or custom analysis scripts. For convenience, we also provide the following functions to load the data contained in an SQM, SQMbunch or SQMlite object into other microbiome analysis libraries:
Working with bins
SQMtools offers some functions for basic bin/MAG curation
find_redundant_contigs: find redundant contigs within a bin
remove_contigs_from_bin: remove the selected contigs from a bin
create_bin: create a new bin from the selected contigs
remove_contigs_from_bin and create_bin will return a new SQM object containing the new binning information, including recalculation of bin contamination/completeness using CheckM1 markers.
Note
While this provides a fast way of removing obviously redundant contigs or creating custom bins within the SQMtools package, if you want to perform a more careful bin/MAG curation you should probably consider other tools such as anvi’o. You can always use the create_bin function to re-create the anvi’o curated bin into SQMtools.
Data normalization strategies
SQMtools normalizes the abundances of taxa, functions and bins in different ways:
- Taxonomy
Raw abundances: raw number of reads mapping to each taxon
Raw bases: raw number of bases mapping to each taxon
Percentages: percent number of reads mapping to each taxon
- Functions, ORFs, contigs (referred to as “features” in the lines below)
Raw abundances: raw number of reads mapping to each feature
Raw bases: raw number of bases mapping to each feature
Coverage: average number of reads mapped per base of each feature
Coverage per million reads: coverage value if only 1M reads were present in each sample
Transcripts per million: number of times we would retreive each feature if sampled 1M features from the population
Copy number: average number of times that each feature is present per genome in our population
- Bins
Raw abundances: raw number of reads mapping to each bin
Raw bases: raw number of bases mapping to each bin
Coverage: average number of reads mapped per base of each bin
Coverage per million reads: coverage value if only 1M reads were present in each sample
More details on this can be found on the SQMtools paper. Our intentions here are to include several common normalization methods so our users can easily pick the ones that suit them the best. At the same time, we understand that novice users can find all these options overwhelming, and would rather prefer to just have the “best” way given to them. While we don’t want to push so hard towards any particular method (since this is anyways a field that is constantly evolving) we have set the defaults of many functions in *SQMtools* to reflect what we currently believe to the the best (or at least safest) approach. Below we discuss our current view on the matter (which is subject to change as we too keep learning better ways of doing things):
Taxonomy: we recommend using percentages for visualization and basic comparisons. While you could in theory do alpha and beta-diversity analyses with this data, there are practical reasons why you should be careful when doing that, since taxonomy from shotgun metagenomic datasets is less reliable than when obtained from marker genes such as 16S rRNA (see Limitations of taxonomic assignments from shotgun metagenomics data). In any case, if you go that route please rarefy your data prior to alpha diversity analysis (and probably beta diversity too, although the jury is still out in that one). A detailed discussion of best practices in alpha and beta diversity analysis is otherwise beyond the scope of this documentation.
Functions: we recommend using copy numbers for mostly everything, when available. These are calculated by dividing the coverage of each feature by the median coverage of a set of universal single copy genes that are assumed to be present only once in all prokaryotic genomes. Thus, the copy number of a feature represents its average number of copies in the genomes of a population. This method controls by both feature lenght and sequencing depth, and since copy numbers are ratios they should be to an extent robust against the compositionality problem. By default, SQMtools will use a set of ten marker genes taken from Salazar et al., (2019). These markers in particular are assumed to be both universal and housekeeping genes, meaning that they can be used to normalize both metagenomics and metatranscriptomics data (though the interpretation of copy numbers for metatranscriptomic data varies, it being the average number of transcripts of a feature that are produced for each transcript of the housekeeping genes). Different single copy genes for normalization can be selected when loading the project into SQMtools with loadSQM.
Bins: we currently favor *cpm* for comparing bin abundances within and between samples, since it controls for genome length and sequencing depth. However, as cpm are proportions, they can be subjected to compositionality biases, particularly when the sample space is dominated by only a few high abundance bins.
Note
If planning to using the data loaded into SQMtools with another R package, check whether that second package uses raw reads as its input, since many packages (e.g. differential abundance packages) do their own normalization internally.
List of functions and detailed documentation
- CheckMProkaryote
- Hadza
- MGKOs
- MGOGs
- RecA
- SQM_to_microeco
- SQM_to_phyloseq
- USiCGs
- combineSQM
- combineSQMlite
- create_bin
- exportBins
- exportContigs
- exportKrona
- exportORFs
- exportPathway
- exportTable
- find_redundant_contigs
- loadSQM
- loadSQMlite
- mostAbundant
- mostVariable
- plotBars
- plotBins
- plotFunctions
- plotHeatmap
- plotTaxonomy
- remove_contigs_from_bin
- renameSamples
- rowMaxs
- rowMins
- seqvec2fasta
- subsetBins
- subsetContigs
- subsetFun
- subsetORFs
- subsetRand
- subsetSamples
- subsetTax
- summary.SQM
- summary.SQMbunch
- summary.SQMlite