Utility scripts

Compressing a SqueezeMeta project into a zip file

sqm2zip.py

This script generates a compressed zip file with all the essential information needed to load a SqueezeMeta project into The SQMtools R package. If the directory /path/to/project/results/tables is not present, it will also run sqm2tables.py to generate the required tables (see below).

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm2zip.py <project_path> <output_dir> [options]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run
[output_dir] <path>: Output directory

Options

[–trusted-functions]: Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables. This will be ignored if the /path/to/project/results/tables directory already exists
[–ignore-unclassified]: Ignore reads without assigned functions for TPM calculation (KO, COG, PFAM). This will be ignored if the /path/to/project/results/tables directory already exists
[–force-overwrite]: Write results even if the output file already exists
[–doc]: Print the documentation

Output

A zip file named <project_name>.zip in the directory specified by output_dir. This file can be loaded directly into The SQMtools R package using the loadSQM function.

Generating summary tables

sqm2tables.py

This script generates tabular outputs from a SqueezeMeta run. It will aggregate the abundances of the ORFs assigned to the same feature (be it a given taxon or a given function) and produce tables with features in rows and samples in columns. Note that if you want to create tables coming from a sqm_reads.pl or sqm_longreads.pl run you will need to use the sqmreads2tables.py script instead.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Note

This script is now run automatically with default parameters at the end of a SqueezeMeta run, placing the results in the /path/to/project/results/tables directory. You may still want to run it on your own if you want to use non-default parameters.

Usage

sqm2tables.py <project_path> <output_dir> [options]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run
[output_dir <path>]: Output directory

Options

[–trusted-functions]: Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables
[–ignore-unclassified]: Ignore reads without assigned functions for TPM calculation
[–force-overwrite]: Write results even if the output file already exists
[–doc]: Print the documentation

Output

<project_name>.orfs.sequences.tsv: ORF sequences
<project_name>.orfs.sequences.tsv: contig sequences
<project_name>.orf.tax.allfilter.tsv: taxonomy of each ORF at the different taxonomic levels. Minimum identity cutoffs for taxonomic assignment are applied to all taxa
<project_name>.orf.tax.prokfilter.tsv: taxonomy of each ORF at the different taxonomic levels. Minimum identity cutoffs for taxonomic assignment are applied to bacteria and archaea, but not to eukaryotes
<project_name>.orf.tax.nofilter.tsv: taxonomy of each ORF at the different taxonomic levels. No identity cutoffs for taxonomic assignment are applied
<project_name>.orf.marker.genes.tsv: CheckM1 marker genes present in each ORF
<project_name>.orf.16S.tsv: RDP taxonomy of the ORFs containing a 16S rRNA gene according to barrnap
<project_name>.contig.tax.allfilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs (applying minimum identity cutoffs to all taxa)
<project_name>.contig.tax.prokfilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs. Minimum identity cutoffs for taxonomic assignment are applied to bacteria and archaea, but not to eukaryotes)
<project_name>.contig.tax.nofilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs. No identity cutoffs for taxonomic assignment are applied
<project_name>.bin.tax.tsv: consensus taxonomy of each bin at the different taxonomic levels, based on the taxonomy of their constituent contigs

Note

See a deeper discussion on the use of identity cutoffs in taxonomic annotation here.

<project_name>.RecA.tsv: coverage of RecA (COG0468) in the different samples.
For each taxonomic rank (superkingdom, phylum, class, order, family, genus, species) the script will produce the following files:
- <project_name>.<rank>.allfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples, applying the identity cutoffs for taxonomic assignment
- <project_name>.<rank>.prokfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to Eukaryotes
- <project_name>.<rank>.nofilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to Eukaryotes
For each functional classification system (KO, COG, PFAM, and any external database provided by the user) the script will produce the following files:
- <project_name>.<classification>.names.tsv: extended description of the functional categories in that classification system. For KO and COG the file will contain three fields: ID, Name and Path within the functional hierarchy. For external databases, it will contain only ID and Name.
- <project_name>.<classification>.abunds.tsv: raw read counts of each functional category in the different samples
- <project_name>.<classification>.bases.tsv: raw base counts of each functional category in the different samples
- <project_name>.<classification>.copyNumber.tsv: average copy numbers per genome of each functional category in the different samples. Copy numbers are obtained by dividing the aggregate coverage of each function in each sample by the coverage of RecA (COG0468) in each sample.
- <project_name>.<classification>.tpm.tsv: normalized (TPM) abundances of each functional category in the different samples. This normalization takes into account both sequencing depth and gene length

Note

The --ignore_unclassified flag can be used to control whether unclassified ORFs are counted towards the total for TPM normalization

Note

There are more advanced ways of calculating copy numbers than normalizing by RecA coverage. These can be accessed through The SQMtools R package

sqmreads2tables.py

This script generates tabular outputs from a sqm_reads.pl or sqm_longreads.pl run. It will aggregate the abundances of the ORFs assigned to the same feature (be it a given taxon or a given function) and produce tables with features in rows and samples in columns. It can optionally accept a query argument to generate tables containing only certain taxa and functions.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqmreads2tables.py <project_path> <output_dir> [options]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run
[output_dir <path>]: Output directory

Options

[-q/—query <string>]: Filter the results based on the provided query in order to create tables containing only certain taxa or functions. See Query syntax
[–trusted-functions]: Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables
[–force-overwrite]: Write results even if the output file already exists
[–doc]: Print the documentation

Output

For each taxonomic rank (superkingdom, phylum, class, order, family, genus, species) the script will produce the following files:
- <project_name>.<rank>.allfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples, applying the identity cutoffs for taxonomic assignment
- <project_name>.<rank>.prokfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to Eukaryotes. See Taxonomic annotation of eukaryotic ORFs
- <project_name>.<rank>.nofilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to Eukaryotes
For each functional classification system (KO, COG, PFAM, and any external database provided by the user) the script will produce the following files:
- <project_name>.<classification>.abunds.tsv: raw abundances of each functional category in the different samples
- <project_name>.<classification>.names.tsv: extended description of the functions in that classification system. For KO and COG the file will contain three fields: ID, Name and Path within the functional hierarchy. For external databases, it will contain only ID and Name

Query syntax

Note

This syntax is used by two different scripts: - sqmreads2tables.py script, in order to filter reads annotated with sqm_reads_pl or sqm_longreads.pl - anvi-filter-sqm.py script, in order to filter an anvi’o database obtained after running anvi-load-sqm.py on a SqueezeMeta project

Please enclose query strings within double brackets.
Queries are combinations of relational operations in the form of <SUBJECT> <OPERATOR> <VALUE> (e.g. "PHYLUM == Bacteroidota") joined by logical operators (AND, OR).
Parentheses can be used to group operations together.
The AND and OR logical operators can’t appear together in the same expression. Parentheses must be used to separate them into different expressions. e.g:
- "GENUS == Escherichia OR GENUS == Prevotella AND FUN CONTAINS iron" would not be valid. Parentheses must be used to write either of the following expressions:
  
  "(GENUS == Escherichia OR GENUS == Prevotella)" AND FUN CONTAINS iron" to select features from either Escherichia or Prevotella which contain ORFs related to iron
  
  "GENUS == Escherichia OR (GENUS == Prevotella AND FUN CONTAINS iron)" to select all features from Escherichia and any feature from Prevotella which contains ORFs related to iron
Another example query would be: "(PHYLUM == Bacteroidota OR CLASS IN [Alphaproteobacteria, Gammaproteobacteria]) AND FUN CONTAINS iron AND Sample1 > 1"
- This would select all the features assigned to either the Bacteroidota phylum or the Alphaproteobacteria or Gammaproteobacteria classes, that also contain the substring "iron" in the functional annotations of any of their ORFs, and whose abundance in Sample1 is higher than 1
Possible subjects are:
- FUN: search within all the combined databases used for functional annotation
- FUNH: search within the KEGG BRITE and COG functional hierarchies (e.g. "FUNH CONTAINS Carbohydrate metabolism" will select all the feature containing a gene associated with the broad "Carbohydrate metabolism" category)
- SUPERKINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES: search within the taxonomic annotation at the requested taxonomic rank
- <SAMPLE_NAME> (for anvi-filter-sqm.py only): search within the anvi’o abundances (mean coverage of a split divided by the overall sample mean coverage) in the sample named <SAMPLE_NAME> (e.g. if you have two samples named Sample1 and Sample2, the query string Sample1 > 0.5 AND Sample2 > 0.5 would return the splits with an anvi’o abundance higher than 0.5 in both samples)
Posible relational operators are ==,, !=, >=, <=, >, <, IN, NOT IN, CONTAINS, DOES NOT CONTAIN

combine-sqm-tables.py

Combine tabular outputs from different projects generated either with SqueezeMeta or sqm_(long)reads (but not both at the same time). If the directory /path/to/project/results/tables is not present, it will also run sqm2tables.py or sqmreads2tables.py to generate the required tables.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Note

The recommended way of doing is is now using The SQMtools R package - The loadSQM function accepts an arbitrary number of SqueezeMeta projects, loading them into a single SQM object - The combineSQMlite fucntion can be used to combine previously loaded SqueezeMeta and sqm_(long)reads projects into a single object. An advantage of this over combine-sqm-tables.py is that it can be used to combine projects coming from both SqueezeMeta and sqm_(long)reads at the same time.

Usage

combine-sqm-tables.py <project_paths> [options]

Arguments

Mandatory parameters (positional)

[project_paths <paths>]: A space-separated list of paths

Options

[-f|–paths-file <path>]: File containing the paths of the SqueezeMeta projects to combine, one path per line
[-o|–output-dir <path>]: Output directory (default: "combined")
[-p|–output-prefix]: Prefix for the output files (default: "combined")
[–trusted-functions]: Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables. This will be ignored if the /path/to/project/results/tables directory already exists
[–ignore-unclassified]: Ignore reads without assigned functions for TPM calculation. This will be ignored if the /path/to/project/results/tables directory already exists or if --sqmreads is passed
[–sqmreads]: Projects were generated using sqm_reads.pl or sqm_longreads.pl
[–force-overwrite]: Write results even if the output directory already exists
[–doc]: Print the documentation

Example calls

Combine projects /path/to/proj1 and /path/to/proj2 and store output in a directory named "outputDir"
- combine-sqm-tables.py /path/to/proj1 /path/to/proj2 -o output_dir
Combine a list of projects contained in a file, use default output dir
- combine-sqm-tables.py -f project_list.txt

Output

Tables containing aggregated counts and feature names for the different functional hierarchies and taxonomic levels for each sample contained in the different projects that were combined. Tables with the TPM and copy number of functions will also be generated for SqueezeMeta runs, but not for sqm_(long)reads runs.

Estimation of the sequencing depth needed for a project

cover.pl

COVER intends to help in the experimental design of metagenomics by addressing the unavoidable question: How much should I sequence to get good results? Or the other way around: I can spend this much money, would it be worth to use it in sequencing the metagenome?

To answer these questions, COVER allows the estimation of the amount of sequencing needed to achieve a particular objective, being this the coverage attained for the most abundant N members of the microbiome. For instance, how much sequence is needed to reach 5x coverage for the four most abundant members (from now on, OTUs). COVER was first published in 2012 (Tamames et al., 2012, Environ Microbiol Rep. 4:335-41), but we are using a different version of the algorithm described there. Details on this implementation can be found in The COVER algorithm.

COVER needs information on the composition of the microbiome, and that must be provided as a file containing 16S rRNA sequences obtained by amplicon sequencing of the target microbiome. If you don’t have that, you can look for a similar sample already sequenced (for instance, in NCBI’s SRA).

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

cover.pl -i <input_file> [options]

Arguments

Mandatory parameters

[-i <path>]: FASTA file containing 16S rRNA amplicons

Options

[-idcluster <float>]: Identity threshold for collapsing OTUs (default: 0.98)
[-c|-coverage <float>]: Target coverage (default: 5)
[-r|-rank <integer>]: Rank of target OTU (default: 4)

Note

Default values imply looking for 5x coverage for the 4th most abundant 98% OTU

[-cl|-classifier <mothur|rdp>]: Classifier to use (RDP or Mothur) (default: mothur)
[-d|-dir]: Output directory (default: cover)
[-t]: Number of threads (default: 4)

(Default values imply looking for 5 x coverage for the 4 th most abundant OTU)

Output

The output is a table that first lists the amount of sequencing needed, both uncorrected and corrected by the Good’s estimator:

Needed 4775627706 bases, uncorrected
Correcting by unobserved: 6693800053 bases

And then lists the information and coverages for each OTU, with the following columns:

OTU: Name of the OTU
Size: Inferred genomic size of the OTU
Raw abundance: Number of sequences in the OTU
Copy number: Inferred 16S rRNA copy number
Corrected abundance: Abundance n / Σn Abundance
Pi : Probability of sequencing a base of this OTU
%Genome sequenced: Percentage of the genome that will be sequenced for that OTU
Coverage: Coverage that will be obtained for that OTU
Taxon: Deepest taxonomic annotation for the OTU

Adding new databases to an existing project

add_database.pl

This script adds one or several new databases to the results of an existing project. The list of databases must be provided in an external database file as specified in Using external databases for functional annotation. It must be a tab-delimited file with the following format:

<Database Name>      <Path to database>      <Functional annotation file>

The databases to add must also be formatted in DIAMOND format. See Using external databases for functional annotation for details. If the external database file already exists (because you already used some external databases when running SqueezeMeta), DO NOT create a new one. Instead add the new entries to the existing database file.

The script will run Diamond searches for the new databases, and then will re-run several SqueezeMeta scripts to include the new database(s) to the existing results. The following scripts will be invoked:

The outputs of these programs will be regenerated (but all files corresponding to other databases will remain untouched).

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

add_database.pl <project_path> <database_file>

Integration with external tools

Integration with itol

sqm2itol.pl

This script generates the files for creating a radial plot of abundances using iTOL (https://itol.embl.de/). It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm2itol.pl <project_path> [options]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run

Options

[-completion <float>]

Select only bins with percent completion above that threshold (default: 30)

[-contamination <float>]

Select only bins with percent contamination below that threshold (default: 100)

[-classification <metacyc|kegg>]

Functional classification to use (default: metacyc)

[-functions <path>]

File containing the name of the functions to be considered (for functional plots). For example:

arabinose degradation
galactose degradation
glucose degradation

Output

The script will generate several datafiles that you must upload to https://itol.embl.de/ to produce the figure.

Integration with ipath

sqm2ipath.pl

This script creates data on the existence of enzymatic reactions that can be plotted in the interactive pathway mapper iPath (http://pathways.embl.de). It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm2ipath.pl <project_path> [options]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run

Options

[-taxon <string>]

Taxon to be plotted (default: plot all taxa)

[-color <string>]

RGB color to be used in the plot (default: red)

[-c|classification <cog|kegg>]

Functional classification to use (default: kegg)

[-functions <file>]

File containing the COG/KEGG identifiers of the functions to be considered. For example:

K00036
K00038
K00040
K00052  #ff0000
K00053

A second argument following the identifier selects the RGB color to be associated to that ID in the plot

The plotting colors can be specified by the -color option, or by associating values to each of the IDs in the functions file. In that case, several colors can be used in the same plot. If no color is specified, the default is red.

[-o|out <path>]

Name of the output file (default: ipath.out)

Output

A file suitable to be uploaded to http://pathways.embl.de. Several output files can be combined, for instance using different colors for different taxa.

Integration with pavian

sqm2pavian.pl

This script produces output files containing abundance of taxa that can be plotted using the Pavian tool (https://github.com/fbreitwieser/pavian). It works with projects generated with SqueezeMeta.pl, sqm_reads.pl or sqm_longreads.pl. It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm2pavian.pl <project_path> [mode]

Arguments

Mandatory parameters (positional)

[project_path <path>]: Path to the SqueezeMeta run

Options (positional)

[mode <reads|bases>]: Count abundances in reads or bases (default: reads)

Output

A file named <project>.pavian that can be uploaded in the pavian app (https://fbreitwieser.shinyapps.io/pavian) or in the pavian R package.

Integration with anvi`o

anvi-load-sqm.py

This script creates an anvi’o database from a SqueezeMeta project. The database can then be filtered and visually explored using the anvi-filter-sqm.py script. This script can be found in the /path/to/SqueezeMeta/utils/anvio_utils directory, but if using conda it will be present in your PATH. For this script to work, anvi’o must be installed and present in your PATH.

Note

This has been tested using anvi’o versions 6 and 7. Support is only for released versions, the master/develop branches of anvi’o might (and will likely) not work.

Usage

anvi-load-sqm.py -p <project> -o <output> [options]

Arguments

Mandatory parameters

[-p|-project <path>]: Path to the SqueezeMeta run
[-o|–output <path>]: Output directory

Options

[–num-threads <int>]: Number of threads (default: 12)
[–run-HMMS]: Run the anvi-run-hmms command from anvi’o for identifying single-copy core genes
[–run-scg-taxonomy]: Run the anvi-run-scg-taxonomy command from anvi’o for assigning taxonomy based on single-copy core genes
[–min-contigs-length <int>]: Minimum length of contigs (default: 0)
[–min-mean-coverage <float>]: Minimum mean coverage for contigs (default: 0)
[–skip-SNV-profiling]: Skip the profiling of single nucleotide variants
[–profile-SCVs]: Perform characterization of codon frequencies in genes
[–force-overwrite]: Force overwrite if the output directory already exists
[–doc]: Print the documentation

Output

CONTIGS.db, PROFILE.db, AUXILIARY-DATA.db: anvi’o databases
<project_name>_anvio_contig_taxonomy.txt: contig taxonomy to be loaded by anvi.filter-sqm.py

anvi-filter-sqm.py

This script filters the results of a SqueezeMeta project (previously loaded into to an anvi’o database by the anvi-load-sqm.py script) and opens an anvi’o interactive interface to examine them. Filtering criteria can be specified by using a simple query syntax. This script can be found in the /path/to/SqueezeMeta/utils/anvio_utils directory, but if using conda it will be present in your PATH. For this script to work, anvi’o must be installed and present in your PATH.

Note

This has been tested using anvi’o versions 6 and 7. Support is only for released versions, the master/develop branches of anvi’o might (and will likely) not work.

Usage

anvi-filter-sqm.py -p <profile_db> -c <contigs_db> -t <contigs_taxonomy_file> -q <query> [options]

Arguments

Mandatory parameters

[-p|–profile-db <path>]: anvi’o profile db, as generated by anvi-load-sqm.py
[-c|–contigs-db <path>]: anvi’o contigs db, as generated by anvi-load-sqm.py
[-t|–taxonomy <path>]: Contigs taxonomy, as generated by anvi-load-sqm.py
[-q/—query <string>]: Filter the results based on the provided query in order to visualize only certain taxa or functions at certain abundances. See Query syntax

Options

[-o/–output_dir <path>]: Output directory for the filtered anvi’o databases (default: filteredDB)
[-m|–max-splits <int>]: Maximum number of splits to be loaded into anvi’o. If the provided query returns a higher number of splits, the program will stop. By default it is set to 25,000, larger values may make the anvi’o interface to respond slowly. Setting --max-splits to 0 will allow an arbitrarily large number of splits to be loaded
[–enforce-clustering]: Make anvi’o perform an additional clustering based on abundances across samples and sequence composition
[–extra-anvio-args]: Extra arguments for anvi-interactive, surrounded by quotes (e.g. --extra-anvio-args "--taxonomic-level t_phylum --title Parrot"
[-s <yolo|safe>]: By default, the script uses an in-house method to subset the anvi’o databases. It’s ~5x quicker than using anvi-split in anvi’o5, and works well for us. However, the night is dark and full of bugs, so if you feel that your anvi’o view is missing some information, you can call the script with -s safe parameter. This will call anvi-split which should be much safer than our hacky solution (default: yolo)
[–doc]: Print the documentation

Output

The script will produce a subsetted anvi’o database, and call anvi-interactive to open a browser visualization.

Utility scripts

Compressing a SqueezeMeta project into a zip file

sqm2zip.py

Usage

Arguments

Mandatory parameters (positional)

Options

Output

Generating summary tables

sqm2tables.py

Usage

Arguments

Mandatory parameters (positional)

Options

Output

sqmreads2tables.py

Usage

Arguments

Mandatory parameters (positional)

Options

Output

Query syntax

combine-sqm-tables.py

Usage

Arguments

Mandatory parameters (positional)

Options

Example calls

Output

Estimation of the sequencing depth needed for a project

cover.pl

Usage

Arguments

Mandatory parameters

Options

Output

Adding new databases to an existing project

add_database.pl

Usage

Integration with external tools

Integration with itol

sqm2itol.pl

Usage

Arguments

Mandatory parameters (positional)

Options

Output

Integration with ipath

sqm2ipath.pl

Usage

Arguments

Mandatory parameters (positional)

Options

Output

Integration with pavian

sqm2pavian.pl

Usage

Arguments

Mandatory parameters (positional)

Options (positional)

Output

Integration with anvi`o

anvi-load-sqm.py

Usage

Arguments

Mandatory parameters

Options

Output

anvi-filter-sqm.py

Usage

Arguments

Mandatory parameters

Options

Output

Binning refinement

remove_duplicate_markers.pl

Usage

Output

find_missing_markers.pl

Usage