Alternative analysis modes

Taxonomic and functional annotation of unassembled reads

sqm_reads.pl

This procedure performs taxonomic and functional assignments directly on the reads. This is useful when the assembly is not good, usually because of low sequencing depth, high diversity of the microbiome, or both. One indication that this is happening can be found in the mappingstat file. Should you find there low mapping percentages (below 50%), it means that most of your reads are not represented in the assembly and can we can try to classify the reads instead of the genes/contigs.

This script will do a DIAMOND Blastx alignment of reads agains the nr, COG and KEGG databases, and will assign taxa as functions using the lca and fun3 methods, as SqueezeMeta does. It will probably provide an increment in the number of annotations. But on the other hand, the annotations could be less precise (we are working with a smaller sequence) and you lose the capacity to map reads onto an assembly and thus comparing metagenomes using a common reference. Use this for Illumina reads. This method is less suited to analyze long MinION reads where more than one gene can be represented (see sqm_longreads.pl in that case).

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

The usage of sqm_reads.pl is very similar to that of SqueezeMeta:

sqm_reads.pl -p <project_name> -s <equiv_file> -f <raw_fastq_dir> [options]

Arguments

Mandatory parameters

[-p <string>]: Project name (REQUIRED)
[-s|-samples <path>]: Samples file, see The samples file (REQUIRED)
[-f|-seq <path>]: Fastq read files directory (REQUIRED)

Options

[-–fastnr]: Run DIAMOND in -–fast mode for taxonomic assignment
[–-nocog]: Skip COG assignment
[-–nokegg]: Skip KEGG assignment
[–-nodiamond]: Assumes that Diamond output files are already in place (for instance, if you are redoing the analysis) and skips all Diamond run
[-–euk]: Drop identity filters for eukaryotic annotation (Default: no). This is recommended for analyses in which the eukaryotic population is relevant, as it will yield more annotations. Note that, regardless of whether this option is selected or not, that result will be available as part of the aggregated taxonomy tables generated by sqmreads2tables.py and when loading the project into The SQMtools R package (see Taxonomic annotation of eukaryotic ORFs for more information), so this is only relevant if you are planning to use the intermediate files directly
[-extdb <path>]: File with a list of additional user-provided databases for functional annotations. See Using external function databases
[-t <integer>]: Number of threads (default: 12) [-b|-block-size <float>] Block size for DIAMOND against the nr database (default: calculate automatically)
[-e|-evalue <float]: Max e-value for discarding hits in the DIAMOND run (default: 1e-03)
[-miniden <float>]: Identity percentage for discarding hits in DIAMOND run (default: 50)

Output

Note

The most straightforward way to analyze the results from this script is not to use its output files directly, but rather to produce summary tables for taxonomy and function with sqmreads2tables.py and optionally load them into R using the loadSQMlite function from The SQMtools R package for further exploration. However, we list the output files here for completeness.

The script produces the following files.

<project>.out.allreads: taxonomic and functional assignments for each read. Format of the file:
- Column 1: sample name
- Column 2: read name
- Column 3: corresponding taxon
- Column 4 and beyond Functional assignments (COG, KEGG)
<project>.out.mcount: abundance of all taxa. Format of the file:
- Column 1: taxonomic rank for the taxon
- Column 2: taxon
- Column 3: accumulated read number (number of reads for that taxon in all samples)
- Column 4 and beyond: number of reads for the taxon in the corresponding sample
<project>.out.funcog: abundance of all COG functions. Format of the file:
- Column 1: COG ID
- Column 2: accumulated read number: Number of reads for that COG in all samples
- Column 3 and beyond: number of reads for the COG in the corresponding sample
- Next to last column: COG function
- Last column: COG functional class
<project>.out.funkegg: abundance of all KEGG functions. Format of the file:
- Column 1: KEGG ID
- Column 2: accumulated read number (number of reads for that KEGG in all samples)
- Column 3 and beyond (number of reads for the KEGG in the corresponding sample)
- Next to last column: KEGG function
- Last column: KEGG functional class

sqm_longreads.pl

This script works in the same way as SQM_reads.pl, that is, it attempts to produce taxonomic and functional assignments directly on the raw reads, not using an assembly. The difference is that this script assumes that more than one ORF can be found in the read. It performs Diamond Blastx searches against taxonomic and functional databases, and then identifies ORFs by collapsing the hits in the same region of the read. The --range-culling option of Diamond makes this possible, since it limits the number of hits to the same region of the sequence, making it possible to recover hits for all parts of the read.

The script assigns taxa and functions to each ORF using the lca and fun3 methods, as done by SqueezeMeta. In addition, it calculates a consensus taxonomic assignment for each read (see Consensus taxonomic annotation for contigs and bins). The taxon provided for the read is that consensus annotation.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

The usage of sqm_longreads.pl is the same than that of sqm_reads.pl:

sqm_longreads.pl -p <project_name> -s <equiv_file> -f <raw_fastq_dir> [options]

Arguments

Mandatory parameters

[-p <string>]: Project name (REQUIRED)
[-s|-samples <path>]: Samples file, see The samples file (REQUIRED)
[-f|-seq <path>]: Fastq read files directory (REQUIRED)

Options

[-–fastnr]: Run DIAMOND in -–fast mode for taxonomic assignment
[–-nocog]: Skip COG assignment
[-–nokegg]: Skip KEGG assignment
[–-nodiamond]: Assumes that Diamond output files are already in place (for instance, if you are redoing the analysis) and skips all Diamond run
[-–euk]: Drop identity filters for eukaryotic annotation (Default: no). This is recommended for analyses in which the eukaryotic population is relevant, as it will yield more annotations. Note that, regardless of whether this option is selected or not, that result will be available as part of the aggregated taxonomy tables generated by sqmreads2tables.py and when loading the project into The SQMtools R package (see Taxonomic annotation of eukaryotic ORFs for more information), so this is only relevant if you are planning to use the intermediate files directly
[-extdb <path>]: File with a list of additional user-provided databases for functional annotations. See Using external function databases
[-t <integer>]: Number of threads (default: 12)
[-b|-block-size <float>]: Block size for DIAMOND against the nr database (default: calculate automatically)
[-e|-evalue <float]: Max e-value for discarding hits in the DIAMOND run (default: 1e-03)
[-miniden <float>]: Identity percentage for discarding hits in DIAMOND run (default: 50)
[-n|-nopartialhits]: Ignore partial hits if they occur at the middle of a long read
[–force_overwrite]: Overwrite previous results

Output

Note

The most straightforward way to analyze the results from this script is not to use its output files directly, but rather to produce summary tables for taxonomy and function with sqmreads2tables.py and optionally load them into R using the loadSQMlite function from The SQMtools R package for further exploration. However, we list the output files here for completeness.

The output is similar to that of sqm_reads.pl. In addition, sqm_longreads.pl provides information about the consensus in the readconsensus.txt files placed in the output directories for each sample.

Ignoring or not partial hits

A truncated hit (one missing to find one, or both, extremes) often happens in the extremes of the long read (because the read is ending and so is the hit), but it is unexpected to find it in the middle of a long read. There you would expect to see a complete hit. Whatever the reasons for this, the hit is suspicious and can be excluded using the -n option. But beware, this probably will decrease significantly the number of detected ORFs.

Fast screening of unassembled short reads for a particular function

sqm_hmm_reads.pl

This script does functional assignment of the raw reads, using an ultra-sensitive Hidden Markov Model (HMM) search implemented in the third-party software Short-Pair (https://sourceforge.net/projects/short-pair). By using an approximate Bayesian approach employing distribution of fragment lengths and alignment scores, Short-Pair can retrieve the missing end and determine true domains for short paired-end reads (Techa-Angkoon et al., BMC Bioinformatics 18, 414, 2017). This is intended to give an answer to the question “Is my function of interest present in the metagenome?”, avoiding assembly biases where low-abundance genes may be not assembled and therefore will not be represented in the metagenome. This is also expected to be more sensitive than DIAMOND assignment of reads done by sqm_reads.pl and sqm_longreads.pl.

As HMM searches are slower than short-read alignment, it is not practical to do this for all functions. Instead, the user must specify one or several PFAM IDs and the search will be done just for these. The script will connect to the Pfam database (https://pfam.xfam.org) to download the corresponding hmm and seed files. This script can be found in the /path/to/SqueezeMeta/utils/ directory.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm_hmm_reads.pl -pfam <PFAM_list> -pair1 <pair1_fasta_file>  -pair2 <pair2_fasta_file> [options]

Arguments

Mandatory parameters

[-pfam <string>]: List of Pfam IDs to retrieve, comma-separated (eg: -pfam PF00069,PF00070) (REQUIRED)
[-pair1 <path>]: Fasta file for pair 1 (REQUIRED)
[-pair2 <path>]: Fasta file for pair 2 (REQUIRED)

Note

Note that -pair1 and -pair2 must be uncompressed fasta files

Options

[-t <int>]: Number of threads (default: 12)
[-output <string>]: Name of the output file (default: SQM_pfam.out)

Output

The output file follows the Short-Pair output format:

First column: read name (.1 for first pair, .2 for second pair)
Second column: Pfam domain family
Third column: alignment score
Fourth column: e-value
Fifth column: start position of alignment on the pfam domain model
Sixth column: end position of alignment on the pfam domain model
Seventh column: start position of alignment on the read
Eighth column: end position of alignment on the read
Ninth column: strand (+ for forward, - for reverse)

Mapping reads to a reference

sqm_mapper.pl

This script maps reads to a given reference using one of the sequence aligners included in SqueezeMeta, and provides estimation of the abundance of the contigs and ORFs in the reference. In addition to the reads and reference files, it also needs a gff file specifying the positions of the genes in the contigs. It works in the same way than the mapping step of the main pipeline, and provides values for the coverage, TPM and RPKM of genes and contigs.

A file including functional annotations for the genes can also be given. If so, the script will provide abundance estimations for functions as well.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm_mapper.pl -r <reference> -s <sample_file> -f <reads_directory>  -g <gff_file> -o <output_directory> [options]

Arguments

Mandatory parameters

[-r <path>]: Reference sequence, the one reads will be mapped to. This can be a fasta file containing contigs, or even a single sequence coming from a complete genome (REQUIRED)
[-s <path>]: Samples file, see The samples file (REQUIRED)
[-f <path>]: Fastq read files directory (REQUIRED)
[-g <path>]: GFF file specifying the genomic features in the reference. This can be downloaded for genomes, or created using a gene predictor (REQUIRED unless --filter is also passed). See GFF file format below to know about the proper definition of this file
[-o <string>]: Output directory for storing results (REQUIRED)

Options

[-t <int>]

Number of threads (default: 12)

[-m <bowtie|bwa>]

Aligner to use (default: bowtie)

[–filter]

Use to remove reads mapping to a reference genome

[-n|-name <str>]

Prefix name for the results (default: sqm)

[-fun <path>]

File containing functional annotations for the genes in the reference This is a two-column file. First column indicate the name of the gene, Second column corresponds to the function (or gene name). For instance:

gene1    COG0735
gene2    recA

Output

The script will produce:

A mappingstat file (see Step 10: Mapping of reads to contigs and calculation of abundance measures) , indicating the number of reads and percentage of alignment
A contigcov file, (see Step 10: Mapping of reads to contigs and calculation of abundance measures), with the abundance measures for each of the contigs in the reference
A mapcount file (see Step 10: Mapping of reads to contigs and calculation of abundance measures), with the abundance measures for each ORF in the gff file corresponding to the reference
If a functional file was specified with the -fun option, it will also produce a mapcount.fun file, with the abundance measures for each of the functions.

GFF file format

The gff file (tab separated), should contain a tag ID in its ninth field, with the id being the contigname and, separated by _, the initial and final positions of the gene (separated by -), and a final semicolon. Something like:

ID=contig1_1-580;

an example of a full line in the GFF file would be:

contig1 samplename CDS 1 580 . + 1 ID=contig1_1-580;

Functional and taxonomic annotation of genes and genomes

sqm_annot.pl

This script performs functional and taxonomic annotation for a set of genes or genomes. Genomes must be nucleotide sequences, while gene sequences can be either nucleotides or amino acids. All sequence files must be in fasta format.

For a genome, the script will call SqueezeMeta to predict RNAs and CDS, and then proceeds to run Diamond searches against the usual taxonomic (GenBank nr) and functional (COGs and KEGG) databases and annotate the genes according the same procedures used in the main SqueezeMeta pipeline (LCA for taxa, best hit/best average for functions. Please refer to Explanation of SqueezeMeta algorithms for details). For gene sequences, it is assumed that each sequence corresponds to an already identified ORF, and then RNA and CDS prediction is skipped. Diamond searches are automatically set to “blastp” for amino acids, and “blastx” for nucleotides.

The scripts needs a sample file following the format:

`<sample name> <fasta file name> <genome|aa|nt>`

The first field corresponds to the project name. The script will create a project directory with that name, where all results will be placed. The second field is the name of the genome, amino acid or nucleotide fasta file containing the sequences. And the third field specifies the type of data: genome, aa or nt. As explained above, genome will trigger gene prediction and run Diamond blastp on the predicted peptides, aa will run Diamond blastp for the provided sequences, and nt will run Diamond blastx for the provided sequences.

This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.

Usage

sqm_annot.pl -s <samples_file> -f <sequence_file_directory> [options]

Arguments

Mandatory parameters

[-f <path>]: Directory in which the sequence files specified in the samples file are located (REQUIRED). The sequence files MUST be in FASTA format.
[-s <path>]: Samples file (REQUIRED)

Options

[-t]: Number of threads (default: 12)
[–notax]: Skips taxonomic annotation
[–nocog]: Skips COGs annotation
[–nokegg]: Skips KEGG annotation
[-extdb <path>]: File with a list of additional user-provided databases for functional annotations. See Using external function databases
[-b <float>]: Block size for DIAMOND against the nr database. Lower values reduce RAM memory usage. Set it to 3 or below for running in a desktop computer (default: 8)

Output

This scripts takes advantage of the standard SqueezeMeta machinery, therefore the output files are these obtained in steps 6 and 7 of the pipeline:

Files coming from Step 6: Taxonomic assignment
- 06.<project>.fun3.tax.wranks: taxonomic assignments for each ORF, including taxonomic ranks
- 06.<project>.fun3.tax.noidfilter.wranks: same as above, but assignment was done not considering identity filters (refer to the explanation of The LCA algorithm)
Files coming from Step 7: Functional assignment
- 07.<sample>.fun3.cog: COG functional assignment for each ORF
- 07.<sample>.fun3.kegg: KEGG functional assignment for each ORF
Summary files
- COG.summary: Counts and functions for each COG
- KEGG.summary: Counts and functions for each KEGG
- Format of these files:
  
  Column 1: COG/KEGG ID
  
  Column 2: Abundance (number of assignments)
  
  Column 3: Name of the gene
  
  Column 4: Function of the gene
  
  Column 5: Functional class or pathway
  
  Column 6: ORFs belonging to current COG/KEGG