Execution, restart and running scripts

Scripts location

The scripts composing the SqueezeMeta pipeline can be found in the /path/to/SqueezeMeta/scripts directory. Other utility scripts can be found in the /path/to/SqueezeMeta/utils directory. See Utility scripts for more information.

Execution

The command for running SqueezeMeta has the following syntax:

SqueezeMeta.pl -m <mode> -p <project_name> -s <samples_file> -f <raw_fastq_dir> <options>

Arguments

Basic parameters

[-m <sequential|coassembly|merged|seqmerge|extassembly|extbins>]: Mode: See Choosing an assembly strategy (REQUIRED)
[-r|-reference <path>]: Path to a fasta file with contigs (if -m extassembly) or to a directory containing external genomes/bins (one fasta file per genome/bin, if -m extbins) (REQUIRED, if -m extassembly or -m extbins)
[-s|-samples <path>]: Samples file, see The samples file (REQUIRED)
[-f|-seq <path>]: Fastq read files directory (REQUIRED)

Output

[-p <path>]: Output path, the basename will be used as the project name (default: SQM)

Restarting

[--restart]: Restarts the given project where it stopped (the project must be specified with the -p option). Previous results will NOT be overwritten unless --force_overwrite is also provided
[-step <int>]: In combination with --restart, restarts the project starting at the given step number (combine with --force_overwrite to regenerate results)
[--force_overwrite]: Do not check for previous results, and overwrite existing ones

Filtering

[--cleaning]: Preprocesses the input reads prior to entering the pipeline
[-cleaning_method <trimmomatic|fastp>]: Preprocessing method (default: trimmomatic)
[-cleaning_options <string>]: Extra options for the preprocessing software (default for trimmomatic: "LEADING:8 TRAILING:8 SLIDINGWINDOW:10:15 MINLEN:30", default for fastp: none). Please provide all options as a single quoted string

Assembly

[-a <megahit|spades|rnaspades|spades-base|canu|flye>]: Assembler (default: megahit)
[-assembly_options <string>]: Extra options for the assembler (refer to the manual of the specific assembler). Please provide all the extra options as a single quoted string (e.g. -assembly_options "--opt1 foo --opt2 bar")
[-c|-contiglen <int>]: Minimum length of contigs (default: 200)
[--sq|--singletons]: Unassembled reads will be treated as contigs and included in the contig fasta file resulting from the assembly. This will produce 100% mapping percentages, and will increase BY A LOT the number of contigs to process. Use with caution
[-contigid <string>]: Prefix id for contigs (default: assembler name)
[--norename]: Don’t rename contigs (Use at your own risk, characters like - in contig names may make the pipeline crash)

Annotation

[-g <int>]: Number of targets for DIAMOND global ranking during taxonomic assignment (default: 100)
[-db <path>]: Specifies the location of a new taxonomy database (in DIAMOND format, .dmnd). See Using a user-supplied database for taxonomic annotation
[--nocog]: Skip COG assignment
[--nokegg]: Skip KEGG assignment
[--nopfam]: Skip Pfam assignment
[--fastnr]: Run DIAMOND in --fast mode for taxonomic assignment
[--fasternr]: Run DIAMOND in --faster mode for taxonomic assignment
[--euk]: Drop identity filters for eukaryotic annotation (Default: no). This is recommended for analyses in which the eukaryotic population is relevant, as it will yield more annotations. Note that, regardless of whether this option is selected or not, that result will be available as part of the aggregated taxonomy tables generated at the last step of the pipeline and also when loading the project into The SQMtools R package (see Taxonomic annotation of eukaryotic ORFs for more information), so this is only relevant if you are planning to use the intermediate files directly
[-consensus <float>]: Minimum percentage of genes assigned to a taxon in order to assign it as the consensus taxonomy for that contig (default: 50)
[-extdb <path>]: File with a list of additional user-provided databases for functional annotation. See Using external function databases
[-D|--doublepass]: Run BlastX ORF prediction in addition to Prodigal. See Extra-sensitive detection of ORFs
[-diamond_nr_options <string>]: Extra options to be passed when calling DIAMOND against the nr database. Please provide all the extra options as a single quoted string (e.g. -diamond_nr_options "--opt1 foo --opt2 bar")

Mapping

[-map <bowtie|bwa|minimap2-ont|minimap2-pb|minimap2-sr>]: Read mapper (default: bowtie)
[-mapping_options <string>]: Extra options for the mapper (refer to the manual of the specific mapper). Please provide all the extra options as a single quoted string (e.g. -mapping_options "--opt1 foo --opt2 bar")

Binning

[-binners <string>]: Comma-separated list with the binning programs to be used (available: maxbin, metabat2, concoct) (default: concoct,metabat2)
[--nobins]: Skip all binning (Default: no). Overrides -binners
[--onlybins]: Run only assembly, binning and bin statistics (including GTDB-Tk if requested)
[--nomarkers]: Skip retrieval of universal marker genes from bins. Note that, while this precludes recalculation of bin completeness/contamination in SQMtools for bin refining, you will still get completeness/contamination estimates of the original bins obtained in SqueezeMeta
[--gtdbtk]: Run GTDB-Tk to classify the bins. Requires a working GTDB-Tk installation available in your environment
[-gtdbtk_data_path <path>]: Path to the GTDB database, by default it is assumed to be present in /path/to/SqueezeMeta/db/gtdb. Note that the GTDB database is NOT included in the SqueezeMeta databases, and must be obtained separately

Performance

[-t <integer>]: Number of threads (default: 12)
[-b|-block-size <float>]: Block size for DIAMOND against the nr database (default: calculate automatically)
[-canumem <float>]: Memory for Canu in Gb (default: 32)
[--lowmem]: Attempt to run on less than 16 Gb of RAM. Equivalent to: -b 3 -canumem 15. Note that assembly may still fail due to lack of memory

Other

[--minion]: Run on MinION reads. Equivalent to -a canu -map minimap2-ont. If canu is not working for you, consider using -a flye -map minimap2-ont instead
[-test <integer>]: For testing purposes, stops AFTER the given step number
[--empty]: Create an empty directory structure and configuration files WITHOUT actually running the pipeline

Information

[-v]: Display version number
[-h]: Display help

Deprecated options

[-extassembly <path>]: External assembly, path to a fasta file with contigs (overrides the assembly step). This still works, but we recommend using -m extassembly -reference <file> instead
[-extbins <path>]: Path to a directory containing external genomes/bins (one fasta file per genome/bin, overrides the assembly and binning steps). This still works, but we recommend using -m extbins -reference <directory> instead
[-taxbinmode <s|c|s+c|c+s>]: Source of taxonomy annotation of bins. This has been deprecated, and SqueezeMeta will always use its own taxonomy (equivalent to -taxbinmode s in older versions) regardless of the value of this argument. You can add the flag --gtdbtk if you need a more precise bin taxonomy in addition to the one provided by default

Example SqueezeMeta call

SqueezeMeta.pl -m coassembly -p test -s test.samples -f mydir --nopfam -miniden 50

This will create a project “test” for co-assembling the samples specified in the file “test.samples”, using a minimum identity of 50% for taxonomic and functional assignment, and skipping Pfam annotation. The -p parameter indicates the name under which all results and data files will be saved. It is not required in sequential mode, where the name is taken from the samples file instead. The -f parameter indicates the directory where the read files listed in the samples file are stored.
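When calls like the one above are generated from a script, it can help to assemble them as an argument list, so that option strings such as -assembly_options stay a single quoted argument. The following is a minimal sketch in Python; the helper name build_command is hypothetical and not part of SqueezeMeta, it only uses options documented in this section, and it builds the command without running it.

```python
import shlex

# Modes listed under "Basic parameters".
MODES = {"sequential", "coassembly", "merged", "seqmerge", "extassembly", "extbins"}


def build_command(mode, samples, fastq_dir, project=None, reference=None, extra=()):
    """Return a SqueezeMeta.pl call as an argument list; nothing is executed."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("extassembly", "extbins") and reference is None:
        raise ValueError(f"-m {mode} requires a reference path (-r)")
    cmd = ["SqueezeMeta.pl", "-m", mode, "-s", samples, "-f", fastq_dir]
    if project is not None:
        cmd += ["-p", project]
    if reference is not None:
        cmd += ["-r", reference]
    cmd += list(extra)
    return cmd


cmd = build_command(
    "coassembly", "test.samples", "mydir", project="test",
    # The option string is one list element, so it is passed as one argument.
    # "--opt1 foo --opt2 bar" is the placeholder used in the option list above.
    extra=["--nopfam", "-assembly_options", "--opt1 foo --opt2 bar"],
)
# shlex.join shows how the call would be typed in a shell, with quoting added.
print(shlex.join(cmd))
```

Passing the resulting list to something like subprocess.run would launch the pipeline, assuming SqueezeMeta.pl is on the PATH.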

The samples file

The samples file specifies the samples, the names of their corresponding raw read files and the sequencing pair represented in those files, separated by tabs.

It has the format: <Sample> <filename> <pair1|pair2>

An example would be

Sample1 readfileA_1.fastq   pair1
Sample1 readfileA_2.fastq   pair2
Sample1 readfileB_1.fastq   pair1
Sample1 readfileB_2.fastq   pair2
Sample2 readfileC_1.fastq.gz    pair1
Sample2 readfileC_2.fastq.gz    pair2
Sample3 readfileD_1.fastq   pair1   noassembly
Sample3 readfileD_2.fastq   pair2   noassembly
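A file in this format can be sanity-checked before launching a run. The following is a minimal sketch in Python, assuming tab-separated columns and the optional noassembly and nobinning values in the columns after the third. The function name parse_samples and its error messages are illustrative, not part of SqueezeMeta, and the checks may be stricter or looser than those the pipeline itself applies.

```python
import csv
import io

PAIRS = {"pair1", "pair2"}
FLAGS = {"noassembly", "nobinning"}


def parse_samples(handle):
    """Parse a tab-separated samples file into {sample: [(file, pair, flags), ...]}."""
    samples = {}
    for lineno, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        row = [field.strip() for field in row]
        if not any(row):
            continue  # this sketch ignores blank lines
        if len(row) < 3:
            raise ValueError(f"line {lineno}: expected at least 3 tab-separated columns")
        sample, filename, pair, *flags = row
        flags = [flag for flag in flags if flag]  # drop empty trailing fields
        if pair not in PAIRS:
            raise ValueError(f"line {lineno}: third column must be pair1 or pair2")
        unknown = set(flags) - FLAGS
        if unknown:
            raise ValueError(f"line {lineno}: unexpected value(s) {sorted(unknown)}")
        samples.setdefault(sample, []).append((filename, pair, tuple(flags)))
    return samples


example = io.StringIO(
    "Sample1\treadfileA_1.fastq\tpair1\n"
    "Sample1\treadfileA_2.fastq\tpair2\n"
    "Sample3\treadfileD_1.fastq\tpair1\tnoassembly\n"
    "Sample3\treadfileD_2.fastq\tpair2\tnoassembly\n"
)
samples = parse_samples(example)
for name, entries in samples.items():
    print(name, entries)
```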

The first column indicates the sample id (this will be the project name in sequential mode), the second contains the file names of the sequences, and the third specifies the pair number of the reads. A fourth optional column can take the noassembly value, indicating that this sample must not be assembled with the rest (but will be mapped against the assembly to get abundances). This is useful, for example, for RNA-seq reads, which can hamper the assembly but which we still want mapped in order to obtain transcript abundances for the genes in the assembly. Similarly, an extra column with the nobinning value can be included in order to avoid using those samples for binning. Notice that a sample can have more than one set of paired reads.

The sequence files can be in fastq or fasta format, and can be gzipped. If a sample contains paired libraries, it is the user’s responsibility to make sure that the forward and reverse files are truly paired (i.e. they contain the same number of reads in the same order). Some quality filtering / trimming tools may produce unpaired filtered fastq files from paired input files (particularly if run without the right parameters). This may result in SqueezeMeta failing or producing incorrect results.
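Whether two files contain the same reads in the same order can be checked before running the pipeline. The following is a minimal sketch in Python, assuming fastq input with exactly four lines per record and read names that match once an optional /1 or /2 suffix is removed. The function names are hypothetical, fasta input is not handled, and other read naming schemes may need a different comparison.

```python
import gzip
import itertools
import os
import tempfile


def read_ids(path):
    """Yield read identifiers from a four-line-per-record fastq file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Header lines are lines 0, 4, 8, ... under the four-line assumption.
        for header in itertools.islice(handle, 0, None, 4):
            fields = header[1:].split()
            if not fields:
                continue  # tolerate a trailing blank line
            name = fields[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]  # drop the mate suffix
            yield name


def truly_paired(forward, reverse):
    """Return True if both files list the same reads in the same order."""
    pairs = itertools.zip_longest(read_ids(forward), read_ids(reverse))
    return all(fwd_id == rev_id for fwd_id, rev_id in pairs)


# Small demonstration with temporary files.
record = "@{}\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    fwd = os.path.join(tmp, "reads_1.fastq")
    rev = os.path.join(tmp, "reads_2.fastq")
    bad = os.path.join(tmp, "filtered_2.fastq")
    with open(fwd, "w") as out:
        out.write(record.format("r1/1") + record.format("r2/1"))
    with open(rev, "w") as out:
        out.write(record.format("r1/2") + record.format("r2/2"))
    with open(bad, "w") as out:
        out.write(record.format("r2/2"))  # r1 was dropped, e.g. by a filtering tool
    intact = truly_paired(fwd, rev)
    broken = truly_paired(fwd, bad)

print(intact, broken)  # prints: True False
```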

Restart

Any interrupted SqueezeMeta run can be restarted using the --restart flag. It has the syntax:

SqueezeMeta.pl -p <projectname> --restart

This command will restart the run of that project by reading the progress.txt file to find out the point where the run stopped.

Alternatively, the run can be restarted from a specific step by issuing the command:

SqueezeMeta.pl -p <projectname> --restart -step <step_to_restart_from>

By default, already completed steps will not be repeated when restarting, even if requested with -step. In order to repeat already completed steps you must also provide the flag --force_overwrite. For example

SqueezeMeta.pl --restart -p <projectname> -step 6 --force_overwrite

would restart the pipeline from the taxonomic assignment of genes. The different steps of the pipeline are listed in Scripts, output files and file format.

Note

When calling SqueezeMeta with --restart, other parameters will be ignored. If you want to change the configuration of your run, you will need to edit the /path/to/project/SqueezeMeta_conf.pl file and change the relevant settings there before calling SqueezeMeta.pl --restart -p <projectname>.
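The restart variants described in this section can be summarised in a small helper. This is a minimal sketch in Python; the function name restart_command and the project name myproject are hypothetical. It deliberately accepts only the project, the step and the overwrite flag, since other parameters are ignored on restart, and it builds the command without running it.

```python
import shlex


def restart_command(project, step=None, force_overwrite=False):
    """Return a SqueezeMeta.pl restart call as an argument list; nothing is executed."""
    cmd = ["SqueezeMeta.pl", "-p", project, "--restart"]
    if step is not None:
        cmd += ["-step", str(step)]
    if force_overwrite:
        # Needed to repeat steps that were already completed.
        cmd.append("--force_overwrite")
    return cmd


# Resume where the run stopped.
print(shlex.join(restart_command("myproject")))
# Redo everything from step 6 onwards, overwriting previous results.
print(shlex.join(restart_command("myproject", step=6, force_overwrite=True)))
```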

Running scripts

Any individual script of the pipeline can also be run on its own, using the syntax:

<script> <projectname>

(for instance, 04.rundiamond.pl <projectname> to repeat the DIAMOND run for the project).