Scripts, output files and file format
Note
Most of the information contained in the output files listed below can be more easily explored through The SQMtools R package.
Upon startup, the pipeline will initially create the following files:
<project>/creator.txt: text file containing the SqueezeMeta version when the project was created<project>/SqueezeMeta_conf.pl: run configuration. You will need to edit this file and restart if you want to change parameters mid-run<project>/parameters.pl: additional parameters<project>/methods.txt: text file containing names a citations of the software that is called during the run<project>/syslog: text file containing the commands being called by the pipeline and their STDOUT/STDERR outputs, useful for debugging
Step 1: Assembly
Script: 01.run_assembly.pl
Files produced
<project>/results/01.<project>.fasta: FASTA file containing the contigs resulting from the assembly<project>/intermediate/01.<project>.lon: Length of the contigs<project>/intermediate/01.<project>.stats: Some statistics on the assembly (N50, N90, number of reads)
Note
The merged/seqmerge modes will also produce a .fasta and a .lon file for each sample)
Step 2: RNA finding
Script: 02.run_barrnap.pl
Files produced
<project>/results/02.<project>.rnas: FASTA file containing all rRNAs and tRNAs found in the assembly<project>/results/02.<project>.16S: Assignment (RDP classifier) for the 16S rRNAs sequences<project>/intermediate/02.<project>.maskedrna.fasta: Fasta file containing the contigs resulting from the assembly, masking the positions where a rRNA/tRNA was found.
Step 3: Gene prediction
Script: 03.run_prodigal.pl
Files produced
<project>/results/03.<project>.fna: Nucleotide sequences for predicted ORF<project>/results/03.<project>.faa: Aminoacid sequences for predicted ORF<project>/results/03.<project>.gff: Features and position in contigs for each of the predicted genes (this file will be moved to theintermediatedirectory if the-Doption is selected)
Step 4: Homology searching against taxonomic (nr) and functional (COG, KEGG) databases
Script: 04.rundiamond.pl
Files produced
<project>/intermediate/04.<project>.nr.diamond: result of the homology search against the nr database<project>/intermediate/04.<project>.kegg.diamond: result of the homology search against the KEGG database<project>/intermediate/04.<project>.eggnog.diamond: result of the homology search against the eggNOG database<project>/intermediate/DB_BUILD_DATE: date at which the SqueezeMeta database was originally created
Note
If additional databases were provided using the -extdb option, this script will create additional diamond result files for each database
Step 5: HMM search for Pfam database
Script: 05.run_hmmer.pl
Files produced
<project>/intermediate/05.<project>.pfam.hmm: results of the HMM search against the Pfam database
Step 6: Taxonomic assignment
Script: 06.lca.pl
Files produced
<project>/results/06.<project>.fun3.tax.wranks: taxonomic assignments for each ORF, including taxonomic ranks<project>/results/06.<project>.fun3.tax.noidfilter.wranks: same as above, but the assignment is done without considering identity filters (see The LCA algorithm)
Note
These files will be moved to the intermediate directory if the -D option is selected
Step 7: Functional assignment
Script: 07.fun3assign.pl
Files produced
<project>/results/07.<project>.fun3.cog: PFAM functional assignment for each ORF<project>/results/07.<project>.fun3.kegg: PFAM functional assignment for each ORF
Format of these files:
Column 1: Name of the ORF
Column 2: Best hit assignment
Column 3: Best average assignment (see The fun3 algorithm)
Note
These files will be moved to the
intermediatedirectory if the-Doption is selectedIf additional databases were provided using the
-optdboption, this script will create additional result files for each database
<project>/results/07.<project>.pfam: PFAM functional assignment for each ORF
Step 8: Blastx on parts of the contigs without gene prediction or without hits
Script: 08.blastx.pl
This script will only be executed if the -D option was selected.
Files produced
<project>/results/08.<project>.gff: features and position in contigs for each of the Prodigal and BlastX ORFs Blastx<project>/results/08.<project>.fun3.tax.wranks: taxonomic assignment for the mix of Prodigal and BlastX ORFs, including taxonomic ranks<project>/results/08.<project>.fun3.tax.noidfilter.wranks: same as above, but the assignment is done without considering identity filters (see The LCA algorithm)<project>/results/08.<project>.fun3.cog: COG functional assignment for the mix of Prodigal and BlastX ORFs<project>/results/08.<project>.fun3.kegg: KEGG functional assignment for the mix of Prodigal and BlastX ORFs<project>/intermediate/blastx.fna: nucleotide sequences for BlastX ORFs
Note
If additional databases were provided using the -optdb option, this script will create additional result files for each database
Step 9: Taxonomic assignment of contigs
Script: 09.summarycontigs3.pl
Files produced
<project>/intermediate/09.<project>.contiglog: consensus taxonomic assignment for the contigs (see Consensus taxonomic annotation for contigs and bins)
Format of the file:
Column 1: name of the contig
Column 2: taxonomic assignment, with ranks
Column 3: lower rank of the assignment
Column 4: disparity value (see Disparity calculation)
Column 5: number of genes in the contig
Step 10: Mapping of reads to contigs and calculation of abundance measures
Script: 10.mapsamples.pl
Files produced
<project>/results/10.<project>.mappingstat:<project>/intermediate/10.<project>.mapcount: several measurements regarding mapping of reads to ORFs
Format of the file:
Column 1: ORF name
Column 2: ORF length (nucleotides)
Column 3: number of reads mapped to that ORF
Column 4: number of bases mapped to that ORF
Column 5: RPKM value for the ORF
Column 6: coverage value for the ORF (Bases mapped / ORF length)
Column 7: TPM value for the ORF
Column 8: sample to which these abundance values correspond
<project>/intermediate/10.<project>.contigcov: several measurements regarding mapping of reads to contigs
Format of the file:
Column 1: ORF name
Column 2: coverage value for the contig
Column 3: RPKM value for the contig
Column 4: TPM value for the contig
Column 5: contig length (nucleotides)
Column 6: number of reads mapped to that contig
Column 7: number of bases mapped to that contig
Column 8: sample to which these abundance values correspond
Step 11: Calculation of the abundance of all taxa
Script: 11.mcount.pl
Files produced
<project>/results/11.<project>.mcount
Format of the file:
Column 1: taxonomic rank for the taxon
Column 2: taxon
Column 3: accumulated contig size: Sum of the length of all contigs for that taxon
Column 4 (and all even columns from this one): number of reads mapping to the taxon in the corresponding sample
Column 5 (and all odd columns from this one): number of bases mapping to the taxon in the corresponding sample
Step 12: Calculation of the abundance of all functions
Script: 12.funcover.pl
Files produced
<project>/ext_tables/12.<project>.cog.stamp: COG function table for STAMPColumn 1: functional class for the COG
Column 2: COG ID and function name
Column 3 and above: abundance of reads for that COG in the corresponding sample
<project>/ext_tables/12.<project>.kegg.stamp: KEGG function table for STAMPColumn 1: KEGG ID and function name
Column 2 and above: abundance of reads for that KEGG in the corresponding sample
<project>/results/12.<project>.cog.funcover: Several measurements of the abundance and distribution of each COGColumn 1: COG ID
Column 2: sample name
Column 3: number of different ORFs of this function in the corresponding sample (copy number)
Column 4: sum of the length of all ORFs of this function in the corresponding sample (Total length)
Column 5: sum of the bases mapped to all ORFs of this function in the corresponding sample (Total bases)
Column 6: coverage of the function (Total bases / Total length)
Column 7: TPM value for the function
Column 9: number of the different taxa per rank (k: kingdom, p: phylum; c: class; o: order; f: family; g: genus; s: species) in which this COG has been found
Column 10: function of the COG
<project>/results/12.<project>.kegg.funcover: several measurements of the abundance and distribution of each KEGG. This has the same format as thecog.funcoverfile but replacing COGs by KEGGs. Additionally, the function of the KEGG will be present in column 11, while column 10 will contain the name of the KEGG
Note
If additional databases were provided using the -extdb option, this script will create additional result files for each database
Step 13: Creation of the ORF table
Script: 13.mergeannot2.pl
Files produced
<project>/results/13.<project>.orftableColumn 1: ORF name
Column 2: Contig name
Column 3: molecule (CDS or RNA)
Column 4: method of ORF prediction (prodigal, barrnap, blastx)
Column 5: ORF length (nucleotides)
Column 6: ORF length (amino acids)
Column 7: GC percentage for the ORF
Column 8: Gene name
Column 9: Taxonomy for the ORF
Column 10: KEGG ID for the ORF (If a
*sign is shown here, it means that the functional assignment was done by both best hit and best average scores, therefore is more reliable. Otherwise, the assignment was done using just the best hit, but there is evidence of a conflicting annotation)Column 11: KEGG function
Column 12: KEGG functional class
Column 13: COG ID for the ORF (If a * sign is shown here, it means that the functional assignment was done by both best hit and best average scores, therefore is more reliable. Otherwise, the assignment was done using just the best hit, but there is evidence of a conflicting annotation)
Column 14: COG function
Column 15: COG functional class
Column 16: function in the external database provided
Column 17: Pfam annotation
Column 18 and beyond: TPM, coverage, read count and base count for the ORF in the different samples
Note
If additional databases were provided using the -extdb option, functions and functional classes will be shown for each of them after column 15
Step 14: Binning
Script: 14.runbinning.pl
Files produced
<project>/intermediate/binners/maxbin: directory containing fasta files with the contigs assigned to each bin by MaxBin (if selected)<project>/intermediate/binners/metabat: directory containing fasta files with the contigs assigned to each bin by MetaBAT 2 (if selected)<project>/intermediate/binners/concoct: directory containing fasta files with the contigs assigned to each bin by CONCOCT (if selected)
Step 15: Merging bins with DAS Tool
Script: 15.dastool.pl
Files produced
<project>/results/bins: directory containing fasta files with the contigs associated to each bin after integrating the results for all binners with DAS Tool. If only one binner was selected, DAS Tool will not be run and the directory will instead contain the results for that binner
Step 16: Taxonomic assignment of bins
Script: 16.addtax2.pl
Files produced
One taxonomy file for each fasta in the
<project>/results/binsdirectory<project>/intermediate/16.<project>.bintax: consensus taxonomic assignment for the bins (see Consensus taxonomic annotation for contigs and bins)Column 1: binning method
Column 2: name of the bin
Column 3: taxonomic assignment for the bin, with ranks
Column 4: size of the bin (accumulated sum of contig lengths)
Column 5: disparity of the bin (see Disparity calculation)
Note
Note that the taxonomy generated here is the consensus from the individual taxonomic assignments for each contig in the bin, not a GTDB-Tk taxonomy (which would be more precise). That can be achieved by adding the –gtdbtk flag, and is obtained during Step 17: Running CheckM2 and optionally GTDB-Tk on bins
Step 17: Running CheckM2 and optionally GTDB-Tk on bins
Script: 17.checkbins.pl
Files produced
<project>/intermediate/17.<project>.checkM: Raw output from CheckM2- If
--gtdbtkis specified when running SqueezeMeta, also: <project>/intermediate/17.<project>.gtdbtk: GTDB-Tk output for archaeal and bacterial bins combined
- If
Step 18: Creation of the bin table
Script: 18.getbins.pl
Files produced
<project>/intermediate/18.<project>.bincov: coverage and TPM values for each binColumn 1: bin name
Column 2: binning method
Column 3: coverage of the bin in the corresponding sample (Sum of bases from reads in the sample mapped to contigs in the bin / Sum of length of contigs in the bin)
Column 4: TPM for the bin in the corresponding sample (Sum of reads from the corresponding sample mapping to contigs in the bin x 10^6 / Sum of length of contigs in the bin x Total number of reads)
Column 5: sample name
<project>/results/18.<project>.bintable: compilation of all data for binsColumn 1: bin name
Column 2: binning method
Column 3: taxonomic annotation (from the annotations of the contigs)
Column 4: taxonomy for the 16S rRNAs if the bin (if any)
Column 5: bin size (sum of length of the contigs)
Column 6: GC percentage for the bin
Column 7: number of contigs in the bin
Column 8: disparity of the bin
Column 9: completeness of the bin (CheckM2)
Column 10: contamination of the bin (CheckM2)
Column 11: strain heterogeneity of the bin (Empty, since CheckM2 does not provide it, but the field is still present for backwards compatibility)
Column 12 and beyond: coverage and TPM values for the bin in each sample.
Note
If GTDB-Tk was run to classify the bins by adding the -gtdbtk option, an additional column named Tax GTDB-Tk will be present after column 4 in the file <project>/results/18.<project>.bintable
Step 19: Creation of the contig table
Script: 19.getcontigs.pl
Files produced
<project>/intermediate/19.<project>.contigsinbins: list of contigs and corresponding bins<project>/results/19.<project>.contigtable: compilation of data for contigsColumn 1: contig name
Column 2: taxonomic annotation for the contig (from the annotations of the ORFs)
Column 3: disparity of the contig
Column 4: GC percentage for the contig
Column 5: contig length
Column 6: number of genes in the contig
Column 7: bin to which the contig belong (if any)
Column 8 and beyond: values of coverage, TPM and number of mapped reads for the contig in each sample
Step 20: Prediction of pathway presence in bins using MinPath
Script: 20. minpath.pl
Files produced
<project>/results/20.<project>.kegg.pathways: prediction of KEGG pathways in binsColumn 1: bin name
Column 2: taxonomic annotation for the bin
Column 3: number of KEGG pathways found
Column 4 and beyond: NF indicates that the pathway was not predicted. A number shows that the pathway was predicted to be present, and correspond to the number of enzymes of that pathway that were found.
<project>/results/20.<project>.metacyc.pathways: prediction of Metacyc pathways in bins. Format is similar as for the file above
Step 21: Final statistics for the run
Script: 21.stats.pl
Files produced
<project>/results/21.<project>.stats: several statistics regarding ORFs, contigs and bins
Step 22: Calculation of summary tables for the project
Script: sqm2tables.py
Files produced
This script is executed with default parameters at the end of a SqueezeMeta run, and its results are placed in the <project>/results/tables directory. You may still want to run it on your own if you want to use non-default parameters. A list of output files can be found here. This script is executed with default parameters at the end of a SqueezeMeta run, and its results are placed in the <project>/results/tables directory. You may still want to run it on your own if you want to use non-default parameters. A list of output files can be found here.