Utility scripts
Compressing a SqueezeMeta project into a zip file
sqm2zip.py
This script generates a compressed zip file with all the essential information needed to load a SqueezeMeta project into The SQMtools R package. If the directory /path/to/project/results/tables is not present, it will also run sqm2tables.py to generate the required tables (see below).
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
sqm2zip.py <project_path> <output_dir> [options]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
- [output_dir <path>]
Output directory
Options
- [--trusted-functions]
Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables. This will be ignored if the /path/to/project/results/tables directory already exists
- [--ignore-unclassified]
Ignore reads without assigned functions for TPM calculation (KO, COG, PFAM). This will be ignored if the /path/to/project/results/tables directory already exists
- [--force-overwrite]
Write results even if the output file already exists
- [--doc]
Print the documentation
Output
A zip file named <project_name>.zip in the directory specified by output_dir. This file can be loaded directly into The SQMtools R package using the loadSQM function.
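If the script is called from a Python workflow, the command line can be assembled programmatically. The following is a minimal sketch that only builds the argument list and does not run anything. The function name `build_sqm2zip_command` is hypothetical, and the long options are assumed to be spelled with a double hyphen, as listed above.

```python
import shlex

def build_sqm2zip_command(project_path, output_dir, trusted_functions=False,
                          ignore_unclassified=False, force_overwrite=False):
    """Return the argument list for a sqm2zip.py call (nothing is executed here)."""
    cmd = ["sqm2zip.py", project_path, output_dir]
    if trusted_functions:
        cmd.append("--trusted-functions")
    if ignore_unclassified:
        cmd.append("--ignore-unclassified")
    if force_overwrite:
        cmd.append("--force-overwrite")
    return cmd

# Placeholder paths; replace with a real project and output directory
cmd = build_sqm2zip_command("/path/to/project", "/path/to/output", force_overwrite=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once sqm2zip.py is available in your PATH.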
Generating summary tables
sqm2tables.py
This script generates tabular outputs from a SqueezeMeta run. It will aggregate the abundances of the ORFs assigned to the same feature (be it a given taxon or a given function) and produce tables with features in rows and samples in columns. Note that if you want to create tables coming from a sqm_reads.pl or sqm_longreads.pl run you will need to use the sqmreads2tables.py script instead.
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Note
This script is now run automatically with default parameters at the end of a SqueezeMeta run, placing the results in the /path/to/project/results/tables directory. You may still want to run it on your own if you want to use non-default parameters.
Usage
sqm2tables.py <project_path> <output_dir> [options]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
- [output_dir <path>]
Output directory
Options
- [--trusted-functions]
Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables
- [--ignore-unclassified]
Ignore reads without assigned functions for TPM calculation
- [--force-overwrite]
Write results even if the output file already exists
- [--doc]
Print the documentation
Output
- <project_name>.orfs.sequences.tsv: ORF sequences
- <project_name>.contigs.sequences.tsv: contig sequences
- <project_name>.orf.tax.allfilter.tsv: taxonomy of each ORF at the different taxonomic levels. Minimum identity cutoffs for taxonomic assignment are applied to all taxa
- <project_name>.orf.tax.prokfilter.tsv: taxonomy of each ORF at the different taxonomic levels. Minimum identity cutoffs for taxonomic assignment are applied to bacteria and archaea, but not to eukaryotes
- <project_name>.orf.tax.nofilter.tsv: taxonomy of each ORF at the different taxonomic levels. No identity cutoffs for taxonomic assignment are applied
- <project_name>.orf.marker.genes.tsv: CheckM1 marker genes present in each ORF
- <project_name>.orf.16S.tsv: RDP taxonomy of the ORFs containing a 16S rRNA gene according to barrnap
- <project_name>.contig.tax.allfilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs. Minimum identity cutoffs for taxonomic assignment are applied to all taxa
- <project_name>.contig.tax.prokfilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs. Minimum identity cutoffs for taxonomic assignment are applied to bacteria and archaea, but not to eukaryotes
- <project_name>.contig.tax.nofilter.tsv: consensus taxonomy of each contig at the different taxonomic levels, based on the taxonomy of their constituent ORFs. No identity cutoffs for taxonomic assignment are applied
- <project_name>.bin.tax.tsv: consensus taxonomy of each bin at the different taxonomic levels, based on the taxonomy of their constituent contigs
Note
See a deeper discussion on the use of identity cutoffs in taxonomic annotation here.
- <project_name>.RecA.tsv: coverage of RecA (COG0468) in the different samples.
- For each taxonomic rank (superkingdom, phylum, class, order, family, genus, species) the script will produce the following files:
  - <project_name>.<rank>.allfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples, applying the identity cutoffs for taxonomic assignment
  - <project_name>.<rank>.prokfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to eukaryotes
  - <project_name>.<rank>.nofilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. No identity cutoffs for taxonomic assignment are applied
- For each functional classification system (KO, COG, PFAM, and any external database provided by the user) the script will produce the following files:
  - <project_name>.<classification>.names.tsv: extended description of the functional categories in that classification system. For KO and COG the file will contain three fields: ID, Name and Path within the functional hierarchy. For external databases, it will contain only ID and Name.
  - <project_name>.<classification>.abunds.tsv: raw read counts of each functional category in the different samples
  - <project_name>.<classification>.bases.tsv: raw base counts of each functional category in the different samples
  - <project_name>.<classification>.copyNumber.tsv: average copy numbers per genome of each functional category in the different samples. Copy numbers are obtained by dividing the aggregate coverage of each function in each sample by the coverage of RecA (COG0468) in each sample.
  - <project_name>.<classification>.tpm.tsv: normalized (TPM) abundances of each functional category in the different samples. This normalization takes into account both sequencing depth and gene length
Note
The --ignore-unclassified flag can be used to control whether unclassified ORFs are counted towards the total for TPM normalization
Note
There are more advanced ways of calculating copy numbers than normalizing by RecA coverage. These can be accessed through The SQMtools R package
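The two normalizations described above can be illustrated with a short sketch. It assumes the standard TPM definition (reads divided by ORF length, rescaled so that each sample sums to one million) and the RecA ratio stated above. The function names, the "Unclassified" label and the example values are hypothetical, and the actual script may differ in its details.

```python
from collections import defaultdict

UNCLASSIFIED = "Unclassified"  # label chosen for this sketch only

def function_tpm(orfs, ignore_unclassified=False):
    """TPM per function for one sample.

    orfs: iterable of (function_id or None, read_count, orf_length_bp).
    """
    rates = defaultdict(float)
    for function_id, reads, length_bp in orfs:
        key = UNCLASSIFIED if function_id is None else function_id
        rates[key] += reads / length_bp  # length-normalized abundance
    if ignore_unclassified:
        # Drop unclassified ORFs so they do not count towards the total
        rates.pop(UNCLASSIFIED, None)
    total = sum(rates.values())
    return {key: 1e6 * rate / total for key, rate in rates.items()}

def copy_numbers(function_coverage, reca_coverage):
    """Divide the aggregate coverage of each function by the RecA coverage."""
    return {f: cov / reca_coverage for f, cov in function_coverage.items()}

# Hypothetical ORFs from a single sample: (function, reads, length in bp)
orfs = [("K00001", 100, 1000), ("K00002", 50, 500), (None, 200, 1000)]
print(function_tpm(orfs))
print(function_tpm(orfs, ignore_unclassified=True))
print(copy_numbers({"K00001": 30.0, "K00002": 5.0}, reca_coverage=10.0))
```

In this toy sample, ignoring the unclassified ORF raises the TPM of the two classified functions, since the same length-normalized abundances are divided by a smaller total.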
sqmreads2tables.py
This script generates tabular outputs from a sqm_reads.pl or sqm_longreads.pl run. It will aggregate the abundances of the ORFs assigned to the same feature (be it a given taxon or a given function) and produce tables with features in rows and samples in columns. It can optionally accept a query argument to generate tables containing only certain taxa and functions.
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
sqmreads2tables.py <project_path> <output_dir> [options]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
- [output_dir <path>]
Output directory
Options
- [-q/--query <string>]
Filter the results based on the provided query in order to create tables containing only certain taxa or functions. See Query syntax
- [--trusted-functions]
Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables
- [--force-overwrite]
Write results even if the output file already exists
- [--doc]
Print the documentation
Output
- For each taxonomic rank (superkingdom, phylum, class, order, family, genus, species) the script will produce the following files:
  - <project_name>.<rank>.allfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples, applying the identity cutoffs for taxonomic assignment
  - <project_name>.<rank>.prokfilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. Identity cutoffs for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs but not to eukaryotes. See Taxonomic annotation of eukaryotic ORFs
  - <project_name>.<rank>.nofilter.abund.tsv: raw abundances of each taxon for that taxonomic rank in the different samples. No identity cutoffs for taxonomic assignment are applied
- For each functional classification system (KO, COG, PFAM, and any external database provided by the user) the script will produce the following files:
  - <project_name>.<classification>.abunds.tsv: raw abundances of each functional category in the different samples
  - <project_name>.<classification>.names.tsv: extended description of the functions in that classification system. For KO and COG the file will contain three fields: ID, Name and Path within the functional hierarchy. For external databases, it will contain only ID and Name
Query syntax
Note
This syntax is used by two different scripts:
- sqmreads2tables.py, in order to filter reads annotated with sqm_reads.pl or sqm_longreads.pl
- anvi-filter-sqm.py, in order to filter an anvi’o database obtained after running anvi-load-sqm.py on a SqueezeMeta project
Please enclose query strings within double quotes.
Queries are combinations of relational operations in the form of <SUBJECT> <OPERATOR> <VALUE> (e.g. "PHYLUM == Bacteroidota") joined by logical operators (AND, OR). Parentheses can be used to group operations together.
- The AND and OR logical operators can’t appear together in the same expression. Parentheses must be used to separate them into different expressions. For example,
"GENUS == Escherichia OR GENUS == Prevotella AND FUN CONTAINS iron"
would not be valid. Parentheses must be used to write either of the following expressions:
  - "(GENUS == Escherichia OR GENUS == Prevotella) AND FUN CONTAINS iron" to select features from either Escherichia or Prevotella which contain ORFs related to iron
  - "GENUS == Escherichia OR (GENUS == Prevotella AND FUN CONTAINS iron)" to select all features from Escherichia and any feature from Prevotella which contains ORFs related to iron
- Another example query would be:
"(PHYLUM == Bacteroidota OR CLASS IN [Alphaproteobacteria, Gammaproteobacteria]) AND FUN CONTAINS iron AND Sample1 > 1"
This would select all the features assigned to either the Bacteroidota phylum or the Alphaproteobacteria or Gammaproteobacteria classes, that also contain the substring "iron" in the functional annotations of any of their ORFs, and whose abundance in Sample1 is higher than 1
- Possible subjects are:
  - FUN: search within all the combined databases used for functional annotation
  - FUNH: search within the KEGG BRITE and COG functional hierarchies (e.g. "FUNH CONTAINS Carbohydrate metabolism" will select all the features containing a gene associated with the broad "Carbohydrate metabolism" category)
  - SUPERKINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES: search within the taxonomic annotation at the requested taxonomic rank
  - <SAMPLE_NAME> (for anvi-filter-sqm.py only): search within the anvi’o abundances (mean coverage of a split divided by the overall sample mean coverage) in the sample named <SAMPLE_NAME> (e.g. if you have two samples named Sample1 and Sample2, the query string "Sample1 > 0.5 AND Sample2 > 0.5" would return the splits with an anvi’o abundance higher than 0.5 in both samples)
Possible relational operators are ==, !=, >=, <=, >, <, IN, NOT IN, CONTAINS, DOES NOT CONTAIN
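The rule that AND and OR cannot share an expression can be checked before a query is submitted. The sketch below is not the parser used by the scripts. It is a hypothetical helper (`mixes_and_or`) that tokenizes on whitespace and parentheses and reports whether both operators occur at the same parenthesis level. It assumes that values never contain parentheses or the bare words AND and OR.

```python
import re

def mixes_and_or(query):
    """Return True if AND and OR appear together at the same parenthesis level."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    stack = [set()]  # one set of logical operators per open parenthesis group
    for token in tokens:
        if token == "(":
            stack.append(set())
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced parentheses")
            stack.pop()
        elif token in ("AND", "OR"):
            stack[-1].add(token)
            if len(stack[-1]) == 2:
                return True
    if len(stack) != 1:
        raise ValueError("unbalanced parentheses")
    return False

invalid = "GENUS == Escherichia OR GENUS == Prevotella AND FUN CONTAINS iron"
valid = "(GENUS == Escherichia OR GENUS == Prevotella) AND FUN CONTAINS iron"
print(mixes_and_or(invalid))  # True: needs parentheses
print(mixes_and_or(valid))    # False: operators are in separate groups
```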
combine-sqm-tables.py
Combine tabular outputs from different projects generated either with SqueezeMeta or sqm_(long)reads (but not both at the same time). If the directory /path/to/project/results/tables is not present, it will also run sqm2tables.py or sqmreads2tables.py to generate the required tables.
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Note
The recommended way of doing this is now to use The SQMtools R package
- The loadSQM function accepts an arbitrary number of SqueezeMeta projects, loading them into a single SQM object
- The combineSQMlite function can be used to combine previously loaded SqueezeMeta and sqm_(long)reads projects into a single object. An advantage of this over combine-sqm-tables.py is that it can be used to combine projects coming from both SqueezeMeta and sqm_(long)reads at the same time.
Usage
combine-sqm-tables.py <project_paths> [options]
Arguments
Mandatory parameters (positional)
- [project_paths <paths>]
A space-separated list of paths
Options
- [-f|--paths-file <path>]
File containing the paths of the SqueezeMeta projects to combine, one path per line
- [-o|--output-dir <path>]
Output directory (default: "combined")
- [-p|--output-prefix]
Prefix for the output files (default: "combined")
- [--trusted-functions]
Include only ORFs with highly trusted KEGG and COG assignments in aggregated functional tables. This will be ignored if the /path/to/project/results/tables directory already exists
- [--ignore-unclassified]
Ignore reads without assigned functions for TPM calculation. This will be ignored if the /path/to/project/results/tables directory already exists or if --sqmreads is passed
- [--sqmreads]
Projects were generated using sqm_reads.pl or sqm_longreads.pl
- [--force-overwrite]
Write results even if the output directory already exists
- [--doc]
Print the documentation
Example calls
- Combine projects /path/to/proj1 and /path/to/proj2 and store output in a directory named "output_dir"
combine-sqm-tables.py /path/to/proj1 /path/to/proj2 -o output_dir
- Combine a list of projects contained in a file, use default output dir
combine-sqm-tables.py -f project_list.txt
Output
Tables containing aggregated counts and feature names for the different functional hierarchies and taxonomic levels for each sample contained in the different projects that were combined. Tables with the TPM and copy number of functions will also be generated for SqueezeMeta runs, but not for sqm_(long)reads runs.
Estimation of the sequencing depth needed for a project
cover.pl
COVER is intended to help in the experimental design of metagenomics studies by addressing the unavoidable question: How much should I sequence to get good results? Or, the other way around: I can spend this much money; would it be worth spending it on sequencing the metagenome?
To answer these questions, COVER estimates the amount of sequencing needed to achieve a particular objective, namely a given coverage for the N most abundant members of the microbiome (from now on, OTUs). For instance, how much sequence is needed to reach 5x coverage for the four most abundant OTUs? COVER was first published in 2012 (Tamames et al., 2012, Environ Microbiol Rep. 4:335-41), but the version used here differs from the algorithm described there. Details on this implementation can be found in The COVER algorithm.
COVER needs information on the composition of the microbiome, and that must be provided as a file containing 16S rRNA sequences obtained by amplicon sequencing of the target microbiome. If you don’t have that, you can look for a similar sample already sequenced (for instance, in NCBI’s SRA).
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
cover.pl -i <input_file> [options]
Arguments
Mandatory parameters
- [-i <path>]
FASTA file containing 16S rRNA amplicons
Options
- [-idcluster <float>]
Identity threshold for collapsing OTUs (default: 0.98)
- [-c|-coverage <float>]
Target coverage (default: 5)
- [-r|-rank <integer>]
Rank of target OTU (default: 4)
Note
Default values imply looking for 5x coverage for the 4th most abundant 98% OTU
- [-cl|-classifier <mothur|rdp>]
Classifier to use (RDP or Mothur) (default: mothur)
- [-d|-dir]
Output directory (default: cover)
- [-t]
Number of threads (default: 4)
Output
The output is a table that first lists the amount of sequencing needed, both uncorrected and corrected by Good’s estimator:
Needed 4775627706 bases, uncorrected
Correcting by unobserved: 6693800053 bases
And then lists the information and coverages for each OTU, with the following columns:
OTU: Name of the OTU
Size: Inferred genomic size of the OTU
Raw abundance: Number of sequences in the OTU
Copy number: Inferred 16S rRNA copy number
Corrected abundance: Abundance n / Σn Abundance
Pi : Probability of sequencing a base of this OTU
%Genome sequenced: Percentage of the genome that will be sequenced for that OTU
Coverage: Coverage that will be obtained for that OTU
Taxon: Deepest taxonomic annotation for the OTU
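The relation between the Pi, Size and Coverage columns can be illustrated with back-of-the-envelope arithmetic. This is only a sketch of what the column definitions imply, not the COVER algorithm itself, and it ignores the correction by Good’s estimator. It assumes that an OTU receives a fraction Pi of all sequenced bases and that coverage is the number of bases from that OTU divided by its genome size. The function names and the example OTU are hypothetical.

```python
def expected_coverage(total_bases, pi, genome_size):
    """Coverage of an OTU that receives a fraction pi of all sequenced bases."""
    return total_bases * pi / genome_size

def bases_needed(target_coverage, pi, genome_size):
    """Total bases required for the OTU to reach the target coverage."""
    return target_coverage * genome_size / pi

# Hypothetical OTU: 4 Mb genome, 1% chance that a sequenced base belongs to it
pi = 0.01
genome_size = 4_000_000
needed = bases_needed(5, pi, genome_size)
print(f"{needed:.3e} bases needed for 5x coverage")
print(round(expected_coverage(needed, pi, genome_size), 2))
```

Under these assumptions, rarer OTUs (smaller Pi) or larger genomes increase the sequencing effort proportionally.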
Adding new databases to an existing project
add_database.pl
This script adds one or several new databases to the results of an existing project. The list of databases must be provided in an external database file as specified in Using external databases for functional annotation. It must be a tab-delimited file with the following format:
<Database Name> <Path to database> <Functional annotation file>
The databases to add must also be formatted in DIAMOND format. See Using external databases for functional annotation for details. If the external database file already exists (because you already used some external databases when running SqueezeMeta), DO NOT create a new one. Instead add the new entries to the existing database file.
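As an illustration of the three-column, tab-delimited layout described above, the following sketch writes one entry to an in-memory file. The database name and both paths are placeholders invented for this example, and `write_database_file` is a hypothetical helper, not part of SqueezeMeta.

```python
import csv
import io

def write_database_file(entries, handle):
    """Write (name, database path, annotation file) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, db_path, annotation_path in entries:
        writer.writerow([name, db_path, annotation_path])

# Placeholder entry; replace with your own database name and paths
entries = [("MyDB", "/path/to/mydb.dmnd", "/path/to/mydb_annotations.txt")]
buffer = io.StringIO()
write_database_file(entries, buffer)
print(buffer.getvalue(), end="")
```

To extend an existing database file rather than create a new one, the same function can be called on a handle opened in append mode.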
The script will run DIAMOND searches for the new databases, and will then re-run several SqueezeMeta scripts to add the new database(s) to the existing results. The following scripts will be invoked:
The outputs of these programs will be regenerated (but all files corresponding to other databases will remain untouched).
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
add_database.pl <project_path> <database_file>
Integration with external tools
Integration with iTOL
sqm2itol.pl
This script generates the files for creating a radial plot of abundances using iTOL (https://itol.embl.de/). It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
sqm2itol.pl <project_path> [options]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
Options
- [-completion <float>]
Select only bins with percent completion above that threshold (default: 30)
- [-contamination <float>]
Select only bins with percent contamination below that threshold (default: 100)
- [-classification <metacyc|kegg>]
Functional classification to use (default: metacyc)
- [-functions <path>]
File containing the names of the functions to be considered (for functional plots). For example:
arabinose degradation
galactose degradation
glucose degradation
Output
The script will generate several datafiles that you must upload to https://itol.embl.de/ to produce the figure.
Integration with iPath
sqm2ipath.pl
This script creates data on the existence of enzymatic reactions that can be plotted in the interactive pathway mapper iPath (http://pathways.embl.de). It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
sqm2ipath.pl <project_path> [options]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
Options
- [-taxon <string>]
Taxon to be plotted (default: plot all taxa)
- [-color <string>]
RGB color to be used in the plot (default: red)
- [-c|-classification <cog|kegg>]
Functional classification to use (default: kegg)
- [-functions <file>]
File containing the COG/KEGG identifiers of the functions to be considered. For example:
K00036
K00038
K00040
K00052 #ff0000
K00053
A second field following the identifier selects the RGB color to be associated with that ID in the plot
The plotting colors can be specified with the -color option, or by associating a color with each of the IDs in the functions file. In the latter case, several colors can be used in the same plot. If no color is specified, the default is red.
- [-o|-out <path>]
Name of the output file (default: ipath.out)
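The layout of the functions file and the color fallback can be illustrated with a small reader. This is a sketch, not the parsing code of sqm2ipath.pl. It assumes that the identifier and the optional color are separated by whitespace, and `parse_functions_file` is a hypothetical name.

```python
def parse_functions_file(lines, default_color="red"):
    """Map each identifier to its own color, or to the default when none is given."""
    colors = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        identifier = fields[0]
        colors[identifier] = fields[1] if len(fields) > 1 else default_color
    return colors

example = ["K00036", "K00038", "K00040", "K00052 #ff0000", "K00053"]
for identifier, color in parse_functions_file(example).items():
    print(identifier, color)
```

Here only K00052 carries its own color. The remaining identifiers fall back to the default, which corresponds to the value given with -color or to red when that option is not used.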
Output
A file suitable to be uploaded to http://pathways.embl.de. Several output files can be combined, for instance using different colors for different taxa.
Integration with Pavian
sqm2pavian.pl
This script produces output files containing abundance of taxa that can be plotted using
the Pavian tool (https://github.com/fbreitwieser/pavian). It works with projects generated with SqueezeMeta.pl, sqm_reads.pl or sqm_longreads.pl. It can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
sqm2pavian.pl <project_path> [mode]
Arguments
Mandatory parameters (positional)
- [project_path <path>]
Path to the SqueezeMeta run
Options (positional)
- [mode <reads|bases>]
Count abundances in reads or bases (default:
reads)
Output
A file named <project>.pavian that can be uploaded to the Pavian app (https://fbreitwieser.shinyapps.io/pavian) or loaded with the pavian R package.
Integration with anvi’o
anvi-load-sqm.py
This script creates an anvi’o database from a SqueezeMeta project. The database can then be filtered and visually explored using the anvi-filter-sqm.py script. This script can be found in the /path/to/SqueezeMeta/utils/anvio_utils directory, but if using conda it will be present in your PATH. For this script to work, anvi’o must be installed and present in your PATH.
Note
This has been tested using anvi’o versions 6 and 7. Only released versions are supported; the master/develop branches of anvi’o might (and likely will) not work.
Usage
anvi-load-sqm.py -p <project> -o <output> [options]
Arguments
Mandatory parameters
- [-p|--project <path>]
Path to the SqueezeMeta run
- [-o|--output <path>]
Output directory
Options
- [--num-threads <int>]
Number of threads (default: 12)
- [--run-HMMS]
Run the anvi-run-hmms command from anvi’o for identifying single-copy core genes
- [--run-scg-taxonomy]
Run the anvi-run-scg-taxonomy command from anvi’o for assigning taxonomy based on single-copy core genes
- [--min-contigs-length <int>]
Minimum length of contigs (default: 0)
- [--min-mean-coverage <float>]
Minimum mean coverage for contigs (default: 0)
- [--skip-SNV-profiling]
Skip the profiling of single nucleotide variants
- [--profile-SCVs]
Perform characterization of codon frequencies in genes
- [--force-overwrite]
Force overwrite if the output directory already exists
- [--doc]
Print the documentation
Output
- CONTIGS.db, PROFILE.db, AUXILIARY-DATA.db: anvi’o databases
- <project_name>_anvio_contig_taxonomy.txt: contig taxonomy to be loaded by anvi-filter-sqm.py
anvi-filter-sqm.py
This script filters the results of a SqueezeMeta project (previously loaded into an anvi’o database by the anvi-load-sqm.py script) and opens an anvi’o interactive interface to examine them. Filtering criteria can be specified by using
a simple query syntax. This script can be found in the /path/to/SqueezeMeta/utils/anvio_utils directory, but if using conda it will be present in your PATH. For this script to work, anvi’o must be installed and present in your PATH.
Note
This has been tested using anvi’o versions 6 and 7. Only released versions are supported; the master/develop branches of anvi’o might (and likely will) not work.
Usage
anvi-filter-sqm.py -p <profile_db> -c <contigs_db> -t <contigs_taxonomy_file> -q <query> [options]
Arguments
Mandatory parameters
- [-p|--profile-db <path>]
anvi’o profile db, as generated by anvi-load-sqm.py
- [-c|--contigs-db <path>]
anvi’o contigs db, as generated by anvi-load-sqm.py
- [-t|--taxonomy <path>]
Contigs taxonomy, as generated by anvi-load-sqm.py
- [-q/--query <string>]
Filter the results based on the provided query in order to visualize only certain taxa or functions at certain abundances. See Query syntax
Options
- [-o/--output_dir <path>]
Output directory for the filtered anvi’o databases (default: filteredDB)
- [-m|--max-splits <int>]
Maximum number of splits to be loaded into anvi’o. If the provided query returns a higher number of splits, the program will stop. The default is 25,000; larger values may make the anvi’o interface respond slowly. Setting --max-splits to 0 will allow an arbitrarily large number of splits to be loaded
- [--enforce-clustering]
Make anvi’o perform an additional clustering based on abundances across samples and sequence composition
- [--extra-anvio-args]
Extra arguments for anvi-interactive, surrounded by quotes (e.g. --extra-anvio-args "--taxonomic-level t_phylum --title Parrot")
- [-s <yolo|safe>]
By default, the script uses an in-house method to subset the anvi’o databases. It’s ~5x quicker than using anvi-split in anvi’o 5, and works well for us. However, the night is dark and full of bugs, so if you feel that your anvi’o view is missing some information, you can call the script with the -s safe parameter. This will call anvi-split, which should be much safer than our hacky solution (default: yolo)
- [--doc]
Print the documentation
Output
The script will produce a subsetted anvi’o database, and call anvi-interactive to open a browser visualization.
Binning refinement
Note
Some binning refinement functions are also available in The SQMtools R package
remove_duplicate_markers.pl
This script attempts to reduce the contamination of bins by identifying duplicated markers (conserved genes for the given taxa that are expected to be single copy but are found to have more than one) in them. Then, it optimizes the removal of contigs containing these duplicated markers so that only one copy of the gene is left, and no other markers are removed.
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
remove_duplicate_markers.pl <project_name> [bin_name]
If no bin name is provided, the script will run the analysis for ALL bins in the project.
Output
The script produces a new fasta file for the bin with the name refined in the binning
directory (usually in <project>/results/bins/bins). It also runs CheckM
again to redo the statistics for the bin(s). The result of that CheckM run is stored in
<project>/temp/checkm_nodupl.txt
find_missing_markers.pl
This script is intended to improve the completeness of a bin, using the CheckM analysis to find contigs from the same taxon as the bin that contain missing markers (those that were not found in any contig of the bin). The user can then decide whether or not to include these contigs in the bin.
This script can be found in the /path/to/SqueezeMeta/utils/ directory, but if using conda it will be present in your PATH.
Usage
find_missing_markers.pl <project_name> [bin_name]
If no bin name is provided, the script will run the analysis for ALL bins in the project.
The script also sets the variable $mode, which affects the selection of contigs. Mode
relaxed will consider contigs from all taxa not contradicting the taxonomy of the bin,
including those that belong to higher-rank taxa (for instance, if the bin is annotated as
Escherichia (genus), the script will also consider contigs classified as
Enterobacteriaceae (family), Gammaproteobacteria (class), or even Bacteria
(superkingdom), since these assignments are not incompatible with that of the bin).
Mode strict will only consider contigs belonging to the same taxon as the bin (in the
example above, only those classified as genus Escherichia).
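The difference between the two modes can be sketched as a comparison of lineages. This is one possible reading of the description above, not the logic of find_missing_markers.pl. Lineages are represented as lists running from superkingdom downwards and truncated at the deepest assigned rank. The function name `is_candidate` is hypothetical, and the example lineages are abbreviated (some ranks are skipped).

```python
def is_candidate(contig_lineage, bin_lineage, mode="relaxed"):
    """Decide whether a contig is taxonomically eligible for a bin."""
    if mode == "relaxed":
        # No contradiction on the ranks that both lineages have in common
        return all(c == b for c, b in zip(contig_lineage, bin_lineage))
    if mode == "strict":
        # The contig must be classified down to the taxon of the bin
        return contig_lineage[:len(bin_lineage)] == bin_lineage
    raise ValueError(f"unknown mode: {mode}")

bin_lineage = ["Bacteria", "Gammaproteobacteria", "Enterobacteriaceae", "Escherichia"]
family_contig = ["Bacteria", "Gammaproteobacteria", "Enterobacteriaceae"]
other_genus = ["Bacteria", "Gammaproteobacteria", "Enterobacteriaceae", "Salmonella"]

print(is_candidate(family_contig, bin_lineage, "relaxed"))  # True
print(is_candidate(family_contig, bin_lineage, "strict"))   # False
print(is_candidate(other_genus, bin_lineage, "relaxed"))    # False
```

Under this reading, a contig with no taxonomic assignment at all would pass in relaxed mode. Whether the actual script treats such contigs as candidates is not stated here.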
Output
The script produces a list of contigs containing missing markers for the bin, sorted by the abundance of markers.