*******
loadSQM
*******

======= ===============
loadSQM R Documentation
======= ===============

Load a SqueezeMeta project into R
---------------------------------

Description
~~~~~~~~~~~

This function takes the path to a project directory generated by
SqueezeMeta (whose name is specified in the ``-p`` parameter of the
SqueezeMeta.pl script) and parses the results into a SQM object.
Alternatively, it can load the project data from a zip file produced by
``sqm2zip.py``.

Usage
~~~~~

.. code:: R

   loadSQM(
     project_path,
     tax_mode = "prokfilter",
     tax_source = "contigs",
     trusted_functions_only = FALSE,
     single_copy_genes = "MGOGs",
     load_sequences = TRUE,
     engine = "data.table"
   )

Arguments
~~~~~~~~~

+----------------------------+------------------------------------------------------+
| ``project_path``           | character, a vector of project directories generated |
|                            | by SqueezeMeta, and/or zip files generated by        |
|                            | ``sqm2zip.py``.                                      |
+----------------------------+------------------------------------------------------+
| ``tax_mode``               | character, which taxonomic classification should be  |
|                            | loaded? SqueezeMeta applies the identity thresholds  |
|                            | described in Luo et al., 2014.                       |
|                            | Use ``allfilter`` for applying the minimum identity  |
|                            | threshold to all taxa, ``prokfilter`` for applying   |
|                            | the threshold to Bacteria and Archaea, but not to    |
|                            | Eukaryotes, and ``nofilter`` for applying no         |
|                            | thresholds at all (default ``prokfilter``).          |
+----------------------------+------------------------------------------------------+
| ``tax_source``             | character, source data used for the taxonomy tables  |
|                            | present in ``SQM$taxa``, either ``"orfs"``,          |
|                            | ``"contigs"``, ``"bins"`` (GTDB bin taxonomy if      |
|                            | available, SQM bin taxonomy otherwise),              |
|                            | ``"bins_gtdb"`` (GTDB bin taxonomy) or               |
|                            | ``"bins_sqm"`` (SQM bin taxonomy). Default           |
|                            | ``"contigs"``.                                       |
+----------------------------+------------------------------------------------------+
| ``trusted_functions_only`` | logical. If ``TRUE``, only highly trusted functional |
|                            | annotations (best hit + best average) will be        |
|                            | considered when generating aggregated function       |
|                            | tables. If ``FALSE``, best hit annotations will be   |
|                            | used (default ``FALSE``). Will only have an effect   |
|                            | if ``project_path`` is not a zip file, and           |
|                            | ``project_path/results/tables`` is not already       |
|                            | present.                                             |
+----------------------------+------------------------------------------------------+
| ``single_copy_genes``      | character, source of single copy genes for copy      |
|                            | number normalization, either ``"RecA"`` (COG0468,    |
|                            | RecA/RadA), ``"MGOGs"`` (COGs for 10 single copy and |
|                            | housekeeping genes, Salazar, G *et al.*, 2019),      |
|                            | ``"MGKOs"`` (KOs for 10 single copy and housekeeping |
|                            | genes, Salazar, G *et al.*, 2019) or ``"USiCGs"``    |
|                            | (KOs for 15 single copy genes, Carr *et al.*, 2013,  |
|                            | Table S1). For ``"MGOGs"``, ``"MGKOs"`` and          |
|                            | ``"USiCGs"``, the median coverage of a set of single |
|                            | copy genes will be used for normalization. Default   |
|                            | ``"MGOGs"``.                                         |
+----------------------------+------------------------------------------------------+
| ``load_sequences``         | logical. If ``TRUE``, contig and orf sequences will  |
|                            | be loaded in the SQM object. Setting it to ``FALSE`` |
|                            | will reduce memory usage. Default ``TRUE``.          |
+----------------------------+------------------------------------------------------+
| ``engine``                 | character. Engine used to load the ORFs and contigs  |
|                            | tables. Either ``"data.frame"`` or ``"data.table"``  |
|                            | (significantly faster if your project is large).     |
|                            | Default ``"data.table"``.                            |
+----------------------------+------------------------------------------------------+

Value
~~~~~

SQM object containing the parsed project.
If more than one path is provided in ``project_path``, this function will
return a SQMbunch object instead. The structure of this object is similar
to that of a SQMlite object (see ``loadSQMlite``) but with an extra entry
named ``projects`` that contains one SQM object per input project. SQM and
SQMbunch objects will otherwise behave similarly when used with the subset
and plot functions from this package.

Prerequisites
~~~~~~~~~~~~~

Run SqueezeMeta! An example call for running it would be:

| ``/path/to/SqueezeMeta/scripts/SqueezeMeta.pl``
| ``-m coassembly -f fastq_dir -s samples_file -p project_dir``

The SQM object structure
~~~~~~~~~~~~~~~~~~~~~~~~

The SQM object is a nested list which contains the following information:

+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **lvl1**         | **lvl2**               | **lvl3**          | **type**    | **rows/names** | **columns** | **data**    |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$orfs**        | **$table**             |                   | *dataframe* | orfs           | misc. data  | misc. data  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$abund**             |                   | *numeric    | orfs           | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$bases**             |                   | *numeric    | orfs           | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cov**               |                   | *numeric    | orfs           | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cpm**               |                   | *numeric    | orfs           | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tpm**               |                   | *numeric    | orfs           | samples     | tpm         |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$seqs**              |                   | *character  | orfs           | (n/a)       | sequences   |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax**               |                   | *character  | orfs           | tax. ranks  | taxonomy    |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax16S**            |                   | *character  | orfs           | (n/a)       | 16S rRNA    |
|                  |                        |                   | vector*     |                |             | taxonomy    |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_abund**         |                   | See         |                |             |             |
|                  |                        |                   | SQM$taxa    |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$markers**           |                   | *list*      | orfs           | (n/a)       | CheckM1     |
|                  |                        |                   |             |                |             | markers     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$contigs**     | **$table**             |                   | *dataframe* | contigs        | misc. data  | misc. data  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$abund**             |                   | *numeric    | contigs        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$bases**             |                   | *numeric    | contigs        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cov**               |                   | *numeric    | contigs        | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cpm**               |                   | *numeric    | contigs        | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tpm**               |                   | *numeric    | contigs        | samples     | tpm         |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$seqs**              |                   | *character  | contigs        | (n/a)       | sequences   |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax**               |                   | *character  | contigs        | tax. ranks  | taxonomy    |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_abund**         |                   | See         |                |             |             |
|                  |                        |                   | SQM$taxa    |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$bins**              |                   | *character  | contigs        | bin.        | bins        |
|                  |                        |                   | matrix*     |                | methods     |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$bins**        | **$table**             |                   | *dataframe* | bins           | misc. data  | misc. data  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$length**            |                   | *numeric    | bins           | (n/a)       | length      |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$abund**             |                   | *numeric    | bins           | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$percent**           |                   | *numeric    | bins           | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$bases**             |                   | *numeric    | bins           | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cov**               |                   | *numeric    | bins           | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$cpm**               |                   | *numeric    | bins           | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax**               |                   | *character  | bins           | tax. ranks  | taxonomy    |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_abund**         |                   | See         |                |             |             |
|                  |                        |                   | SQM$taxa    |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_gtdb**          |                   | *character  | bins           | tax. ranks  | GTDB        |
|                  |                        |                   | matrix*     |                |             | taxonomy    |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_abund_gtdb**    |                   | See         |                |             |             |
|                  |                        |                   | SQM$taxa    |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$taxa**        | **$superkingdom**      | **$abund**        | *numeric    | superkingdoms  | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | superkingdoms  | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$phylum**            | **$abund**        | *numeric    | phyla          | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | phyla          | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$class**             | **$abund**        | *numeric    | classes        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | classes        | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$order**             | **$abund**        | *numeric    | orders         | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | orders         | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$family**            | **$abund**        | *numeric    | families       | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | families       | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$genus**             | **$abund**        | *numeric    | genera         | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | genera         | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$species**           | **$abund**        | *numeric    | species        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$percent**      | *numeric    | species        | samples     | percentages |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$functions**   | **$KEGG**              | **$abund**        | *numeric    | KEGG ids       | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$bases**        | *numeric    | KEGG ids       | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cov**          | *numeric    | KEGG ids       | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cpm**          | *numeric    | KEGG ids       | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$tpm**          | *numeric    | KEGG ids       | samples     | tpm         |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$copy_number**  | *numeric    | KEGG ids       | samples     | avg. copies |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$COG**               | **$abund**        | *numeric    | COG ids        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$bases**        | *numeric    | COG ids        | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cov**          | *numeric    | COG ids        | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cpm**          | *numeric    | COG ids        | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$tpm**          | *numeric    | COG ids        | samples     | tpm         |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$copy_number**  | *numeric    | COG ids        | samples     | avg. copies |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$PFAM**              | **$abund**        | *numeric    | PFAM ids       | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (reads)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$bases**        | *numeric    | PFAM ids       | samples     | abundances  |
|                  |                        |                   | matrix*     |                |             | (bases)     |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cov**          | *numeric    | PFAM ids       | samples     | coverages   |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$cpm**          | *numeric    | PFAM ids       | samples     | covs. /     |
|                  |                        |                   | matrix*     |                |             | 10^6 reads  |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$tpm**          | *numeric    | PFAM ids       | samples     | tpm         |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$copy_number**  | *numeric    | PFAM ids       | samples     | avg. copies |
|                  |                        |                   | matrix*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$total_reads** |                        |                   | *numeric    | samples        | (n/a)       | total reads |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
| **$misc**        | **$project_name**      |                   | *character  | (empty)        | (n/a)       | project     |
|                  |                        |                   | vector*     |                |             | name        |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$samples**           |                   | *character  | (empty)        | (n/a)       | samples     |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_names_long**    | **$superkingdom** | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$phylum**       | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$class**        | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$order**        | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$family**       | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$genus**        | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  |                        | **$species**      | *character  | short names    | (n/a)       | full names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$tax_names_short**   |                   | *character  | full names     | (n/a)       | short names |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$KEGG_names**        |                   | *character  | KEGG ids       | (n/a)       | KEGG names  |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$KEGG_paths**        |                   | *character  | KEGG ids       | (n/a)       | KEGG        |
|                  |                        |                   | vector*     |                |             | hierarchy   |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$COG_names**         |                   | *character  | COG ids        | (n/a)       | COG names   |
|                  |                        |                   | vector*     |                |             |             |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$COG_paths**         |                   | *character  | COG ids        | (n/a)       | COG         |
|                  |                        |                   | vector*     |                |             | hierarchy   |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+
|                  | **$ext_annot_sources** |                   | *character  | COG ids        | (n/a)       | external    |
|                  |                        |                   | vector*     |                |             | databases   |
+------------------+------------------------+-------------------+-------------+----------------+-------------+-------------+

If external databases for functional classification were provided to
SqueezeMeta via the ``-extdb`` argument, the corresponding abundance
(reads and bases), coverage, tpm and copy number profiles will be
present in ``SQM$functions`` (e.g.
results for the CAZy database would be present in
``SQM$functions$CAZy``). Additionally, the extended names of the
features present in the external database will be present in
``SQM$misc`` (e.g. ``SQM$misc$CAZy_names``).

Examples
~~~~~~~~

.. code:: R

   library(SQMtools)

   ## Not run:
   ## (outside R)
   ## Run SqueezeMeta on the test data.
   ##   /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
   ## Now go into R.
   Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
   ## End(Not run)

   data(Hadza) # We will illustrate the structure of the SQM object on the test data

   # Which are the ten most abundant KEGG IDs in our data?
   # Drop "Unclassified" first so that exactly ten real KEGG IDs are returned.
   keggTPM = rowSums(Hadza$functions$KEGG$tpm)
   keggTPM = keggTPM[names(keggTPM) != "Unclassified"]
   topKEGG = names(sort(keggTPM, decreasing=TRUE))[1:10]

   # Which functions do those KEGG IDs represent?
   Hadza$misc$KEGG_names[topKEGG]

   # What is the relative abundance of the Negativicutes class across samples?
   Hadza$taxa$class$percent["Negativicutes",]

   # Which information is stored in the orf, contig and bin tables?
   colnames(Hadza$orfs$table)
   colnames(Hadza$contigs$table)
   colnames(Hadza$bins$table)

   # What is the GC content distribution of my metagenome?
   boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
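
The last example notes that the GC boxplot is not weighted by contig
length or abundance. As a minimal sketch of what such a weighting could
look like, the block below builds a small toy list (``toy``, made-up
numbers, not a real SQM object) that mimics ``$contigs$table`` and
``$contigs$abund``, and summarises GC content with base R's
``weighted.mean``. The ``Length`` column name is an assumption made for
illustration only; check ``colnames(Hadza$contigs$table)`` for the column
names in an actual project.

.. code:: R

   # Toy stand-in for an SQM object: only the pieces used below are mocked.
   # All values are invented for illustration.
   toy <- list(
     contigs = list(
       table = data.frame(
         `GC perc` = c(40, 60, 50),
         Length    = c(1000, 3000, 1000),  # hypothetical column name
         check.names = FALSE,
         row.names = c("contig_1", "contig_2", "contig_3")
       ),
       abund = matrix(
         c(10, 30, 10,   # reads per contig in sample S1
           20, 20, 10),  # reads per contig in sample S2
         ncol = 2,
         dimnames = list(c("contig_1", "contig_2", "contig_3"), c("S1", "S2"))
       )
     )
   )

   gc  <- toy$contigs$table[, "GC perc"]
   len <- toy$contigs$table[, "Length"]

   # Unweighted mean: every contig counts equally, whatever its size.
   gc_plain <- mean(gc)

   # Length-weighted mean: longer contigs contribute proportionally more.
   gc_by_length <- weighted.mean(gc, w = len)

   # Abundance-weighted mean, one value per sample (columns of $abund).
   gc_by_sample <- apply(toy$contigs$abund, 2,
                         function(w) weighted.mean(gc, w = w))

   print(gc_plain)
   print(gc_by_length)
   print(gc_by_sample)

With a real project the same expressions could be applied to
``Hadza$contigs$table`` and ``Hadza$contigs$abund``, provided the column
names match. Read counts also scale with contig length, so the
abundance-weighted value is only a rough per-sample summary.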