loadSQM

R Documentation

Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  trusted_functions_only = FALSE,
  single_copy_genes = "MGOGs",
  load_sequences = TRUE,
  engine = "data.table"
)

Arguments

`project_path`	character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by `sqm2zip.py`.
`tax_mode`	character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use `allfilter` for applying the minimum identity threshold to all taxa, `prokfilter` for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and `nofilter` for applying no thresholds at all (default `prokfilter`).
`trusted_functions_only`	logical. If `TRUE`, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If `FALSE`, best hit annotations will be used (default `FALSE`). This only has an effect if `project_path` is not a zip file and `project_path/results/tables` is not already present.
`single_copy_genes`	character, source of single copy genes for copy number normalization, either `RecA` (COG0468, RecA/RadA), `MGOGs` (COGs for 10 single copy and housekeeping genes, Salazar et al., 2019), `MGKOs` (KOs for 10 single copy and housekeeping genes, Salazar et al., 2019) or `USiCGs` (KOs for 15 single copy genes, Carr et al., 2013, Table S1). For `MGOGs`, `MGKOs` and `USiCGs`, the median coverage of a set of single copy genes will be used for normalization. Default `MGOGs`.
`load_sequences`	logical. If `TRUE`, contig and orf sequences will be loaded in the SQM object. Setting it to `FALSE` will reduce memory usage. Default `TRUE`.
`engine`	character. Engine used to load the ORFs and contigs tables. Either `data.frame` or `data.table` (significantly faster if your project is large). Default `data.table`.
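
As a hedged illustration of the arguments above, the call below spells out every argument with a non-default value where one exists. The directory name `Hadza` is a placeholder, and the guard around the call is only there so the snippet does nothing when SQMtools or the project directory is not available.

```r
# "Hadza" is a placeholder path; replace it with your own project directory.
project_dir <- "Hadza"

if (requireNamespace("SQMtools", quietly = TRUE) && dir.exists(project_dir)) {
  sqm <- SQMtools::loadSQM(
    project_dir,
    tax_mode               = "allfilter",  # identity thresholds applied to all taxa
    trusted_functions_only = TRUE,         # best hit + best average annotations only
    single_copy_genes      = "RecA",       # normalize copy numbers with COG0468
    load_sequences         = FALSE,        # skip sequences to reduce memory usage
    engine                 = "data.table"
  )
} else {
  sqm <- NULL  # nothing loaded: package or project directory not available
}
```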

Value

SQM object containing the parsed project. If more than one path is provided in project_path, this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite), but with an extra entry named projects that contains one SQM object per input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.
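
A minimal sketch of the multi-project case, assuming two hypothetical inputs (`project_A` and `project_B.zip`) that do not come from this page. The guard keeps the snippet harmless when the package or the inputs are missing.

```r
# Placeholder inputs: a project directory and a zip file produced by sqm2zip.py.
paths <- c("project_A", "project_B.zip")

if (requireNamespace("SQMtools", quietly = TRUE) && all(file.exists(paths))) {
  bunch <- SQMtools::loadSQM(paths)
  print(class(bunch))           # should identify the result as a SQMbunch
  print(names(bunch$projects))  # one SQM object per input project
} else {
  bunch <- NULL  # nothing loaded: package or inputs not available
}
```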

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir

The SQM object structure

The SQM object is a nested list which contains the following information:

lvl1	lvl2	lvl3	type	rows/names	columns	data
$orfs	$table		dataframe	orfs	misc. data	misc. data
	$abund		numeric matrix	orfs	samples	abundances (reads)
	$bases		numeric matrix	orfs	samples	abundances (bases)
	$cov		numeric matrix	orfs	samples	coverages
	$cpm		numeric matrix	orfs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	orfs	samples	tpm
	$seqs		character vector	orfs	(n/a)	sequences
	$tax		character matrix	orfs	tax. ranks	taxonomy
	$tax16S		character vector	orfs	(n/a)	16S rRNA taxonomy
	$markers		list	orfs	(n/a)	CheckM1 markers
$contigs	$table		dataframe	contigs	misc. data	misc. data
	$abund		numeric matrix	contigs	samples	abundances (reads)
	$bases		numeric matrix	contigs	samples	abundances (bases)
	$cov		numeric matrix	contigs	samples	coverages
	$cpm		numeric matrix	contigs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	contigs	samples	tpm
	$seqs		character vector	contigs	(n/a)	sequences
	$tax		character matrix	contigs	tax. ranks	taxonomies
	$bins		character matrix	contigs	bin. methods	bins
$bins	$table		dataframe	bins	misc. data	misc. data
	$length		numeric vector	bins	(n/a)	length
	$abund		numeric matrix	bins	samples	abundances (reads)
	$percent		numeric matrix	bins	samples	abundances (reads)
	$bases		numeric matrix	bins	samples	abundances (bases)
	$cov		numeric matrix	bins	samples	coverages
	$cpm		numeric matrix	bins	samples	covs. / 10^6 reads
	$tax		character matrix	bins	tax. ranks	taxonomy
	$tax_gtdb		character matrix	bins	tax. ranks	GTDB taxonomy
$taxa	$superkingdom	$abund	numeric matrix	superkingdoms	samples	abundances (reads)
		$percent	numeric matrix	superkingdoms	samples	percentages
	$phylum	$abund	numeric matrix	phyla	samples	abundances (reads)
		$percent	numeric matrix	phyla	samples	percentages
	$class	$abund	numeric matrix	classes	samples	abundances (reads)
		$percent	numeric matrix	classes	samples	percentages
	$order	$abund	numeric matrix	orders	samples	abundances (reads)
		$percent	numeric matrix	orders	samples	percentages
	$family	$abund	numeric matrix	families	samples	abundances (reads)
		$percent	numeric matrix	families	samples	percentages
	$genus	$abund	numeric matrix	genera	samples	abundances (reads)
		$percent	numeric matrix	genera	samples	percentages
	$species	$abund	numeric matrix	species	samples	abundances (reads)
		$percent	numeric matrix	species	samples	percentages
$functions	$KEGG	$abund	numeric matrix	KEGG ids	samples	abundances (reads)
		$bases	numeric matrix	KEGG ids	samples	abundances (bases)
		$cov	numeric matrix	KEGG ids	samples	coverages
		$cpm	numeric matrix	KEGG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	KEGG ids	samples	tpm
		$copy_number	numeric matrix	KEGG ids	samples	avg. copies
	$COG	$abund	numeric matrix	COG ids	samples	abundances (reads)
		$bases	numeric matrix	COG ids	samples	abundances (bases)
		$cov	numeric matrix	COG ids	samples	coverages
		$cpm	numeric matrix	COG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	COG ids	samples	tpm
		$copy_number	numeric matrix	COG ids	samples	avg. copies
	$PFAM	$abund	numeric matrix	PFAM ids	samples	abundances (reads)
		$bases	numeric matrix	PFAM ids	samples	abundances (bases)
		$cov	numeric matrix	PFAM ids	samples	coverages
		$cpm	numeric matrix	PFAM ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	PFAM ids	samples	tpm
		$copy_number	numeric matrix	PFAM ids	samples	avg. copies
$total_reads			numeric vector	samples	(n/a)	total reads
$misc	$project_name		character vector	(empty)	(n/a)	project name
	$samples		character vector	(empty)	(n/a)	samples
	$tax_names_long	$superkingdom	character vector	short names	(n/a)	full names
		$phylum	character vector	short names	(n/a)	full names
		$class	character vector	short names	(n/a)	full names
		$order	character vector	short names	(n/a)	full names
		$family	character vector	short names	(n/a)	full names
		$genus	character vector	short names	(n/a)	full names
		$species	character vector	short names	(n/a)	full names
	$tax_names_short		character vector	full names	(n/a)	short names
	$KEGG_names		character vector	KEGG ids	(n/a)	KEGG names
	$KEGG_paths		character vector	KEGG ids	(n/a)	KEGG hierarchy
	$COG_names		character vector	COG ids	(n/a)	COG names
	$COG_paths		character vector	COG ids	(n/a)	COG hierarchy
	$ext_annot_sources		character vector	COG ids	(n/a)	external databases
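
To make the `$copy_number` rows more concrete, here is a toy sketch of copy number normalization in base R: per-sample coverages are divided by the median coverage of a set of single copy genes, as described for the `single_copy_genes` argument. The matrix values, the identifiers chosen as single copy genes, and the exact arithmetic are assumptions for illustration, not the package's implementation.

```r
# Toy coverage matrix: functions in rows, samples in columns (made-up values).
cov <- matrix(
  c(10, 20,
    12, 18,
     8, 22,
    30, 80),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("scgA", "scgB", "scgC", "geneX"), c("S1", "S2"))
)

# Hypothetical set of single copy genes used as the normalization reference.
single_copy <- c("scgA", "scgB", "scgC")

# Median coverage of the single copy genes in each sample.
scg_median <- apply(cov[single_copy, , drop = FALSE], 2, median)

# Divide each sample (column) by its reference coverage.
copy_number <- sweep(cov, 2, scg_median, "/")
print(copy_number)
```

In this toy data the reference medians are 10 and 20, so `geneX` comes out at 3 copies in `S1` and 4 copies in `S2`.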

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
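
One way to check which functional classifications an object carries, and whether each has a matching names entry in `$misc`, is sketched below. The helper `describe_function_sources` and the mock object are hypothetical and only mimic the structure described above; they are not part of SQMtools.

```r
# Report each functional classification in an SQM-like list and whether
# a matching "<source>_names" entry exists in $misc.
describe_function_sources <- function(sqm) {
  sources <- names(sqm$functions)
  data.frame(
    source    = sources,
    has_names = paste0(sources, "_names") %in% names(sqm$misc),
    stringsAsFactors = FALSE
  )
}

# Mock object with made-up values, shaped like the structure described above.
mock <- list(
  functions = list(
    KEGG = list(abund = matrix(1, 1, 1)),
    CAZy = list(abund = matrix(1, 1, 1))
  ),
  misc = list(
    KEGG_names = c(K00001 = "made-up name"),
    CAZy_names = c(GH1 = "made-up name")
  )
)

print(describe_function_sources(mock))
```

With a real project, the same function could be called on the object returned by `loadSQM`.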

Examples

## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
 /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.

## End(Not run)

data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
keggTPM = rowSums(Hadza$functions$KEGG$tpm)
keggTPM = keggTPM[names(keggTPM)!="Unclassified"] # Drop unclassified ORFs before ranking
topKEGG = head(names(sort(keggTPM, decreasing=TRUE)), 10)
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
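
As the comment on the boxplot notes, that distribution is not weighted. A minimal sketch of a length-weighted GC mean follows, using a mock contigs table with made-up values. The column name `Length` is an assumption here; check `colnames(Hadza$contigs$table)` for the actual name in your project.

```r
# Mock contigs table with made-up values; "Length" is an assumed column name.
contigs <- data.frame(
  "GC perc" = c(40, 60, 50),
  "Length"  = c(1000, 3000, 1000),
  check.names = FALSE
)

# Unweighted mean treats every contig equally.
gc_mean <- mean(contigs[, "GC perc"])

# Length-weighted mean gives longer contigs proportionally more influence.
gc_weighted <- weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])

print(c(unweighted = gc_mean, length_weighted = gc_weighted))
```

For the mock table the unweighted mean is 50 and the length-weighted mean is 54, because the longest contig has the highest GC content.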