loadSQM

R Documentation

Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  tax_source = "contigs",
  trusted_functions_only = FALSE,
  single_copy_genes = "MGOGs",
  load_sequences = TRUE,
  engine = "data.table"
)

Arguments

`project_path`	character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by `sqm2zip.py`.
`tax_mode`	character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use `allfilter` for applying the minimum identity threshold to all taxa, `prokfilter` for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and `nofilter` for applying no thresholds at all (default `prokfilter`).
`tax_source`	character, source data used for the taxonomy tables present in `SQM$taxa`, either `"orfs"`, `"contigs"`, `"bins"` (GTDB bin taxonomy if available, SQM bin taxonomy otherwise), `"bins_gtdb"` (GTDB bin taxonomy) or `"bins_sqm"` (SQM bin taxonomy). Default `"contigs"`.
`trusted_functions_only`	logical. If `TRUE`, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If `FALSE`, best hit annotations will be used (default `FALSE`). Will only have an effect if `project_path` is not a zip file, and `project_path/results/tables` is not already present.
`single_copy_genes`	character, source of single copy genes for copy number normalization, either `"RecA"` (COG0468, RecA/RadA), `"MGOGs"` (COGs for 10 single copy and housekeeping genes, Salazar et al., 2019), `"MGKOs"` (KOs for 10 single copy and housekeeping genes, Salazar et al., 2019) or `"USiCGs"` (KOs for 15 single copy genes, Carr et al., 2013, Table S1). For `"MGOGs"`, `"MGKOs"` and `"USiCGs"`, the median coverage of the set of single copy genes will be used for normalization. Default `"MGOGs"`.
`load_sequences`	logical. If `TRUE`, contig and orf sequences will be loaded in the SQM object. Setting it to `FALSE` will reduce memory usage. Default `TRUE`.
`engine`	character. Engine used to load the ORFs and contigs tables. Either `"data.frame"` or `"data.table"` (significantly faster if your project is large). Default `"data.table"`.
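
The copy number normalization selected with `single_copy_genes` can be pictured with a toy example. The sketch below is not the SQMtools implementation: it assumes that the copy number of a function in a sample is its coverage divided by the median coverage of the chosen single copy genes in that sample. The matrix, the gene names and the variable names are all made up for illustration.

```r
# Toy coverage matrix (functions x samples); all values are invented.
cov_mat <- matrix(
  c(10, 20,
    12, 18,
     8, 22,
    30, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("scg1", "scg2", "scg3", "geneX"), c("S1", "S2"))
)

# Hypothetical single copy gene set (a stand-in for e.g. the "MGOGs" set).
scg_ids <- c("scg1", "scg2", "scg3")

# Median coverage of the single copy genes in each sample.
scg_median <- apply(cov_mat[scg_ids, , drop = FALSE], 2, median)

# Assumed normalization: divide every column by its per-sample median.
copy_number <- sweep(cov_mat, 2, scg_median, "/")

print(round(copy_number, 2))
```

Using a median over several genes, rather than a single gene such as RecA, makes the normalization factor less sensitive to one gene having an unusual coverage in a given sample.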

Value

SQM object containing the parsed project. If more than one path is provided in project_path, this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite) but with an extra entry named projects that contains one SQM object per input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir

The SQM object structure

The SQM object is a nested list which contains the following information:

lvl1	lvl2	lvl3	type	rows/names	columns	data
$orfs	$table		dataframe	orfs	misc. data	misc. data
	$abund		numeric matrix	orfs	samples	abundances (reads)
	$bases		numeric matrix	orfs	samples	abundances (bases)
	$cov		numeric matrix	orfs	samples	coverages
	$cpm		numeric matrix	orfs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	orfs	samples	tpm
	$seqs		character vector	orfs	(n/a)	sequences
	$tax		character matrix	orfs	tax. ranks	taxonomy
	$tax16S		character vector	orfs	(n/a)	16S rRNA taxonomy
	$tax_abund		See SQM$taxa
	$markers		list	orfs	(n/a)	CheckM1 markers
$contigs	$table		dataframe	contigs	misc. data	misc. data
	$abund		numeric matrix	contigs	samples	abundances (reads)
	$bases		numeric matrix	contigs	samples	abundances (bases)
	$cov		numeric matrix	contigs	samples	coverages
	$cpm		numeric matrix	contigs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	contigs	samples	tpm
	$seqs		character vector	contigs	(n/a)	sequences
	$tax		character matrix	contigs	tax. ranks	taxonomy
	$tax_abund		See SQM$taxa
	$bins		character matrix	contigs	bin. methods	bins
$bins	$table		dataframe	bins	misc. data	misc. data
	$length		numeric vector	bins	(n/a)	length
	$abund		numeric matrix	bins	samples	abundances (reads)
	$percent		numeric matrix	bins	samples	percentages
	$bases		numeric matrix	bins	samples	abundances (bases)
	$cov		numeric matrix	bins	samples	coverages
	$cpm		numeric matrix	bins	samples	covs. / 10^6 reads
	$tax		character matrix	bins	tax. ranks	taxonomy
	$tax_abund		See SQM$taxa
	$tax_gtdb		character matrix	bins	tax. ranks	GTDB taxonomy
	$tax_abund_gtdb		See SQM$taxa
$taxa	$superkingdom	$abund	numeric matrix	superkingdoms	samples	abundances (reads)
		$percent	numeric matrix	superkingdoms	samples	percentages
	$phylum	$abund	numeric matrix	phyla	samples	abundances (reads)
		$percent	numeric matrix	phyla	samples	percentages
	$class	$abund	numeric matrix	classes	samples	abundances (reads)
		$percent	numeric matrix	classes	samples	percentages
	$order	$abund	numeric matrix	orders	samples	abundances (reads)
		$percent	numeric matrix	orders	samples	percentages
	$family	$abund	numeric matrix	families	samples	abundances (reads)
		$percent	numeric matrix	families	samples	percentages
	$genus	$abund	numeric matrix	genera	samples	abundances (reads)
		$percent	numeric matrix	genera	samples	percentages
	$species	$abund	numeric matrix	species	samples	abundances (reads)
		$percent	numeric matrix	species	samples	percentages
$functions	$KEGG	$abund	numeric matrix	KEGG ids	samples	abundances (reads)
		$bases	numeric matrix	KEGG ids	samples	abundances (bases)
		$cov	numeric matrix	KEGG ids	samples	coverages
		$cpm	numeric matrix	KEGG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	KEGG ids	samples	tpm
		$copy_number	numeric matrix	KEGG ids	samples	avg. copies
	$COG	$abund	numeric matrix	COG ids	samples	abundances (reads)
		$bases	numeric matrix	COG ids	samples	abundances (bases)
		$cov	numeric matrix	COG ids	samples	coverages
		$cpm	numeric matrix	COG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	COG ids	samples	tpm
		$copy_number	numeric matrix	COG ids	samples	avg. copies
	$PFAM	$abund	numeric matrix	PFAM ids	samples	abundances (reads)
		$bases	numeric matrix	PFAM ids	samples	abundances (bases)
		$cov	numeric matrix	PFAM ids	samples	coverages
		$cpm	numeric matrix	PFAM ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	PFAM ids	samples	tpm
		$copy_number	numeric matrix	PFAM ids	samples	avg. copies
$total_reads			numeric vector	samples	(n/a)	total reads
$misc	$project_name		character vector	(empty)	(n/a)	project name
	$samples		character vector	(empty)	(n/a)	samples
	$tax_names_long	$superkingdom	character vector	short names	(n/a)	full names
		$phylum	character vector	short names	(n/a)	full names
		$class	character vector	short names	(n/a)	full names
		$order	character vector	short names	(n/a)	full names
		$family	character vector	short names	(n/a)	full names
		$genus	character vector	short names	(n/a)	full names
		$species	character vector	short names	(n/a)	full names
	$tax_names_short		character vector	full names	(n/a)	short names
	$KEGG_names		character vector	KEGG ids	(n/a)	KEGG names
	$KEGG_paths		character vector	KEGG ids	(n/a)	KEGG hierarchy
	$COG_names		character vector	COG ids	(n/a)	COG names
	$COG_paths		character vector	COG ids	(n/a)	COG hierarchy
	$ext_annot_sources		character vector	(empty)	(n/a)	external databases
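
The `$cpm` entries in the table are described as coverages per 10^6 reads. As a rough illustration of how such a matrix relates to `$cov` and `$total_reads`, the sketch below builds a tiny mock list that mimics part of the structure. The values are invented, and the exact formula used by SQMtools is an assumption based on the table description.

```r
# Mock of a small part of the SQM structure; values are invented.
SQM <- list(
  orfs = list(
    cov = matrix(c(5, 10, 2, 8), nrow = 2,
                 dimnames = list(c("orf1", "orf2"), c("S1", "S2")))
  ),
  total_reads = c(S1 = 2e6, S2 = 4e6)
)

# Assumed definition: coverage divided by millions of reads in each sample.
millions_of_reads <- SQM$total_reads[colnames(SQM$orfs$cov)] / 1e6
cpm <- sweep(SQM$orfs$cov, 2, millions_of_reads, "/")

print(cpm)
```

Indexing `total_reads` by the column names of the coverage matrix keeps the samples aligned even if the two objects list them in a different order.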

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
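
The naming convention for external databases can be used to pair feature abundances with their extended names. The sketch below works on a mock list rather than a real project. `ext_feature_table` is a hypothetical helper (not part of SQMtools), and the feature id, name and abundance are invented for illustration.

```r
# Mock SQM-like list; "CAZy" stands in for a database passed via -extdb.
SQM <- list(
  functions = list(
    CAZy = list(abund = matrix(3, 1, 1, dimnames = list("GH13", "S1")))
  ),
  misc = list(
    CAZy_names = c(GH13 = "glycoside hydrolase family 13"),
    ext_annot_sources = "CAZy"
  )
)

# Hypothetical helper: join abundances in SQM$functions$<db> with the
# extended names stored in SQM$misc$<db>_names.
ext_feature_table <- function(sqm, db) {
  abund <- sqm$functions[[db]]$abund
  feature_names <- sqm$misc[[paste0(db, "_names")]]
  data.frame(id = rownames(abund),
             name = unname(feature_names[rownames(abund)]),
             abund,
             row.names = NULL)
}

for (db in SQM$misc$ext_annot_sources) {
  print(ext_feature_table(SQM, db))
}
```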

Examples

## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
 /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.

## End(Not run)

data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
# Drop "Unclassified" first so that exactly ten KEGG IDs are returned.
keggTPM = rowSums(Hadza$functions$KEGG$tpm)
keggTPM = keggTPM[names(keggTPM)!="Unclassified"]
topKEGG = names(sort(keggTPM, decreasing=TRUE))[1:10]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
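
The boxplot above is flagged as unweighted. As a hedged sketch of what weighting by contig length could look like, the toy example below uses `weighted.mean` from base R. The data frame is invented, and the `Length` column name is an assumption about the contigs table; check `colnames(Hadza$contigs$table)` for the actual name.

```r
# Toy stand-in for Hadza$contigs$table; all values are invented.
# "GC perc" is the column used in the boxplot; "Length" is an assumed name.
contigs <- data.frame(
  "GC perc" = c(40, 60, 50),
  "Length"  = c(1000, 3000, 1000),
  check.names = FALSE
)

# Plain mean: every contig counts the same regardless of its size.
unweighted_gc <- mean(contigs[, "GC perc"])

# Length-weighted mean: longer contigs contribute proportionally more.
weighted_gc <- weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])

print(c(unweighted = unweighted_gc, weighted = weighted_gc))
```

The same call could take per-contig abundances as weights instead of lengths, depending on which bias one wants to correct for.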