loadSQM

R Documentation

Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  trusted_functions_only = FALSE,
  single_copy_genes = "MGOGs",
  load_sequences = TRUE,
  engine = "data.table"
)

Arguments

project_path

character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by sqm2zip.py.

tax_mode

character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use allfilter for applying the minimum identity threshold to all taxa, prokfilter for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and nofilter for applying no thresholds at all (default prokfilter).

trusted_functions_only

logical. If TRUE, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If FALSE, best hit annotations will be used (default FALSE). Will only have an effect if project_path is not a zip file, and project_path/results/tables is not already present.

single_copy_genes

character, source of single copy genes for copy number normalization, either RecA (COG0468, RecA/RadA), MGOGs (COGs for 10 single copy and housekeeping genes, Salazar, G et al., 2019), MGKOs (KOs for 10 single copy and housekeeping genes, Salazar, G et al., 2019) or USiCGs (KOs for 15 single copy genes, Carr et al., 2013, Table S1). For MGOGs, MGKOs and USiCGs, the median coverage of the set of single copy genes will be used for normalization. Default MGOGs.
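The median-based normalization described for MGOGs, MGKOs and USiCGs can be sketched in base R. This is an illustration on made-up coverage values, not the SQMtools implementation: the matrix cov, the feature IDs and the helper copy_number_sketch are all hypothetical.

```r
# Hypothetical coverage matrix: functions (rows) x samples (columns).
cov <- matrix(
  c(10, 20,   # COG_A (treated here as single copy)
    12, 18,   # COG_B (treated here as single copy)
    14, 40,   # COG_C (treated here as single copy)
    36, 10),  # COG_X (function of interest)
  ncol = 2, byrow = TRUE,
  dimnames = list(c("COG_A", "COG_B", "COG_C", "COG_X"), c("S1", "S2"))
)
single_copy <- c("COG_A", "COG_B", "COG_C")  # assumed marker set, for illustration only

# Divide each sample's coverages by the median coverage of the marker set.
copy_number_sketch <- function(cov, markers) {
  ref <- apply(cov[markers, , drop = FALSE], 2, median)
  sweep(cov, 2, ref, "/")
}

cn <- copy_number_sketch(cov, single_copy)
cn["COG_X", ]
```

Using the median rather than a single marker makes the reference less sensitive to one marker gene with an atypical coverage in a given sample.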

load_sequences

logical. If TRUE, contig and orf sequences will be loaded in the SQM object. Setting it to FALSE will reduce memory usage. Default TRUE.

engine

character. Engine used to load the ORFs and contigs tables. Either data.frame or data.table (significantly faster if your project is large). Default data.table.

Value

SQM object containing the parsed project. If more than one path is provided in project_path, this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite) but with an extra entry named projects that contains one SQM object per input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.
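Code that has to accept either return type can branch on the object class and then work on the individual projects. The sketch below uses mock objects rather than real loadSQM output; the class names "SQM" and "SQMbunch", the naming of the projects entry by project name, and the helper as_project_list are assumptions made for illustration.

```r
# Mock objects standing in for loadSQM output; real objects hold far more data.
proj_a <- structure(list(misc = list(project_name = "A")), class = "SQM")
proj_b <- structure(list(misc = list(project_name = "B")), class = "SQM")
bunch  <- structure(list(projects = list(A = proj_a, B = proj_b)),
                    class = "SQMbunch")

# Hypothetical helper: always return a list of single-project objects.
as_project_list <- function(x) {
  if (inherits(x, "SQMbunch")) x$projects else list(x)
}

# Collect the project name stored in each object's misc entry.
project_names <- vapply(as_project_list(bunch),
                        function(p) p$misc$project_name,
                        character(1))
project_names
```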

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir

The SQM object structure

The SQM object is a nested list which contains the following information:

lvl1         lvl2                lvl3           type              rows/names     columns       data
$orfs        $table                             dataframe         orfs           misc. data    misc. data
             $abund                             numeric matrix    orfs           samples       abundances (reads)
             $bases                             numeric matrix    orfs           samples       abundances (bases)
             $cov                               numeric matrix    orfs           samples       coverages
             $cpm                               numeric matrix    orfs           samples       covs. / 10^6 reads
             $tpm                               numeric matrix    orfs           samples       tpm
             $seqs                              character vector  orfs           (n/a)         sequences
             $tax                               character matrix  orfs           tax. ranks    taxonomy
             $tax16S                            character vector  orfs           (n/a)         16S rRNA taxonomy
             $markers                           list              orfs           (n/a)         CheckM1 markers
$contigs     $table                             dataframe         contigs        misc. data    misc. data
             $abund                             numeric matrix    contigs        samples       abundances (reads)
             $bases                             numeric matrix    contigs        samples       abundances (bases)
             $cov                               numeric matrix    contigs        samples       coverages
             $cpm                               numeric matrix    contigs        samples       covs. / 10^6 reads
             $tpm                               numeric matrix    contigs        samples       tpm
             $seqs                              character vector  contigs        (n/a)         sequences
             $tax                               character matrix  contigs        tax. ranks    taxonomies
             $bins                              character matrix  contigs        bin. methods  bins
$bins        $table                             dataframe         bins           misc. data    misc. data
             $length                            numeric vector    bins           (n/a)         length
             $abund                             numeric matrix    bins           samples       abundances (reads)
             $percent                           numeric matrix    bins           samples       abundances (reads)
             $bases                             numeric matrix    bins           samples       abundances (bases)
             $cov                               numeric matrix    bins           samples       coverages
             $cpm                               numeric matrix    bins           samples       covs. / 10^6 reads
             $tax                               character matrix  bins           tax. ranks    taxonomy
             $tax_gtdb                          character matrix  bins           tax. ranks    GTDB taxonomy
$taxa        $superkingdom       $abund         numeric matrix    superkingdoms  samples       abundances (reads)
                                 $percent       numeric matrix    superkingdoms  samples       percentages
             $phylum             $abund         numeric matrix    phyla          samples       abundances (reads)
                                 $percent       numeric matrix    phyla          samples       percentages
             $class              $abund         numeric matrix    classes        samples       abundances (reads)
                                 $percent       numeric matrix    classes        samples       percentages
             $order              $abund         numeric matrix    orders         samples       abundances (reads)
                                 $percent       numeric matrix    orders         samples       percentages
             $family             $abund         numeric matrix    families       samples       abundances (reads)
                                 $percent       numeric matrix    families       samples       percentages
             $genus              $abund         numeric matrix    genera         samples       abundances (reads)
                                 $percent       numeric matrix    genera         samples       percentages
             $species            $abund         numeric matrix    species        samples       abundances (reads)
                                 $percent       numeric matrix    species        samples       percentages
$functions   $KEGG               $abund         numeric matrix    KEGG ids       samples       abundances (reads)
                                 $bases         numeric matrix    KEGG ids       samples       abundances (bases)
                                 $cov           numeric matrix    KEGG ids       samples       coverages
                                 $cpm           numeric matrix    KEGG ids       samples       covs. / 10^6 reads
                                 $tpm           numeric matrix    KEGG ids       samples       tpm
                                 $copy_number   numeric matrix    KEGG ids       samples       avg. copies
             $COG                $abund         numeric matrix    COG ids        samples       abundances (reads)
                                 $bases         numeric matrix    COG ids        samples       abundances (bases)
                                 $cov           numeric matrix    COG ids        samples       coverages
                                 $cpm           numeric matrix    COG ids        samples       covs. / 10^6 reads
                                 $tpm           numeric matrix    COG ids        samples       tpm
                                 $copy_number   numeric matrix    COG ids        samples       avg. copies
             $PFAM               $abund         numeric matrix    PFAM ids       samples       abundances (reads)
                                 $bases         numeric matrix    PFAM ids       samples       abundances (bases)
                                 $cov           numeric matrix    PFAM ids       samples       coverages
                                 $cpm           numeric matrix    PFAM ids       samples       covs. / 10^6 reads
                                 $tpm           numeric matrix    PFAM ids       samples       tpm
                                 $copy_number   numeric matrix    PFAM ids       samples       avg. copies
$total_reads                                    numeric vector    samples        (n/a)         total reads
$misc        $project_name                      character vector  (empty)        (n/a)         project name
             $samples                           character vector  (empty)        (n/a)         samples
             $tax_names_long     $superkingdom  character vector  short names    (n/a)         full names
                                 $phylum        character vector  short names    (n/a)         full names
                                 $class         character vector  short names    (n/a)         full names
                                 $order         character vector  short names    (n/a)         full names
                                 $family        character vector  short names    (n/a)         full names
                                 $genus         character vector  short names    (n/a)         full names
                                 $species       character vector  short names    (n/a)         full names
             $tax_names_short                   character vector  full names     (n/a)         short names
             $KEGG_names                        character vector  KEGG ids       (n/a)         KEGG names
             $KEGG_paths                        character vector  KEGG ids       (n/a)         KEGG hierarchy
             $COG_names                         character vector  COG ids        (n/a)         COG names
             $COG_paths                         character vector  COG ids        (n/a)         COG hierarchy
             $ext_annot_sources                 character vector  COG ids        (n/a)         external databases

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
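One way to find out which external classifications ended up in an object is to compare the names under functions against the three built-in ones. The sketch below runs on a small mock list that only imitates the relevant part of the structure described above; the object sqm, its values and the loop are illustrative assumptions, with CAZy standing in for any database passed via -extdb.

```r
# Mock of the relevant slice of an SQM object (values are made up).
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("GH13", "GT2"), c("S1", "S2")))
sqm <- list(
  functions = list(KEGG = list(), COG = list(), PFAM = list(),
                   CAZy = list(abund = m)),
  misc = list(CAZy_names = c(GH13 = "Glycoside Hydrolase Family 13",
                             GT2  = "GlycosylTransferase Family 2"))
)

# Anything beyond the built-in classifications came from an external database.
builtin  <- c("KEGG", "COG", "PFAM")
external <- setdiff(names(sqm$functions), builtin)

# Pair each feature's total read abundance with its extended name.
for (db in external) {
  abund      <- sqm$functions[[db]]$abund
  full_names <- sqm$misc[[paste0(db, "_names")]]
  print(data.frame(id    = rownames(abund),
                   name  = unname(full_names[rownames(abund)]),
                   total = unname(rowSums(abund))))
}
```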

Examples

## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
 /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.

## End(Not run)

data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!