loadSQM | R Documentation
Load a SqueezeMeta project into R
Description
This function takes the path to a project directory generated by
SqueezeMeta (whose name
is specified in the -p parameter of the SqueezeMeta.pl script)
and parses the results into a SQM object. Alternatively, it can load
the project data from a zip file produced by sqm2zip.py.
Usage
loadSQM(
project_path,
tax_mode = "prokfilter",
trusted_functions_only = FALSE,
single_copy_genes = "MGOGs",
load_sequences = TRUE,
engine = "data.table"
)
Arguments
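| Argument | Description |
| --- | --- |
| project_path | character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by sqm2zip.py. |
| tax_mode | character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use … Default: "prokfilter". |
| trusted_functions_only | logical. If … Default: FALSE. |
| single_copy_genes | character, source of single copy genes for copy number normalization, either … Default: "MGOGs". |
| load_sequences | logical. If … Default: TRUE. |
| engine | character. Engine used to load the ORFs and contigs tables. Either … Default: "data.table". |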
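As a minimal sketch of a call with every argument written out: the project directory below is a hypothetical placeholder, and the values are the defaults from the Usage section except for load_sequences. Setting load_sequences = FALSE is assumed here to skip loading the ORF and contig sequences. The call is guarded so it only runs when SQMtools is installed and the directory exists.

```r
# Hypothetical project directory; replace with your own SqueezeMeta output.
project_dir <- "/path/to/project_dir"

# Only attempt the load if SQMtools is installed and the directory exists.
if (requireNamespace("SQMtools", quietly = TRUE) && dir.exists(project_dir)) {
  project <- SQMtools::loadSQM(
    project_dir,
    tax_mode = "prokfilter",            # default from the Usage section
    trusted_functions_only = FALSE,     # default from the Usage section
    single_copy_genes = "MGOGs",        # default from the Usage section
    load_sequences = FALSE,             # assumption: skips loading sequences
    engine = "data.table"               # default from the Usage section
  )
}
```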
Value
SQM object containing the parsed project. If more than one path is
provided in project_path this function will return a SQMbunch
object instead. The structure of this object is similar to that of a
SQMlite object (see loadSQMlite) but with an extra entry named
projects that contains one SQM object per input project. SQM and
SQMbunch objects will otherwise behave similarly when used with the
subset and plot functions from this package.
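A hedged sketch of the multi-project case described above: the two directories are hypothetical placeholders, and the guard keeps the call from running unless SQMtools is installed and both directories exist.

```r
# Hypothetical project directories; replace with real SqueezeMeta outputs.
project_dirs <- c("/path/to/project_A", "/path/to/project_B")

if (requireNamespace("SQMtools", quietly = TRUE) && all(dir.exists(project_dirs))) {
  # More than one path is given, so a SQMbunch object is expected back.
  bunch <- SQMtools::loadSQM(project_dirs)
  # The individual SQM objects are stored in the projects entry.
  length(bunch$projects)
}
```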
Prerequisites
Run SqueezeMeta! An example call for running it would be:
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir
The SQM object structure
The SQM object is a nested list which contains the following information:
| lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
| --- | --- | --- | --- | --- | --- | --- |
| $orfs | $table | | dataframe | orfs | misc. data | misc. data |
| | $abund | | numeric matrix | orfs | samples | abundances (reads) |
| | $bases | | numeric matrix | orfs | samples | abundances (bases) |
| | $cov | | numeric matrix | orfs | samples | coverages |
| | $cpm | | numeric matrix | orfs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | orfs | samples | tpm |
| | $seqs | | character vector | orfs | (n/a) | sequences |
| | $tax | | character matrix | orfs | tax. ranks | taxonomy |
| | $tax16S | | character vector | orfs | (n/a) | 16S rRNA taxonomy |
| | $markers | | list | orfs | (n/a) | CheckM1 markers |
| $contigs | $table | | dataframe | contigs | misc. data | misc. data |
| | $abund | | numeric matrix | contigs | samples | abundances (reads) |
| | $bases | | numeric matrix | contigs | samples | abundances (bases) |
| | $cov | | numeric matrix | contigs | samples | coverages |
| | $cpm | | numeric matrix | contigs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | contigs | samples | tpm |
| | $seqs | | character vector | contigs | (n/a) | sequences |
| | $tax | | character matrix | contigs | tax. ranks | taxonomies |
| | $bins | | character matrix | contigs | bin. methods | bins |
| $bins | $table | | dataframe | bins | misc. data | misc. data |
| | $length | | numeric vector | bins | (n/a) | length |
| | $abund | | numeric matrix | bins | samples | abundances (reads) |
| | $percent | | numeric matrix | bins | samples | abundances (reads) |
| | $bases | | numeric matrix | bins | samples | abundances (bases) |
| | $cov | | numeric matrix | bins | samples | coverages |
| | $cpm | | numeric matrix | bins | samples | covs. / 10^6 reads |
| | $tax | | character matrix | bins | tax. ranks | taxonomy |
| | $tax_gtdb | | character matrix | bins | tax. ranks | GTDB taxonomy |
| $taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances (reads) |
| | | $percent | numeric matrix | superkingdoms | samples | percentages |
| | $phylum | $abund | numeric matrix | phyla | samples | abundances (reads) |
| | | $percent | numeric matrix | phyla | samples | percentages |
| | $class | $abund | numeric matrix | classes | samples | abundances (reads) |
| | | $percent | numeric matrix | classes | samples | percentages |
| | $order | $abund | numeric matrix | orders | samples | abundances (reads) |
| | | $percent | numeric matrix | orders | samples | percentages |
| | $family | $abund | numeric matrix | families | samples | abundances (reads) |
| | | $percent | numeric matrix | families | samples | percentages |
| | $genus | $abund | numeric matrix | genera | samples | abundances (reads) |
| | | $percent | numeric matrix | genera | samples | percentages |
| | $species | $abund | numeric matrix | species | samples | abundances (reads) |
| | | $percent | numeric matrix | species | samples | percentages |
| $functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | KEGG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | KEGG ids | samples | coverages |
| | | $cpm | numeric matrix | KEGG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | KEGG ids | samples | tpm |
| | | $copy_number | numeric matrix | KEGG ids | samples | avg. copies |
| | $COG | $abund | numeric matrix | COG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | COG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | COG ids | samples | coverages |
| | | $cpm | numeric matrix | COG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | COG ids | samples | tpm |
| | | $copy_number | numeric matrix | COG ids | samples | avg. copies |
| | $PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) |
| | | $bases | numeric matrix | PFAM ids | samples | abundances (bases) |
| | | $cov | numeric matrix | PFAM ids | samples | coverages |
| | | $cpm | numeric matrix | PFAM ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | PFAM ids | samples | tpm |
| | | $copy_number | numeric matrix | PFAM ids | samples | avg. copies |
| $total_reads | | | numeric vector | samples | (n/a) | total reads |
| $misc | $project_name | | character vector | (empty) | (n/a) | project name |
| | $samples | | character vector | (empty) | (n/a) | samples |
| | $tax_names_long | $superkingdom | character vector | short names | (n/a) | full names |
| | | $phylum | character vector | short names | (n/a) | full names |
| | | $class | character vector | short names | (n/a) | full names |
| | | $order | character vector | short names | (n/a) | full names |
| | | $family | character vector | short names | (n/a) | full names |
| | | $genus | character vector | short names | (n/a) | full names |
| | | $species | character vector | short names | (n/a) | full names |
| | $tax_names_short | | character vector | full names | (n/a) | short names |
| | $KEGG_names | | character vector | KEGG ids | (n/a) | KEGG names |
| | $KEGG_paths | | character vector | KEGG ids | (n/a) | KEGG hierarchy |
| | $COG_names | | character vector | COG ids | (n/a) | COG names |
| | $COG_paths | | character vector | COG ids | (n/a) | COG hierarchy |
| | $ext_annot_sources | | character vector | COG ids | (n/a) | external databases |
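To illustrate how the levels in the table map onto `$` access, here is a small sketch using a toy nested list. The object `toy`, its ORF, sample and taxon names, and all of its values are invented for illustration; a real SQM object has many more entries.

```r
# Toy stand-in for an SQM object (hypothetical values, two samples).
toy <- list(
  orfs = list(
    abund = matrix(c(10, 0, 5, 20), nrow = 2,
                   dimnames = list(c("orf_1", "orf_2"), c("S1", "S2")))
  ),
  taxa = list(
    class = list(
      percent = matrix(c(60, 40, 25, 75), nrow = 2,
                       dimnames = list(c("Clostridia", "Bacilli"), c("S1", "S2")))
    )
  ),
  total_reads = c(S1 = 1000, S2 = 2000)
)

# lvl1 -> lvl2: ORF read counts, rows are ORFs and columns are samples.
dim(toy$orfs$abund)

# lvl1 -> lvl2 -> lvl3: percentage of one class across samples.
toy$taxa$class$percent["Bacilli", ]

# lvl1 only: named numeric vector, indexed by sample name.
toy$total_reads[colnames(toy$orfs$abund)]
```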
If external databases for functional classification were provided to
SqueezeMeta via the -extdb argument, the corresponding abundance
(reads and bases), coverages, tpm and copy number profiles will be
present in SQM$functions (e.g. results for the CAZy database
would be present in SQM$functions$CAZy). Additionally, the
extended names of the features present in the external database will
be present in SQM$misc (e.g. SQM$misc$CAZy_names).
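The sketch below shows how such an external database entry could be queried, again on a toy list rather than a real project. The object `toy_ext`, its counts, and the placeholder feature names are all hypothetical; only the `functions$CAZy` and `misc$CAZy_names` layout follows the description above.

```r
# Toy stand-in for an SQM object loaded from a project run with -extdb.
toy_ext <- list(
  functions = list(
    CAZy = list(
      abund = matrix(c(3, 7, 1, 9), nrow = 2,
                     dimnames = list(c("GH13", "GT2"), c("S1", "S2")))
    )
  ),
  misc = list(
    CAZy_names = c(GH13 = "placeholder name A", GT2 = "placeholder name B")
  )
)

# Most abundant CAZy feature across all samples (by read counts).
top_feature <- names(sort(rowSums(toy_ext$functions$CAZy$abund),
                          decreasing = TRUE))[1]

# Look up its extended name in the misc entry.
toy_ext$misc$CAZy_names[top_feature]
```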
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
## End(Not run)
data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!