loadSQM    R Documentation
Load a SqueezeMeta project into R
Description
This function takes the path to a project directory generated by
SqueezeMeta (whose name is
specified in the -p parameter of the SqueezeMeta.pl script) and
parses the results into a SQM object. Alternatively, it can load the
project data from a zip file produced by sqm2zip.py.
Usage
loadSQM(
project_path,
tax_mode = "prokfilter",
tax_source = "contigs",
trusted_functions_only = FALSE,
single_copy_genes = "MGOGs",
load_sequences = TRUE,
engine = "data.table"
)
Arguments
project_path: character, a vector of project directories generated
    by SqueezeMeta, and/or zip files generated by sqm2zip.py.
tax_mode: character, which taxonomic classification should be
    loaded? SqueezeMeta applies the identity thresholds
    described in Luo et al., 2014 (default "prokfilter").
tax_source: character, source data used for the taxonomy tables
    (default "contigs").
trusted_functions_only: logical (default FALSE).
single_copy_genes: character, source of single copy genes for copy
    number normalization (default "MGOGs").
load_sequences: logical (default TRUE).
engine: character. Engine used to load the ORFs and contigs
    tables (default "data.table").
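As an illustration of how these arguments fit together, the following is a minimal sketch of a call that loads two projects at once. The paths are hypothetical placeholders, and the comment on load_sequences reflects an assumption based on the argument name, since its description is not given above. The call is guarded so that nothing is attempted unless SQMtools is installed and both inputs exist.

```r
# Hypothetical inputs: one project directory and one zip file from sqm2zip.py.
project_paths <- c("/path/to/project1", "/path/to/project2.zip")

# Only attempt the load if SQMtools is installed and both inputs exist.
can_load <- requireNamespace("SQMtools", quietly = TRUE) &&
  all(file.exists(project_paths))

if (can_load) {
  # Two paths are given, so a SQMbunch object is expected back.
  projs <- SQMtools::loadSQM(
    project_paths,
    tax_mode = "prokfilter",   # default shown in Usage
    tax_source = "contigs",    # default shown in Usage
    load_sequences = FALSE,    # assumption: skips loading the sequences
    engine = "data.table"      # default shown in Usage
  )
  print(names(projs$projects))
}
```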
Value
SQM object containing the parsed project. If more than one path is
provided in project_path, this function will return a SQMbunch object
instead. The structure of this object is similar to that of a SQMlite
object (see loadSQMlite) but with an extra entry named projects
that contains one SQM object per input project. SQM and SQMbunch objects
will otherwise behave similarly when used with the subset and plot
functions from this package.
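Because the return type depends on how many paths were given, downstream code may want to handle both cases uniformly. The sketch below uses mock objects rather than real loadSQM() output, and it assumes that the S3 class names are "SQM" and "SQMbunch" and that the projects entry is a list named by project. The helpers mock_sqm and as_project_list are hypothetical and not part of SQMtools.

```r
# Mock constructor: a stand-in for a loaded project, NOT a real SQM object.
mock_sqm <- function(name) {
  structure(list(misc = list(project_name = name)), class = "SQM")
}

single <- mock_sqm("projA")
bunch <- structure(
  list(projects = list(projA = mock_sqm("projA"), projB = mock_sqm("projB"))),
  class = "SQMbunch"  # assumed class name
)

# Return a named list of per-project SQM objects for either input type.
as_project_list <- function(x) {
  if (inherits(x, "SQMbunch")) {
    return(x$projects)
  }
  if (inherits(x, "SQM")) {
    return(setNames(list(x), x$misc$project_name))
  }
  stop("Expected an SQM or SQMbunch object")
}

names(as_project_list(single))
names(as_project_list(bunch))
```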
Prerequisites
Run SqueezeMeta! An example call for running it would be:
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir
The SQM object structure
The SQM object is a nested list which contains the following information:
lvl1         | lvl2               | lvl3          | type             | rows/names    | columns      | data
$orfs        | $table             |               | dataframe        | orfs          | misc. data   | misc. data
             | $abund             |               | numeric matrix   | orfs          | samples      | abundances (reads)
             | $bases             |               | numeric matrix   | orfs          | samples      | abundances (bases)
             | $cov               |               | numeric matrix   | orfs          | samples      | coverages
             | $cpm               |               | numeric matrix   | orfs          | samples      | covs. / 10^6 reads
             | $tpm               |               | numeric matrix   | orfs          | samples      | tpm
             | $seqs              |               | character vector | orfs          | (n/a)        | sequences
             | $tax               |               | character matrix | orfs          | tax. ranks   | taxonomy
             | $tax16S            |               | character vector | orfs          | (n/a)        | 16S rRNA taxonomy
             | $tax_abund         |               | See SQM$taxa
             | $markers           |               | list             | orfs          | (n/a)        | CheckM1 markers
$contigs     | $table             |               | dataframe        | contigs       | misc. data   | misc. data
             | $abund             |               | numeric matrix   | contigs       | samples      | abundances (reads)
             | $bases             |               | numeric matrix   | contigs       | samples      | abundances (bases)
             | $cov               |               | numeric matrix   | contigs       | samples      | coverages
             | $cpm               |               | numeric matrix   | contigs       | samples      | covs. / 10^6 reads
             | $tpm               |               | numeric matrix   | contigs       | samples      | tpm
             | $seqs              |               | character vector | contigs       | (n/a)        | sequences
             | $tax               |               | character matrix | contigs       | tax. ranks   | taxonomies
             | $tax_abund         |               | See SQM$taxa
             | $bins              |               | character matrix | contigs       | bin. methods | bins
$bins        | $table             |               | dataframe        | bins          | misc. data   | misc. data
             | $length            |               | numeric vector   | bins          | (n/a)        | length
             | $abund             |               | numeric matrix   | bins          | samples      | abundances (reads)
             | $percent           |               | numeric matrix   | bins          | samples      | abundances (reads)
             | $bases             |               | numeric matrix   | bins          | samples      | abundances (bases)
             | $cov               |               | numeric matrix   | bins          | samples      | coverages
             | $cpm               |               | numeric matrix   | bins          | samples      | covs. / 10^6 reads
             | $tax               |               | character matrix | bins          | tax. ranks   | taxonomy
             | $tax_abund         |               | See SQM$taxa
             | $tax_gtdb          |               | character matrix | bins          | tax. ranks   | GTDB taxonomy
             | $tax_abund_gtdb    |               | See SQM$taxa
$taxa        | $superkingdom      | $abund        | numeric matrix   | superkingdoms | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | superkingdoms | samples      | percentages
             | $phylum            | $abund        | numeric matrix   | phyla         | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | phyla         | samples      | percentages
             | $class             | $abund        | numeric matrix   | classes       | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | classes       | samples      | percentages
             | $order             | $abund        | numeric matrix   | orders        | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | orders        | samples      | percentages
             | $family            | $abund        | numeric matrix   | families      | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | families      | samples      | percentages
             | $genus             | $abund        | numeric matrix   | genera        | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | genera        | samples      | percentages
             | $species           | $abund        | numeric matrix   | species       | samples      | abundances (reads)
             |                    | $percent      | numeric matrix   | species       | samples      | percentages
$functions   | $KEGG              | $abund        | numeric matrix   | KEGG ids      | samples      | abundances (reads)
             |                    | $bases        | numeric matrix   | KEGG ids      | samples      | abundances (bases)
             |                    | $cov          | numeric matrix   | KEGG ids      | samples      | coverages
             |                    | $cpm          | numeric matrix   | KEGG ids      | samples      | covs. / 10^6 reads
             |                    | $tpm          | numeric matrix   | KEGG ids      | samples      | tpm
             |                    | $copy_number  | numeric matrix   | KEGG ids      | samples      | avg. copies
             | $COG               | $abund        | numeric matrix   | COG ids       | samples      | abundances (reads)
             |                    | $bases        | numeric matrix   | COG ids       | samples      | abundances (bases)
             |                    | $cov          | numeric matrix   | COG ids       | samples      | coverages
             |                    | $cpm          | numeric matrix   | COG ids       | samples      | covs. / 10^6 reads
             |                    | $tpm          | numeric matrix   | COG ids       | samples      | tpm
             |                    | $copy_number  | numeric matrix   | COG ids       | samples      | avg. copies
             | $PFAM              | $abund        | numeric matrix   | PFAM ids      | samples      | abundances (reads)
             |                    | $bases        | numeric matrix   | PFAM ids      | samples      | abundances (bases)
             |                    | $cov          | numeric matrix   | PFAM ids      | samples      | coverages
             |                    | $cpm          | numeric matrix   | PFAM ids      | samples      | covs. / 10^6 reads
             |                    | $tpm          | numeric matrix   | PFAM ids      | samples      | tpm
             |                    | $copy_number  | numeric matrix   | PFAM ids      | samples      | avg. copies
$total_reads |                    |               | numeric vector   | samples       | (n/a)        | total reads
$misc        | $project_name      |               | character vector | (empty)       | (n/a)        | project name
             | $samples           |               | character vector | (empty)       | (n/a)        | samples
             | $tax_names_long    | $superkingdom | character vector | short names   | (n/a)        | full names
             |                    | $phylum       | character vector | short names   | (n/a)        | full names
             |                    | $class        | character vector | short names   | (n/a)        | full names
             |                    | $order        | character vector | short names   | (n/a)        | full names
             |                    | $family       | character vector | short names   | (n/a)        | full names
             |                    | $genus        | character vector | short names   | (n/a)        | full names
             |                    | $species      | character vector | short names   | (n/a)        | full names
             | $tax_names_short   |               | character vector | full names    | (n/a)        | short names
             | $KEGG_names        |               | character vector | KEGG ids      | (n/a)        | KEGG names
             | $KEGG_paths        |               | character vector | KEGG ids      | (n/a)        | KEGG hierarchy
             | $COG_names         |               | character vector | COG ids       | (n/a)        | COG names
             | $COG_paths         |               | character vector | COG ids       | (n/a)        | COG hierarchy
             | $ext_annot_sources |               | character vector | COG ids       | (n/a)        | external databases
If external databases for functional classification were provided to
SqueezeMeta via the -extdb argument, the corresponding abundance
(reads and bases), coverages, tpm and copy number profiles will be
present in SQM$functions (e.g. results for the CAZy database would
be present in SQM$functions$CAZy). Additionally, the extended names
of the features present in the external database will be present in
SQM$misc (e.g. SQM$misc$CAZy_names).
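The layout described above can be explored programmatically. The following is a minimal sketch on a mock nested list, not a real SQM object: the feature ids, names and values are made up, and feature_names is a hypothetical helper. It relies only on the naming pattern stated above (SQM$functions$CAZy paired with SQM$misc$CAZy_names) and returns NULL for any classification without a matching names entry.

```r
# Mock object holding only the entries used here; all values are made up.
sqm <- list(
  functions = list(
    KEGG = list(tpm = matrix(c(5, 1, 3, 2), nrow = 2,
                             dimnames = list(c("kegg_id_1", "kegg_id_2"),
                                             c("S1", "S2")))),
    CAZy = list(tpm = matrix(c(4, 6), nrow = 1,
                             dimnames = list("cazy_fam_1", c("S1", "S2"))))
  ),
  misc = list(
    KEGG_names = c(kegg_id_1 = "hypothetical KEGG function 1",
                   kegg_id_2 = "hypothetical KEGG function 2"),
    CAZy_names = c(cazy_fam_1 = "hypothetical CAZy family 1")
  )
)

# Classifications available in this object (built-in and external alike).
sources <- names(sqm$functions)

# Hypothetical helper: extended names for one classification, or NULL.
feature_names <- function(sqm, db) {
  sqm$misc[[paste0(db, "_names")]]
}

# Report the feature with the highest summed tpm in each classification.
for (db in sources) {
  tpm <- sqm$functions[[db]]$tpm
  top <- names(which.max(rowSums(tpm)))
  cat(db, ":", top, "->", feature_names(sqm, db)[top], "\n")
}
```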
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
## End(Not run)
data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!