subsetBins

subsetBins

R Documentation

Create a SQM object containing only the requested bins, and the contigs and ORFs contained in them.

Description

Create a SQM object containing only the requested bins, and the contigs and ORFs contained in them.

Usage

subsetBins(
  SQM,
  bins = NULL,
  rank = NULL,
  tax = NULL,
  min_completeness = NULL,
  max_contamination = NULL,
  tax_source = "bins",
  trusted_functions_only = FALSE,
  ignore_unclassified_functions = FALSE,
  rescale_tpm = TRUE,
  rescale_copy_number = TRUE,
  allow_empty = FALSE
)

Arguments

`SQM`	SQM object to be subsetted.
`bins`	character. Vector of bins to be selected. If provided, will override `rank`, `tax`, `min_completeness` and `max_contamination`.
`rank`	character. The taxonomic rank from which to select the desired taxa (`superkingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`)
`tax`	character. A taxon or vector of taxa to be selected.
`min_completeness`	numeric. Discard bins with completeness lower than this value (default `NULL`).
`max_contamination`	numeric. Discard bins with contamination higher than this value (default `NULL`).
`tax_source`	character, source data used for taxonomic subsetting (if `rank` and `tax` are provided) and for the aggregate taxonomy tables present in `SQM$taxa`, either `"orfs"`, `"contigs"`, `"bins"` (GTDB bin taxonomy if available, SQM bin taxonomy otherwise), `"bins_gtdb"` (GTDB bin taxonomy) or `"bins_sqm"` (SQM bin taxonomy). If using `bins_gtdb`, note that GTDB taxonomy may differ from the NCBI taxonomy used throughout the rest of SqueezeMeta. Default `"bins"`.
`trusted_functions_only`	logical. If `TRUE`, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If `FALSE`, best hit annotations will be used (default `FALSE`).
`ignore_unclassified_functions`	logical. If `FALSE`, ORFs with no functional classification will be aggregated together into an “Unclassified” category. If `TRUE`, they will be ignored (default `FALSE`).
`rescale_tpm`	logical. If `TRUE`, TPMs for KEGGs, COGs, and PFAMs will be recalculated (so that the TPMs in the subset actually add up to 1 million). Otherwise, per-function TPMs will be calculated by aggregating the TPMs of the ORFs annotated with that function, and will thus keep the scaling present in the parent object. By default it is set to `TRUE`, which means that the returned TPMs will be scaled by million of reads of the selected bins.
`rescale_copy_number`	logical. If `TRUE`, copy numbers with be recalculated using the median single-copy gene coverages in the subset. Otherwise, single-copy gene coverages will be taken from the parent object. By default it is set to `TRUE`, which means that the returned copy numbers for each function will represent the average copy number of that function per genome of the selected taxon.
`allow_empty`	(internal use only).

Value

SQM object containing only the requested bins.

See Also

subsetContigs, subsetORFs

Examples

data(Hadza)
# Which are the most complete bins?
topBinNames = rownames(Hadza$bins$table)[order(Hadza$bins$table[,"Completeness"],
                                         decreasing=TRUE)][1:2]
# Subset with the most complete bin.
topBin = subsetBins(Hadza, topBinNames[1])

# Subset with all the bins over 90% completeness
over90 = subsetBins(Hadza, min_completeness = 90)

# Subset with bins from the Phascolarctobacterium genus using SqueezeMeta's taxonomy
phasco = subsetBins(Hadza, tax_source = "bins_sqm", rank = "genus", tax = "Phascolarctobacterium")

# Subset with binsfrom the Bacteroidota phylum using GTDB taxonomy
bact = subsetBins(Hadza, tax_source = "bins_gtdb", rank = "phylum", tax = "Bacteroidota")