**********
subsetBins
**********

========== ===============
subsetBins R Documentation
========== ===============

Create a SQM object containing only the requested bins, and the contigs and ORFs contained in them.
---------------------------------------------------------------------------------------------------

Description
~~~~~~~~~~~

Create a SQM object containing only the requested bins, and the contigs
and ORFs contained in them.

Usage
~~~~~

.. code:: R

   subsetBins(
     SQM,
     bins = NULL,
     rank = NULL,
     tax = NULL,
     min_completeness = NULL,
     max_contamination = NULL,
     tax_source = "bins",
     trusted_functions_only = FALSE,
     ignore_unclassified_functions = FALSE,
     rescale_tpm = TRUE,
     rescale_copy_number = TRUE,
     allow_empty = FALSE
   )

Arguments
~~~~~~~~~

+-----------------------------------+----------------------------------+
| ``SQM``                           | SQM object to be subsetted.      |
+-----------------------------------+----------------------------------+
| ``bins``                          | character. Vector of bins to be  |
|                                   | selected. If provided, will      |
|                                   | override ``rank``, ``tax``,      |
|                                   | ``min_completeness`` and         |
|                                   | ``max_contamination``.           |
+-----------------------------------+----------------------------------+
| ``rank``                          | character. The taxonomic rank    |
|                                   | from which to select the desired |
|                                   | taxa (``superkingdom``,          |
|                                   | ``phylum``, ``class``,           |
|                                   | ``order``, ``family``,           |
|                                   | ``genus``, ``species``).         |
+-----------------------------------+----------------------------------+
| ``tax``                           | character. A taxon or vector of  |
|                                   | taxa to be selected.             |
+-----------------------------------+----------------------------------+
| ``min_completeness``              | numeric. Discard bins with       |
|                                   | completeness lower than this     |
|                                   | value (default ``NULL``).        |
+-----------------------------------+----------------------------------+
| ``max_contamination``             | numeric. Discard bins with       |
|                                   | contamination higher than this   |
|                                   | value (default ``NULL``).        |
+-----------------------------------+----------------------------------+
| ``tax_source``                    | character. Source data used for  |
|                                   | taxonomic subsetting (if         |
|                                   | ``rank`` and ``tax`` are         |
|                                   | provided) and for the aggregate  |
|                                   | taxonomy tables present in       |
|                                   | ``SQM$taxa``, either ``"orfs"``, |
|                                   | ``"contigs"``, ``"bins"`` (GTDB  |
|                                   | bin taxonomy if available, SQM   |
|                                   | bin taxonomy otherwise),         |
|                                   | ``"bins_gtdb"`` (GTDB bin        |
|                                   | taxonomy) or ``"bins_sqm"`` (SQM |
|                                   | bin taxonomy). If using          |
|                                   | ``"bins_gtdb"``, note that GTDB  |
|                                   | taxonomy may differ from the     |
|                                   | NCBI taxonomy used throughout    |
|                                   | the rest of SqueezeMeta. Default |
|                                   | ``"bins"``.                      |
+-----------------------------------+----------------------------------+
| ``trusted_functions_only``        | logical. If ``TRUE``, only       |
|                                   | highly trusted functional        |
|                                   | annotations (best hit + best     |
|                                   | average) will be considered when |
|                                   | generating aggregated function   |
|                                   | tables. If ``FALSE``, best hit   |
|                                   | annotations will be used         |
|                                   | (default ``FALSE``).             |
+-----------------------------------+----------------------------------+
| ``ignore_unclassified_functions`` | logical. If ``FALSE``, ORFs with |
|                                   | no functional classification     |
|                                   | will be aggregated together into |
|                                   | an "Unclassified" category. If   |
|                                   | ``TRUE``, they will be ignored   |
|                                   | (default ``FALSE``).             |
+-----------------------------------+----------------------------------+
| ``rescale_tpm``                   | logical. If ``TRUE``, TPMs for   |
|                                   | KEGGs, COGs, and PFAMs will be   |
|                                   | recalculated (so that the TPMs   |
|                                   | in the subset actually add up to |
|                                   | 1 million). Otherwise,           |
|                                   | per-function TPMs will be        |
|                                   | calculated by aggregating the    |
|                                   | TPMs of the ORFs annotated with  |
|                                   | that function, and will thus     |
|                                   | keep the scaling present in the  |
|                                   | parent object. By default it is  |
|                                   | set to ``TRUE``, which means     |
|                                   | that the returned TPMs will be   |
|                                   | scaled *per million reads of     |
|                                   | the selected bins*.              |
+-----------------------------------+----------------------------------+
| ``rescale_copy_number``           | logical. If ``TRUE``, copy       |
|                                   | numbers will be recalculated     |
|                                   | using the median single-copy     |
|                                   | gene coverages in the subset.    |
|                                   | Otherwise, single-copy gene      |
|                                   | coverages will be taken from the |
|                                   | parent object. By default it is  |
|                                   | set to ``TRUE``, which means     |
|                                   | that the returned copy numbers   |
|                                   | for each function will represent |
|                                   | the average copy number of that  |
|                                   | function *per genome of the      |
|                                   | selected bins*.                  |
+-----------------------------------+----------------------------------+
| ``allow_empty``                   | (internal use only).             |
+-----------------------------------+----------------------------------+

Value
~~~~~

SQM object containing only the requested bins.

See Also
~~~~~~~~

``subsetContigs``, ``subsetORFs``

Examples
~~~~~~~~

.. code:: R

   library(SQMtools)
   data(Hadza)
   # Which are the two most complete bins?
   topBinNames = rownames(Hadza$bins$table)[order(Hadza$bins$table[,"Completeness"], decreasing=TRUE)][1:2]
   # Subset with the most complete bin.
   topBin = subsetBins(Hadza, topBinNames[1])
   # Subset with all the bins with at least 90% completeness.
   over90 = subsetBins(Hadza, min_completeness = 90)
   # Subset with bins from the Phascolarctobacterium genus using SqueezeMeta's taxonomy.
   phasco = subsetBins(Hadza, tax_source = "bins_sqm", rank = "genus", tax = "Phascolarctobacterium")
   # Subset with bins from the Bacteroidota phylum using GTDB taxonomy.
   bact = subsetBins(Hadza, tax_source = "bins_gtdb", rank = "phylum", tax = "Bacteroidota")
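
The precedence between ``bins`` and the quality thresholds described
under Arguments can be sketched on a toy table. This is a base R
illustration only, not the SQMtools implementation: the ``toy_bins``
data frame, its values, the ``Contamination`` column name and the
``select_bins`` helper are all hypothetical.

.. code:: R

   # Hypothetical stand-in for a bin table; values are invented.
   toy_bins <- data.frame(
     Completeness  = c(98.2, 91.5, 74.0, 55.3),
     Contamination = c(1.1, 6.4, 2.0, 0.5),
     row.names     = c("binA", "binB", "binC", "binD")
   )

   # Hypothetical helper mirroring the documented precedence:
   # an explicit `bins` vector overrides the quality thresholds.
   select_bins <- function(tab, bins = NULL,
                           min_completeness = NULL,
                           max_contamination = NULL) {
     if (!is.null(bins)) {
       return(intersect(bins, rownames(tab)))
     }
     keep <- rep(TRUE, nrow(tab))
     if (!is.null(min_completeness)) {
       # Discard bins with completeness lower than the threshold.
       keep <- keep & tab$Completeness >= min_completeness
     }
     if (!is.null(max_contamination)) {
       # Discard bins with contamination higher than the threshold.
       keep <- keep & tab$Contamination <= max_contamination
     }
     rownames(tab)[keep]
   }

   # Thresholds only: binB is dropped for contamination, binC and binD
   # for completeness.
   hq <- select_bins(toy_bins, min_completeness = 90, max_contamination = 5)

   # Explicit bins win even though binC is below the threshold.
   explicit <- select_bins(toy_bins, bins = "binC", min_completeness = 90)

In this sketch ``hq`` contains only ``binA``, while ``explicit``
contains ``binC`` because the thresholds are skipped once ``bins`` is
given.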
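
The effect of ``rescale_tpm`` can also be worked through by hand. The
sketch below assumes the standard TPM definition (length-normalised
read rate scaled to one million); how SQMtools computes TPMs
internally may differ. The ``orfs`` table, its values and the ``tpm``
helper are hypothetical.

.. code:: R

   # Invented ORF-level data: read counts, lengths (bp), function, bin.
   orfs <- data.frame(
     reads  = c(100, 300, 200, 400),
     length = c(1000, 1500, 1000, 2000),
     fun    = c("K1", "K2", "K1", "K2"),
     bin    = c("binA", "binA", "binB", "binB")
   )

   # Standard TPM: length-normalised rate scaled to sum to one million.
   tpm <- function(reads, length) {
     rate <- reads / length
     rate / sum(rate) * 1e6
   }

   # TPMs scaled against the whole (parent) data set.
   orfs$tpm_parent <- tpm(orfs$reads, orfs$length)

   # Keep only the ORFs from one bin.
   sub <- orfs[orfs$bin == "binA", ]

   # Analogue of rescale_tpm = FALSE: sum parent-scaled ORF TPMs per
   # function, so the parent scaling is kept.
   kept <- tapply(sub$tpm_parent, sub$fun, sum)

   # Analogue of rescale_tpm = TRUE: recompute TPMs within the subset,
   # so they add up to one million again.
   rescaled <- tapply(tpm(sub$reads, sub$length), sub$fun, sum)

Here ``sum(rescaled)`` is one million whereas ``sum(kept)`` is smaller,
since the ``binB`` ORFs still count towards the denominator. The
relative proportions of the functions are the same in both vectors;
only the scale changes.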