cytopy.flow.sampling

Sampling is unavoidable if analysis is to remain manageable. This module contains all the functionality for downsampling and subsequent upsampling in cytopy. cytopy supports uniform sampling, which wraps the Pandas DataFrame sample method. In addition, we provide support for density-dependent downsampling (adapted from SPADE; https://www.nature.com/articles/nbt.1991) and faithful downsampling (adapted from SamSPECTRAL; https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-403).

Copyright 2020 Ross Burton

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Functions:

density_dependent_downsampling(data[, …])

Perform density dependent down-sampling to remove risk of under-sampling rare populations; adapted from SPADE*

density_probability_assignment(sample, data)

Generate an estimation of local density amongst single cell population using the KDTree algorithm from Scikit-Learn.

faithful_downsampling(data, h)

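An implementation of faithful downsampling as described in: Zare H, Shooshtari P, Gupta A, Brinkman R. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics 2010;11:403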

prob_downsample(local_d, target_d, outlier_d)

Given local, target and outlier density (as estimated by KNN) calculate the probability of retaining the event.

uniform_downsampling(data, sample_size, **kwargs)

Uniform downsampling.

upsample_density(data[, features, …])

Perform upsampling in a density dependent manner; neighbourhoods of cells of low density will have a high probability of being upsampled versus dense neighbourhoods.

upsample_knn(sample, original_data, labels, …)

Given some sampled dataframe and the original dataframe from which it was derived, use the given labels (which should correspond to the sampled dataframe row index) to fit a nearest neighbours model to the sampled data and predict the assignment of labels in the original data.

cytopy.flow.sampling.density_dependent_downsampling(data: pandas.core.frame.DataFrame, features: Optional[list] = None, sample_size: int = 0.1, alpha: int = 5, distance_metric: str = 'manhattan', tree_sample: float = 0.1, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)

Perform density dependent down-sampling to remove risk of under-sampling rare populations; adapted from SPADE*

  • Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.

Parameters
  • data (Pandas.DataFrame) – Data to sample

  • features (list (defaults to all columns)) – Name of columns to be used as features in down-sampling algorithm

  • sample_size (int or float (default=0.1)) – number of events to return in sample, either as an integer or as a fraction of the original sample size

  • alpha (int, (default=5)) – used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)

  • distance_metric (str (default="manhattan")) – Metric used for neighbour assignment

  • tree_sample (float or int, (default=0.1)) – proportion/number of cells to sample for generation of KD tree

  • outlier_dens (float, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the local densities, e.g. 1 (the default value) means that the 1% of cells with the lowest local densities are regarded as noise

  • target_dens (float, (default=5)) – determines how many cells will survive the down-sampling process; given as a percentile of the local densities, e.g. 5 (the default value) means that the density of the bottom 5% of cells serves as the density threshold for rare cell populations

  • njobs (int (default=-1)) – Number of jobs to run in parallel when calculating weights (defaults to all available cores)

Returns

Down-sampled pandas dataframe

Return type

Pandas.DataFrame
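The final step of a density dependent down-sample, drawing events once each has a retention weight, can be sketched as below. This is a minimal illustration and not cytopy's implementation: the function name weighted_downsample_sketch, the example weights and the fall-back to uniform sampling when all weights are zero are assumptions made for the example. It relies only on the weights argument of pandas.DataFrame.sample.

```python
import numpy as np
import pandas as pd


def weighted_downsample_sketch(data: pd.DataFrame, weights, sample_size=0.1, seed=None):
    """Draw rows without replacement, favouring rows with larger weights."""
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0:
        # Assumption: fall back to uniform sampling if every event was weighted out.
        weights = None
    if isinstance(sample_size, int):
        # An integer is read as the number of events to return.
        return data.sample(n=sample_size, weights=weights, random_state=seed)
    # A float is read as the fraction of events to return.
    return data.sample(frac=sample_size, weights=weights, random_state=seed)


df = pd.DataFrame({"CD3": np.arange(1000), "CD4": np.arange(1000)})
# Hypothetical weights: the first 900 rows are "dense" (0.1), the last 100 are "rare" (1.0).
weights = np.r_[np.full(900, 0.1), np.full(100, 1.0)]
subset = weighted_downsample_sketch(df, weights, sample_size=100, seed=0)
print((subset.index >= 900).mean())  # share of "rare" rows in the sample
```

Under uniform sampling the rare rows would make up about 10% of the sample; with these weights their share should be noticeably larger.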

cytopy.flow.sampling.density_probability_assignment(sample: pandas.core.frame.DataFrame, data: pandas.core.frame.DataFrame, distance_metric: str = 'manhattan', alpha: int = 5, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)

Generate an estimate of local density amongst a single cell population using the KDTree algorithm from Scikit-Learn. Using this representation, return the probability of retention for each event, as calculated by prob_downsample. Adapted from SPADE*

  • Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.

Parameters
  • sample (Pandas.DataFrame) – Downsampled data to use for generating nearest neighbours tree graph

  • data (Pandas.DataFrame) – Original dataframe

  • distance_metric (str (default="manhattan")) – Metric used for neighbour assignment

  • alpha (int) – Used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)

  • outlier_dens (int, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the local densities, e.g. 1 (the default value) means that the 1% of cells with the lowest local densities are regarded as noise

  • target_dens (int, (default=5)) – determines how many cells will receive a probability > 0; given as a percentile of the local densities, e.g. 5 (the default value) means that the density of the bottom 5% of cells serves as the density threshold for rare cell populations

  • njobs (int (default=-1)) – Controls how many parallel processes to run in the KDTree search. Default is -1, which will use all available cores.

Returns

Probability of retention for each event

Return type

numpy.ndarray
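The density estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not cytopy's implementation: the function name density_probabilities_sketch is hypothetical, the distance threshold is assumed to be alpha times the median distance to the second-nearest tree point, local density is assumed to be the number of tree points within that threshold, and the njobs parallelism is left out.

```python
import numpy as np
from sklearn.neighbors import KDTree


def density_probabilities_sketch(sample, data, alpha=5, outlier_dens=1,
                                 target_dens=5, distance_metric="manhattan"):
    """Hypothetical sketch of a SPADE-style retention probability per event."""
    tree = KDTree(sample, metric=distance_metric)
    # Distances to the two nearest tree points. The second one is used because an
    # event that is itself in the sample has a nearest neighbour at distance 0.
    dist, _ = tree.query(data, k=2)
    threshold = alpha * np.median(dist[:, 1])
    # Local density: number of tree points within the threshold radius.
    local_d = tree.query_radius(data, r=threshold, count_only=True)
    # Outlier and target densities are percentiles of the local densities.
    outlier_d = np.percentile(local_d, outlier_dens)
    target_d = np.percentile(local_d, target_dens)
    # Piecewise rule: 0 for outliers, 1 up to the target density, ratio above it.
    ratio = target_d / np.maximum(local_d, 1)  # np.maximum avoids division by zero
    return np.where(local_d <= outlier_d, 0.0,
                    np.where(local_d <= target_d, 1.0, ratio))


rng = np.random.default_rng(0)
dense = rng.normal(0, 0.5, size=(1000, 2))  # abundant population
rare = rng.normal(6, 0.5, size=(30, 2))     # rare population
data = np.vstack([dense, rare])
sample = data[rng.choice(data.shape[0], size=300, replace=False)]
prob = density_probabilities_sketch(sample, data)
print(prob.shape, prob.min(), prob.max())
```

Building the tree on a sub-sample rather than on all events keeps the neighbour search cheap, which appears to be the purpose of the tree_sample parameter in the calling functions.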

cytopy.flow.sampling.faithful_downsampling(data: numpy.array, h: float)

An implementation of faithful downsampling as described in: Zare H, Shooshtari P, Gupta A, Brinkman R. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics 2010;11:403

Parameters
  • data (numpy.ndarray) – numpy array to be down-sampled

  • h (float) – radius for nearest neighbours search

Returns

Down-sampled array

Return type

numpy.ndarray
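The general idea of faithful downsampling can be sketched as below: pick a random event as a representative, register every event within radius h to it, and repeat until no unregistered events remain. This is a minimal sketch rather than cytopy's code; the function name faithful_downsample_sketch, the seed argument and the use of Euclidean distance are assumptions made for the example.

```python
import numpy as np


def faithful_downsample_sketch(data: np.ndarray, h: float, seed: int = 0) -> np.ndarray:
    """Pick representatives until every point lies within h of one of them."""
    rng = np.random.default_rng(seed)
    pending = np.ones(data.shape[0], dtype=bool)  # points not yet registered
    representatives = []
    while pending.any():
        # Choose a random unregistered point as the next representative.
        idx = rng.choice(np.flatnonzero(pending))
        representatives.append(idx)
        # Register every pending point within radius h (Euclidean, by assumption).
        dist = np.linalg.norm(data - data[idx], axis=1)
        pending &= dist > h
        pending[idx] = False  # the representative itself is always registered
    return data[representatives]


rng = np.random.default_rng(1)
events = rng.normal(size=(2000, 2))
reduced = faithful_downsample_sketch(events, h=0.5)
print(events.shape[0], "->", reduced.shape[0])
```

A larger h registers more events per representative and therefore returns a smaller array.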

cytopy.flow.sampling.prob_downsample(local_d: int, target_d: int, outlier_d: int)

Given local, target and outlier density (as estimated by KNN), calculate the probability of retaining the event. If the local density is less than or equal to the outlier density, return a probability of 0 (the event will be discarded). If the local density is greater than the outlier density but does not exceed the target density, return a value of 1 (always keep this event). If the local density is greater than the target density, the probability of retention is the ratio of the target density to the local density.

Parameters
  • local_d (int) – Local density of the event

  • target_d (int) – Target density

  • outlier_d (int) – Outlier density

Returns

Value between 0 and 1

Return type

float
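The piecewise rule described above can be written out directly. The following is a hypothetical re-implementation for illustration (the name retention_probability is not part of cytopy), with a few example densities evaluated against a target density of 5 and an outlier density of 1.

```python
def retention_probability(local_d: float, target_d: float, outlier_d: float) -> float:
    """Piecewise retention rule as described in the text above."""
    if local_d <= outlier_d:
        return 0.0  # treated as noise, always discarded
    if local_d <= target_d:
        return 1.0  # sparse (rare) region, always kept
    return target_d / local_d  # dense region, thinned towards the target density


for density in (1, 3, 5, 10, 50):
    print(density, retention_probability(density, target_d=5, outlier_d=1))
```

Events in dense regions are kept with a probability that falls as density rises, so a region ten times denser than the target keeps roughly one event in ten.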

cytopy.flow.sampling.uniform_downsampling(data: pandas.core.frame.DataFrame, sample_size: int, **kwargs)

Uniform downsampling. Wraps the Pandas DataFrame sample method with some additional error handling for when the requested sample size is invalid.

Parameters
  • data (Pandas.DataFrame) – Data to sample

  • sample_size (int or float) – Size of sample required. If a float is given will return a sample of this proportion.

  • kwargs – Additional keyword arguments passed to Pandas.DataFrame.sample

Returns

Down-sampled dataframe

Return type

Pandas.DataFrame

Raises

TypeError – Sample size type is invalid; should be either int or float
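A wrapper of this kind around pandas.DataFrame.sample might look like the sketch below. It is not cytopy's code: the name uniform_downsample_sketch is hypothetical, and returning the data unchanged when more rows are requested than exist is an assumed form of the "additional error handling" mentioned above.

```python
import pandas as pd


def uniform_downsample_sketch(data: pd.DataFrame, sample_size, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in for uniform_downsampling."""
    if isinstance(sample_size, bool) or not isinstance(sample_size, (int, float)):
        raise TypeError("sample_size should be an int or a float")
    if isinstance(sample_size, int):
        # Assumption: if more rows are requested than exist, return all of them.
        if sample_size >= data.shape[0]:
            return data
        return data.sample(n=sample_size, **kwargs)
    # A float is read as the proportion of rows to keep.
    return data.sample(frac=sample_size, **kwargs)


df = pd.DataFrame({"CD3": range(100), "CD4": range(100, 200)})
print(uniform_downsample_sketch(df, 10, random_state=42).shape)    # 10 rows
print(uniform_downsample_sketch(df, 0.25, random_state=42).shape)  # 25 rows
```

Extra keyword arguments such as random_state are passed straight through to DataFrame.sample.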

cytopy.flow.sampling.upsample_density(data: pandas.core.frame.DataFrame, features: Optional[list] = None, upsample_factor: int = 2, sample_size: Optional[int] = None, tree_sample: int = 0.1, distance_metric: str = 'manhattan', alpha: int = 5, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)

Perform upsampling in a density dependent manner; neighbourhoods of cells of low density will have a high probability of being upsampled versus dense neighbourhoods. Ignores outliers. Adapted from SPADE*

  • Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.

Parameters
  • data (Pandas.DataFrame) – Data to sample

  • features (list (defaults to all columns)) – Name of columns to be used as features in down-sampling algorithm

  • sample_size (int or float (default=None)) – number of events to return in sample, either as an integer or as a fraction of the original sample size

  • alpha (int, (default=5)) – used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)

  • distance_metric (str (default="manhattan")) – Metric used for neighbour assignment

  • upsample_factor (int (default=2)) – Factor to upsample by (e.g. default=2 would double the observations)

  • tree_sample (float or int, (default=0.1)) – proportion/number of cells to sample for generation of KD tree

  • outlier_dens (float, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the local densities, e.g. 1 (the default value) means that the 1% of cells with the lowest local densities are regarded as noise

  • target_dens (float, (default=5)) – determines how many cells will survive the down-sampling process; given as a percentile of the local densities, e.g. 5 (the default value) means that the density of the bottom 5% of cells serves as the density threshold for rare cell populations

  • njobs (int (default=-1)) – Number of jobs to run in parallel when calculating weights (defaults to all available cores)
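One way to picture density dependent upsampling is as weighted sampling with replacement, where sparse neighbourhoods carry larger weights and outliers carry a weight of zero. The sketch below illustrates that idea only; it is not cytopy's implementation, and the name upsample_sketch, the example weights and the choice to size the result as upsample_factor times the input are assumptions.

```python
import numpy as np
import pandas as pd


def upsample_sketch(data: pd.DataFrame, prob, upsample_factor: int = 2, seed=None):
    """Resample rows with replacement, in proportion to the given weights."""
    n = data.shape[0] * upsample_factor
    weights = np.asarray(prob, dtype=float)
    # Rows with a weight of zero (outliers) can never be drawn.
    return data.sample(n=n, replace=True, weights=weights, random_state=seed)


df = pd.DataFrame({"CD3": np.arange(6), "CD4": np.arange(6)})
# Hypothetical weights: row 0 plays the outlier, rows 1 and 2 the sparse region.
prob = [0.0, 1.0, 1.0, 0.2, 0.2, 0.2]
bigger = upsample_sketch(df, prob, upsample_factor=2, seed=0)
print(bigger.shape[0], 0 in bigger.index)
```

Because the weights need not sum to one, pandas normalises them before drawing.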

cytopy.flow.sampling.upsample_knn(sample: pandas.core.frame.DataFrame, original_data: pandas.core.frame.DataFrame, labels: list, features: list, verbose: bool = True, scoring: str = 'balanced_accuracy', **kwargs)

Given some sampled dataframe and the original dataframe from which it was derived, use the given labels (which should correspond to the sampled dataframe row index) to fit a nearest neighbours model to the sampled data and predict the assignment of labels in the original data. Uses sklearn.neighbors.KNeighborsClassifier for the KNN implementation. If the n_neighbors parameter is not provided, it will be estimated using grid search cross validation. The metric used for this search can be changed with the scoring input (default="balanced_accuracy").

Parameters
  • sample (Pandas.DataFrame) – Sampled dataframe that has been classified/gated/etc

  • original_data (Pandas.DataFrame) – Original dataframe prior to sampling (unlabeled)

  • labels (list) – List of labels (should correspond to the label for each row)

  • features (list) – List of features (column names)

  • verbose (bool (default=True)) – If True, will provide feedback to stdout

  • scoring (str (default="balanced_accuracy")) – Scoring parameter to use for GridSearchCV. Only relevant if the n_neighbors parameter is not provided

  • kwargs (dict) – Additional keyword arguments passed to Scikit-Learn’s KNeighborsClassifier

Returns

Array of labels for original data

Return type

numpy.ndarray
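The label propagation described above can be sketched with Scikit-Learn as follows. This is an illustration under assumptions, not cytopy's code: the name upsample_knn_sketch, the candidate values for n_neighbors, the three-fold cross validation and the synthetic two-population data are all chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def upsample_knn_sketch(sample, original_data, labels, features,
                        scoring="balanced_accuracy", **kwargs):
    """Fit KNN on the labelled sample and predict labels for the original data."""
    if "n_neighbors" in kwargs:
        model = KNeighborsClassifier(**kwargs)
    else:
        # Assumed search grid; the values cytopy searches are not stated here.
        search = GridSearchCV(KNeighborsClassifier(**kwargs),
                              param_grid={"n_neighbors": [3, 5, 11]},
                              scoring=scoring, cv=3)
        search.fit(sample[features], labels)
        model = search.best_estimator_
    model.fit(sample[features], labels)
    return model.predict(original_data[features])


rng = np.random.default_rng(0)
original = pd.DataFrame(
    np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))]),
    columns=["x", "y"],
)
sample = original.sample(n=60, random_state=0)
# Stand-in for gating: label the sampled events by which cluster they fall in.
labels = np.where(sample["x"] > 2.5, "pop_high", "pop_low").tolist()
predicted = upsample_knn_sketch(sample, original, labels, features=["x", "y"])
print(pd.Series(predicted).value_counts())
```

Passing n_neighbors explicitly, for example upsample_knn_sketch(..., n_neighbors=5), skips the grid search in this sketch.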