cytopy.flow.sampling
Sampling is unavoidable if analysis is to remain manageable. This module contains all the functionality for downsampling and subsequent upsampling in cytopy. cytopy supports uniform sampling, which wraps the Pandas DataFrame sample method. In addition, it supports density-dependent downsampling (adapted from SPADE; https://www.nature.com/articles/nbt.1991) and faithful downsampling (adapted from SamSPECTRAL; https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-403).
Copyright 2020 Ross Burton
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Functions:

density_dependent_downsampling – Perform density-dependent downsampling to remove the risk of undersampling rare populations; adapted from SPADE*

density_probability_assignment – Generate an estimation of local density amongst a single cell population using the KDTree algorithm from scikit-learn.

faithful_downsampling – An implementation of faithful downsampling as described in: Zare H, Shooshtari P, Gupta A, Brinkman R.

prob_downsample – Given local, target and outlier density (as estimated by KNN), calculate the probability of retaining the event.

uniform_downsampling – Uniform downsampling.

upsample_density – Perform upsampling in a density-dependent manner; neighbourhoods of cells of low density will have a high probability of being upsampled versus dense neighbourhoods.

upsample_knn – Given some sampled dataframe and the original dataframe from which it was derived, use the given labels (which should correspond to the sampled dataframe row index) to fit a nearest neighbours model to the sampled data and predict the assignment of labels in the original data.

cytopy.flow.sampling.
density_dependent_downsampling
(data: pandas.core.frame.DataFrame, features: Optional[list] = None, sample_size: int = 0.1, alpha: int = 5, distance_metric: str = 'manhattan', tree_sample: float = 0.1, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)
Perform density-dependent downsampling to remove the risk of undersampling rare populations; adapted from SPADE*
* Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE.
Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.
 Parameters
data (Pandas.DataFrame) – Data to sample
features (list (defaults to all columns)) – Name of columns to be used as features in downsampling algorithm
sample_size (int or float (default=0.1)) – number of events to return in the sample, either as an integer or as a fraction of the original sample size
alpha (int, (default=5)) – used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)
distance_metric (str (default="manhattan")) – Metric used for neighbour assignment
tree_sample (float or int, (default=0.1)) – proportion/number of cells to sample for generation of KD tree
outlier_dens (float, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the lowest local densities, e.g. 1 (the default value) means the bottom 1% of cells with the lowest local densities are regarded as noise
target_dens (float, (default=5)) – determines how many cells will survive the downsampling process; given as a percentile of the lowest local densities, e.g. 5 (the default value) means the density of the bottom 5% of cells will serve as the density threshold for rare cell populations
njobs (int (default=-1)) – Number of jobs to run in parallel when calculating weights (the default of -1 uses all available cores)
 Returns
Downsampled pandas dataframe
 Return type
Pandas.DataFrame

cytopy.flow.sampling.
density_probability_assignment
(sample: pandas.core.frame.DataFrame, data: pandas.core.frame.DataFrame, distance_metric: str = 'manhattan', alpha: int = 5, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)
Generate an estimation of local density amongst a single cell population using the KDTree algorithm from scikit-learn. Using this representation, return the probability of retention for each event, as calculated by prob_downsample. Adapted from SPADE*
* Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE.
Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.
 Parameters
sample (Pandas.DataFrame) – Downsampled data to use for generating nearest neighbours tree graph
data (Pandas.DataFrame) – Original dataframe
distance_metric (str (default="manhattan")) – Metric used for neighbour assignment
alpha (int) – Used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)
outlier_dens (int, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the lowest local densities, e.g. 1 (the default value) means the bottom 1% of cells with the lowest local densities are regarded as noise
target_dens (int, (default=5)) – determines how many cells will receive a probability > 0; given as a percentile of the lowest local densities, e.g. 5 (the default value) means the density of the bottom 5% of cells will serve as the density threshold for rare cell populations
njobs (int (default=-1)) – Controls how many parallel processes to run in the KDTree search. Default is -1, which will use all available cores.
 Returns
Probability of retention for each event
 Return type
numpy.ndarray
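
The density estimate described above can be sketched in plain Python. This is not cytopy's implementation: a brute-force search stands in for the KDTree, and the helper names `manhattan`, `local_densities` and `percentile` are hypothetical. The sketch assumes, in the spirit of SPADE, that the distance threshold is alpha times the median nearest-neighbour distance in the tree sample, and that local density is the number of tree-sample points within that threshold (the event itself included).

```python
import math
import statistics


def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def local_densities(data, tree_sample, alpha=5):
    """Count the tree-sample points lying within the distance threshold of each event."""
    # Distance from every tree-sample point to its nearest neighbour (brute force).
    nearest = [
        min(manhattan(p, q) for j, q in enumerate(tree_sample) if j != i)
        for i, p in enumerate(tree_sample)
    ]
    # Assumed SPADE-style threshold: alpha times the median nearest-neighbour distance.
    threshold = alpha * statistics.median(nearest)
    return [sum(manhattan(p, q) <= threshold for q in tree_sample) for p in data]


def percentile(values, pct):
    """Nearest-rank percentile, used to turn outlier_dens/target_dens into densities."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# A dense cluster of 20 events plus 2 isolated events.
cluster = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)]
rare = [(10.0, 10.0), (10.1, 10.0)]
events = cluster + rare

densities = local_densities(events, tree_sample=events)
outlier_d = percentile(densities, 1)  # density at the 1st percentile
target_d = percentile(densities, 5)   # density at the 5th percentile
print(densities[-2:], outlier_d, target_d)
```

The two isolated events end up with far lower densities than the cluster, which is what lets the percentile thresholds separate noise, rare populations and dense populations.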

cytopy.flow.sampling.
faithful_downsampling
(data: numpy.array, h: float)
An implementation of faithful downsampling as described in: Zare H, Shooshtari P, Gupta A, Brinkman R. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics 2010;11:403
 Parameters
data (numpy.ndarray) – numpy array to be downsampled
h (float) – radius for nearest neighbours search
 Returns
Downsampled array
 Return type
numpy.ndarray
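
As a rough illustration of the idea behind faithful downsampling, the sketch below repeatedly picks a random event that has not yet been covered, keeps it as a representative, and discards every remaining event within radius h of it. It is a plain-Python approximation of the procedure in the cited paper, not cytopy's code; the function name `faithful_downsample_sketch`, the `seed` argument and the use of Euclidean distance are assumptions made for the example.

```python
import math
import random


def faithful_downsample_sketch(points, h, seed=0):
    """Return representatives such that every point lies within h of one of them."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))  # indices not yet covered by a representative
    representatives = []
    while remaining:
        centre = rng.choice(remaining)
        representatives.append(points[centre])
        # Drop the chosen point and everything within radius h of it.
        remaining = [
            i for i in remaining
            if i != centre and math.dist(points[i], points[centre]) > h
        ]
    return representatives


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
reduced = faithful_downsample_sketch(points, h=1.0)
print(len(points), "->", len(reduced))
```

A larger h removes more neighbours per representative and so gives a smaller sample; every original event is still within h of some retained event.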

cytopy.flow.sampling.
prob_downsample
(local_d: int, target_d: int, outlier_d: int)
Given local, target and outlier density (as estimated by KNN), calculate the probability of retaining the event. If the local density is less than or equal to the outlier density, returns a probability of 0 (the event will be discarded). If the local density is greater than the outlier density but less than the target density, returns a value of 1 (absolutely keep this event). If the local density is greater than the target density, then the probability of retention is the ratio between the target and local density.
 Parameters
local_d (int) – Local density of the event
target_d (int) – Target density
outlier_d (int) – Outlier density
 Returns
Value between 0 and 1
 Return type
float
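
The three cases described above amount to a small piecewise function. The sketch below is written from that description alone; the name `retention_probability` is hypothetical and the body is not copied from cytopy.

```python
def retention_probability(local_d: float, target_d: float, outlier_d: float) -> float:
    """Probability of keeping an event, given its local density and the two thresholds."""
    if local_d <= outlier_d:
        return 0.0  # at or below the outlier density: treated as noise and discarded
    if local_d <= target_d:
        # Rare neighbourhood: always keep. At local_d == target_d the ratio
        # target_d / local_d is also 1, so the boundary is consistent.
        return 1.0
    return target_d / local_d  # dense neighbourhood: thin out in proportion to density


for density in (1, 3, 5, 20):
    print(density, retention_probability(density, target_d=5, outlier_d=1))
```

With a target density of 5, an event whose local density is 20 is kept with probability 5 / 20, so dense neighbourhoods are thinned towards the target density while rare ones are left intact.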

cytopy.flow.sampling.
uniform_downsampling
(data: pandas.core.frame.DataFrame, sample_size: int, **kwargs)
Uniform downsampling. Wraps the Pandas DataFrame sample method with some additional error handling for when the requested sample size is invalid.
 Parameters
data (Pandas.DataFrame) – Data to sample
sample_size (int or float) – Size of sample required. If a float is given will return a sample of this proportion.
kwargs – Additional keyword arguments passed to Pandas.DataFrame.sample
 Returns
 Return type
Pandas.DataFrame
 Raises
TypeError – Sample size type is invalid; should be either int or float
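
One way to picture the int-or-float handling of sample_size is as a small helper that converts the request into a row count, which could then be passed as `n` to DataFrame.sample. The helper below is hypothetical (`resolve_sample_size` is not a cytopy function), and clamping an oversized request to the number of available rows is an assumption; the source only says that invalid sizes receive additional error handling.

```python
def resolve_sample_size(n_rows: int, sample_size) -> int:
    """Translate an int (absolute count) or float (proportion) into a number of rows."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sample_size, bool) or not isinstance(sample_size, (int, float)):
        raise TypeError("Sample size type is invalid; should be either int or float")
    if isinstance(sample_size, float):
        requested = round(n_rows * sample_size)  # float: proportion of the data
    else:
        requested = sample_size                  # int: absolute number of events
    # Assumption: never request more rows than exist, and never fewer than zero.
    return max(0, min(requested, n_rows))


print(resolve_sample_size(1000, 0.1))
print(resolve_sample_size(1000, 250))
print(resolve_sample_size(200, 500))
```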

cytopy.flow.sampling.
upsample_density
(data: pandas.core.frame.DataFrame, features: Optional[list] = None, upsample_factor: int = 2, sample_size: Optional[int] = None, tree_sample: int = 0.1, distance_metric: str = 'manhattan', alpha: int = 5, outlier_dens: int = 1, target_dens: int = 5, njobs: int = -1)
Perform upsampling in a density-dependent manner; neighbourhoods of cells of low density will have a high probability of being upsampled versus dense neighbourhoods. Ignores outliers. Adapted from SPADE*
* Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE.
Peng Qiu, Erin Simonds, Sean Bendall, Kenneth Gibbs, Robert Bruggner, Michael Linderman, Karen Sachs, Garry Nolan, Sylvia Plevritis. Nature Biotechnology, 2011.
 Parameters
data (Pandas.DataFrame) – Data to sample
features (list (defaults to all columns)) – Name of columns to be used as features in downsampling algorithm
sample_size (int or float (default=0.1)) – number of events to return in the sample, either as an integer or as a fraction of the original sample size
alpha (int, (default=5)) – used for estimating distance threshold between cell and nearest neighbour (default = 5 used in original paper)
distance_metric (str (default="manhattan")) – Metric used for neighbour assignment
upsample_factor (int (default=2)) – Factor to upsample by (e.g. default=2 would double the observations)
tree_sample (float or int, (default=0.1)) – proportion/number of cells to sample for generation of KD tree
outlier_dens (float, (default=1)) – used to exclude cells with the lowest local densities; given as a percentile of the lowest local densities, e.g. 1 (the default value) means the bottom 1% of cells with the lowest local densities are regarded as noise
target_dens (float, (default=5)) – determines how many cells will survive the downsampling process; given as a percentile of the lowest local densities, e.g. 5 (the default value) means the density of the bottom 5% of cells will serve as the density threshold for rare cell populations
njobs (int (default=-1)) – Number of jobs to run in parallel when calculating weights (the default of -1 uses all available cores)
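
To illustrate what "density-dependent" upsampling could look like, the sketch below draws extra events with replacement, giving zero weight to outliers and weighting the rest by the inverse of their local density. The inverse-density weighting, the function names and the `seed` argument are all assumptions for illustration; the source does not state the exact weighting cytopy uses.

```python
import random


def upsample_weights(local_densities, outlier_d):
    """Zero weight for outliers; otherwise weight inversely to local density (assumed)."""
    return [0.0 if d <= outlier_d else 1.0 / d for d in local_densities]


def upsample_indices(local_densities, outlier_d, n_new, seed=0):
    """Draw n_new row indices with replacement, favouring sparse neighbourhoods."""
    rng = random.Random(seed)
    weights = upsample_weights(local_densities, outlier_d)
    return rng.choices(range(len(local_densities)), weights=weights, k=n_new)


# Row 0 is an outlier, row 1 sits in a sparse neighbourhood, rows 2-3 in a dense one.
densities = [1, 2, 50, 50]
drawn = upsample_indices(densities, outlier_d=1, n_new=200)
print({i: drawn.count(i) for i in range(len(densities))})
```

The drawn indices would then be used to duplicate rows of the original data; the outlier is never drawn, and the sparse event is drawn far more often than the dense ones.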

cytopy.flow.sampling.
upsample_knn
(sample: pandas.core.frame.DataFrame, original_data: pandas.core.frame.DataFrame, labels: list, features: list, verbose: bool = True, scoring: str = 'balanced_accuracy', **kwargs)
Given some sampled dataframe and the original dataframe from which it was derived, use the given labels (which should correspond to the sampled dataframe row index) to fit a nearest neighbours model to the sampled data and predict the assignment of labels in the original data. Uses sklearn.neighbors.KNeighborsClassifier for the KNN implementation. If the n_neighbors parameter is not provided, it is estimated using grid search cross validation. The metric used by the grid search can be changed through the scoring argument (default="balanced_accuracy").
 Parameters
sample (Pandas.DataFrame) – Sampled dataframe that has been classified/gated/etc
original_data (Pandas.DataFrame) – Original dataframe prior to sampling (unlabeled)
labels (list) – List of labels (should correspond to the label for each row)
features (list) – List of features (column names)
verbose (bool (default=True)) – If True, will provide feedback to stdout
scoring (str (default="balanced_accuracy")) – Scoring parameter to use for GridSearchCV. Only relevant if the n_neighbors parameter is not provided
kwargs (dict) – Additional keyword arguments passed to scikit-learn's KNeighborsClassifier
 Returns
Array of labels for original data
 Return type
numpy.ndarray
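
The label propagation described above can be sketched without any dependencies. Here a brute-force majority vote stands in for sklearn.neighbors.KNeighborsClassifier, and a fixed `k` replaces the grid search over n_neighbors; `knn_predict` is a hypothetical name and this is an illustration of the idea rather than cytopy's implementation.

```python
import math
from collections import Counter


def knn_predict(sample, labels, original, k=3):
    """Label each original event by majority vote among its k nearest sampled events."""
    predictions = []
    for point in original:
        # Indices of the k sampled events closest to this event (Euclidean distance).
        nearest = sorted(range(len(sample)), key=lambda i: math.dist(point, sample[i]))[:k]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Sampled events that have already been gated into two populations.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["A", "A", "A", "B", "B", "B"]

# Events from the original data that were left out of the sample.
original = [(0.2, 0.2), (10.5, 10.5), (1, 1)]
print(knn_predict(sample, labels, original))
```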