cytopy.data.fcs

The fcs module houses all functionality for the management and manipulation of data pertaining to a single biological specimen. This might include multiple cytometry files (primary staining and controls), all of which are housed within the FileGroup document. FileGroups should be generated and accessed through the Experiment class.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Classes:

FileGroup(*args, **kwargs)

Document representation of a file group; a selection of related fcs files (e.g. a sample and its associated controls).

Functions:

`data_loaded`(func)	Decorator that asserts the h5 file corresponding to the FileGroup exists.
`overwrite_or_create`(file, data, key)	Check if node exists in hdf5 file.
`population_in_file`(func)	Wrapper to test if requested population passed to the given function exists in the given h5 file object
`population_stats`(filegroup)	Given a FileGroup generate a DataFrame detailing the number of events, proportion of parent population, and proportion of total (root population) for each population in the FileGroup.
`set_column_names`(df, channels, markers[, …])	Given a dataframe of fcs events and lists of channels and markers, set the column names according to the given preference.

class cytopy.data.fcs.FileGroup(*args, **kwargs)

Document representation of a file group; a selection of related fcs files (e.g. a sample and its associated controls).

primary_id

Unique ID to associate to group

Type: str, required

files

List of File objects

Type: EmbeddedDocList

flags

Warnings associated to file group

Type: str, optional

notes

Additional free text

Type: str, optional

populations

Populations derived from this file group

Type: EmbeddedDocList

gates

Gate objects that have been applied to this file group

Type: EmbeddedDocList

collection_datetime

Date and time of sample collection

Type: DateTime, optional

processing_datetime

Date and time of sample processing

Type: DateTime, optional

valid

True if FileGroup is valid

Type: BooleanField (default=True)

subject

Reference to Subject. If Subject is deleted, this field is nullified but the FileGroup will persist

Type: ReferenceField

Miscellaneous:

`DoesNotExist`
`MultipleObjectsReturned`

Methods:

`add_ctrl_file`(ctrl_id, data, channels, markers)	Add a new control file to this FileGroup.
`add_population`(population)	Add a new Population to this FileGroup.
`delete`([delete_hdf5_file])	Delete FileGroup
`delete_populations`(populations)	Delete given populations.
`get_population`(population_name)	Given the name of a population associated to the FileGroup, returns the Population object, with index and control index ready loaded.
`get_population_by_parent`(parent)	Given the name of some parent population, return a list of Population objects whose parent matches
`init_new_file`(data, channels, markers)	Under the assumption that this FileGroup has not been previously defined, generate an HDF5 file and initialise the root Population
`list_downstream_populations`(population)	For a given population find all dependencies
`list_populations`()	List population names
`load_ctrl_population_df`(ctrl, population[, …])	Load a population from an associated control.
`load_population_df`(population[, transform, …])	Load the DataFrame for the events pertaining to a single population.
`merge_gate_populations`(left, right[, …])	Merge two populations present in the current population tree.
`merge_non_geom_populations`(populations, …)	Merge multiple populations that are sourced from either classification or clustering methods.
`population_stats`(population[, warn_missing])	Returns a dictionary of statistics (number of events, proportion of parent, and proportion of all events) for the requested population.
`print_population_tree`([image, path])	Print population tree to stdout or save as an image if ‘image’ is True.
`quantile_clean`([upper, lower])	Iterate over every channel in the flow data and cut the upper and lower quantiles.
`save`(*args, **kwargs)	Save FileGroup and associated populations
`subtract_populations`(left, right[, …])	Subtract the right population from the left population.
`update_population`(pop)	Replace an existing population.

exception DoesNotExist

exception MultipleObjectsReturned

add_ctrl_file(ctrl_id: str, data: numpy.array, channels: List[str], markers: List[str])

Add a new control file to this FileGroup.

Parameters

ctrl_id (str) – Name of the control, e.g. “CD45RA FMO” or “HLA-DR isotype control”
data (numpy.ndarray) – Single cell events data obtained for this control
channels (list) – List of channel names
markers (list) – List of marker names

Returns

Return type

None

Raises

AssertionError – If control already exists

add_population(population: cytopy.data.population.Population)

Add a new Population to this FileGroup.

Parameters

population (Population) –

Returns

Return type

None

Raises

DuplicatePopulationError – Population already exists
AssertionError – Population is missing index

delete(delete_hdf5_file: bool = True, *args, **kwargs)

Delete FileGroup

Parameters: delete_hdf5_file (bool (default=True)) –
Returns
Return type: None

delete_populations(populations: list) → None

Delete given populations. Populations downstream from the deleted population(s) will also be removed.

Parameters: populations (list or str) – Either a list of populations (list of strings) to remove or a single population as a string. If a value of “all” is given, all populations are dropped.
Returns
Return type: None
Raises: AssertionError – If invalid value given for populations

get_population(population_name: str) → cytopy.data.population.Population

Given the name of a population associated to the FileGroup, returns the Population object, with index and control index ready loaded.

Parameters: population_name (str) – Name of population to retrieve from database
Returns
Return type: Population
Raises: MissingPopulationError – If population doesn’t exist

get_population_by_parent(parent: str) → Generator

Given the name of some parent population, return a list of Population objects whose parent matches

Parameters: parent (str) – Name of the parent population to search for
Returns: List of Populations
Return type: Generator

init_new_file(data: numpy.array, channels: List[str], markers: List[str])

Under the assumption that this FileGroup has not been previously defined, generate an HDF5 file and initialise the root Population

Parameters

data (numpy.ndarray) –
channels (list) –
markers (list) –

Returns

Return type

None

list_downstream_populations(population: str) → list

For a given population find all dependencies

Parameters: population (str) – population name
Returns: List of populations dependent on given population
Return type: list or None
Raises: AssertionError – If Population does not exist
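
As a rough illustration of what "dependencies" means here, the sketch below walks a population tree stored as a plain dict that maps each population name to its parent. The dict layout and the helper name `downstream_of` are hypothetical stand-ins chosen for this example; the real method works on the FileGroup's own population tree and may order its results differently.

```python
from typing import Dict, List, Optional


def downstream_of(parents: Dict[str, Optional[str]], population: str) -> List[str]:
    """Return every population that descends from `population`.

    `parents` maps population name -> parent name (None for the root).
    Hypothetical helper for illustration only; not part of cytopy.
    """
    assert population in parents, f"{population} does not exist"
    found: List[str] = []
    frontier = [population]
    while frontier:
        current = frontier.pop()
        # Direct children of the population currently being visited
        children = [name for name, parent in parents.items() if parent == current]
        found.extend(children)
        frontier.extend(children)
    return sorted(found)


tree = {
    "root": None,
    "CD3+": "root",
    "CD4+": "CD3+",
    "CD8+": "CD3+",
    "CD4+CD25+": "CD4+",
}
print(downstream_of(tree, "CD3+"))  # ['CD4+', 'CD4+CD25+', 'CD8+']
```

A population with no children yields an empty list in this sketch, whereas the documented return type above also allows None.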

list_populations() → list

List population names

Returns
Return type: List

load_ctrl_population_df(ctrl: str, population: str, classifier: str = 'XGBClassifier', classifier_params: Optional[dict] = None, scoring: str = 'balanced_accuracy', transform: str = 'logicle', transform_kwargs: Optional[dict] = None, verbose: bool = True, evaluate_classifier: bool = True, kfolds: int = 5, n_permutations: int = 25, sample_size: int = 10000) → pandas.core.frame.DataFrame

Load a population from an associated control. The assumption here is that control files have been collected at the same time as primary staining and differ by the absence or permutation of a marker/channel/stain. Therefore the population of interest in the primary staining will be used as training data to identify the equivalent population in the control.

The user should specify the control file, the population they want (which MUST already exist in the primary staining) and the type of classifier to use. Additional parameters can be passed to control the classifier, and stratified cross validation with permutation testing will be performed if evaluate_classifier is set to True.

Parameters

ctrl (str) – Control file to estimate population for
population (str) – Population of interest. MUST already exist in the primary staining.
classifier (str (default='XGBClassifier')) – Classifier to use. String value should correspond to a valid Scikit-Learn classifier class name or XGBClassifier for XGBoost.
classifier_params (dict, optional) – Additional keyword arguments passed when initiating the classifier
scoring (str (default='balanced_accuracy')) – Method used to evaluate the performance of the classifier if evaluate_classifier is True. String value should be one of the functions of Scikit-Learn’s classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html.
transform (str (default='logicle')) – Transformation to be applied to data prior to classification
transform_kwargs (dict, optional) – Additional keyword arguments applied to Transformer
verbose (bool (default=True)) – Whether to provide feedback
evaluate_classifier (bool (default=True)) – If True, stratified cross validation with permutation testing is applied prior to predicting control population, feeding back to stdout the performance of the classifier across k folds and n permutations
kfolds (int (default=5)) – Number of cross validation rounds to perform if evaluate_classifier is True
n_permutations (int (default=25)) – Number of rounds of permutation testing to perform if evaluate_classifier is True
sample_size (int (default=10000)) – Number of events to sample from primary data for training

Returns

Return type

pandas.DataFrame

Raises

AssertionError – If desired population is not found in the primary staining
MissingControlError – If the chosen control does not exist

load_population_df(population: str, transform: str = 'logicle', features_to_transform: Optional[list] = None, transform_kwargs: Optional[dict] = None, label_downstream_affiliations: bool = False) → pandas.core.frame.DataFrame

Load the DataFrame for the events pertaining to a single population.

Parameters

population (str) – Name of the desired population
transform (str or dict, optional (default="logicle")) – Transform to be applied; specify a value of None to not perform any transformation
features_to_transform (list, optional) – Features (columns) to be transformed. If not provided, all columns are transformed
transform_kwargs (dict, optional) – Additional keyword arguments passed to Transformer
label_downstream_affiliations (bool (default=False)) – If True, an additional column will be generated named “population_label” containing the end node membership of each event e.g. if you choose CD4+ population and there are subsequent populations belonging to this CD4+ population in a tree like: “CD4+ -> CD4+CD25+ -> CD4+CD25+CD45RA+” then the population label column will contain the name of the lowest possible “leaf” population that an event is assigned to.

Returns

Return type

pandas.DataFrame

Raises

AssertionError – Invalid population, does not exist
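
The "lowest possible leaf" labelling described for label_downstream_affiliations can be sketched with plain dicts. This is only an illustration under stated assumptions: `indexes` (population name to event indices), `parents` (population name to parent name) and the function `label_events` are hypothetical names, and the real method writes its labels into a “population_label” column of the returned DataFrame rather than a dict.

```python
from typing import Dict, List, Optional


def label_events(
    indexes: Dict[str, List[int]],
    parents: Dict[str, Optional[str]],
    population: str,
) -> Dict[int, str]:
    """Map each event of `population` to its deepest downstream population."""

    def depth(name: str) -> int:
        # Number of steps from `name` up to the root
        steps = 0
        while parents[name] is not None:
            name = parents[name]
            steps += 1
        return steps

    def is_downstream(name: str) -> bool:
        # True if `population` is an ancestor of `name`
        node = parents[name]
        while node is not None:
            if node == population:
                return True
            node = parents[node]
        return False

    # Every event starts out labelled with the requested population
    labels = {event: population for event in indexes[population]}
    # Visit shallow populations first so deeper ones overwrite their labels
    for name in sorted((n for n in indexes if is_downstream(n)), key=depth):
        for event in indexes[name]:
            if event in labels:
                labels[event] = name
    return labels


parents = {
    "root": None,
    "CD4+": "root",
    "CD4+CD25+": "CD4+",
    "CD4+CD25+CD45RA+": "CD4+CD25+",
}
indexes = {
    "root": [0, 1, 2, 3, 4, 5],
    "CD4+": [0, 1, 2, 3],
    "CD4+CD25+": [1, 2, 3],
    "CD4+CD25+CD45RA+": [3],
}
labels = label_events(indexes, parents, "CD4+")
print(labels)
```

Here event 0 keeps the label “CD4+” because it belongs to no deeper population, while event 3 is labelled with the deepest node of the chain.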

merge_gate_populations(left: cytopy.data.population.Population, right: cytopy.data.population.Population, new_population_name: Optional[str] = None)

Merge two populations present in the current population tree. The merged population will have the combined index of both populations but will not inherit any clusters and will not be associated to any children downstream of either the left or right population. The population will be added to the tree as a descendant of the left population’s parent. New population will be added to FileGroup.

Parameters

left (Population) –
right (Population) –
new_population_name (str (optional)) –

Returns

Return type

None
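
The "combined index" of a merge can be pictured as a set union over event indices. The helper below is a minimal sketch under that assumption; `merge_indexes` is a hypothetical name, and the real method operates on Population objects and also handles geometry and tree placement, which this example leaves out.

```python
from typing import Iterable, List


def merge_indexes(left: Iterable[int], right: Iterable[int]) -> List[int]:
    """Combine two event indexes, keeping each event exactly once."""
    return sorted(set(left) | set(right))


left_index = [0, 2, 5, 7]
right_index = [2, 3, 7, 9]
merged = merge_indexes(left_index, right_index)
print(merged)  # [0, 2, 3, 5, 7, 9]
```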

merge_non_geom_populations(populations: list, new_population_name: str)

Merge multiple populations that are sourced from either classification or clustering methods. (Not supported for populations from autonomous gates)

Parameters

populations (list) – List of populations to merge
new_population_name (str) – Name of the new population

Returns

Return type

None

Raises

ValueError – If populations is invalid

population_stats(population: str, warn_missing: bool = False)

Returns a dictionary of statistics (number of events, proportion of parent, and proportion of all events) for the requested population.

Parameters

population (str) –
warn_missing (bool (default=False)) –

Returns

Return type

Dict
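
The three statistics are simple ratios of event counts. The sketch below shows the arithmetic on plain dicts; the function name `population_statistics`, the dictionary keys, and the choice to treat the root as its own parent are all assumptions made for this example rather than the library's actual output format.

```python
from typing import Dict, Optional


def population_statistics(
    counts: Dict[str, int],
    parents: Dict[str, Optional[str]],
    population: str,
) -> Dict[str, object]:
    """Event count plus proportion of parent and of root for one population."""
    n = counts[population]
    parent = parents[population]
    # Assumption: the root has no parent, so it is compared against itself
    parent_n = counts[parent] if parent is not None else n
    root_n = counts["root"]
    return {
        "population_name": population,
        "n": n,
        "frac_of_parent": n / parent_n if parent_n else 0.0,
        "frac_of_root": n / root_n if root_n else 0.0,
    }


counts = {"root": 1000, "CD3+": 400, "CD4+": 100}
parents = {"root": None, "CD3+": "root", "CD4+": "CD3+"}
stats = population_statistics(counts, parents, "CD4+")
print(stats)
```

With these counts, CD4+ makes up a quarter of its parent (100 of 400 events) and a tenth of all events (100 of 1000).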

print_population_tree(image: bool = False, path: Optional[str] = None)

Print population tree to stdout or save as an image if ‘image’ is True.

Parameters

image (bool (default=False)) – Save tree as a png image
path (str (optional)) – File path for image, ignored if ‘image’ is False. Defaults to working directory.

Returns

Return type

None

quantile_clean(upper: float = 0.999, lower: float = 0.001)

Iterate over every channel in the flow data and cut the upper and lower quantiles.

Parameters

upper (float (default=0.999)) –
lower (float (default=0.001)) –

Returns

Return type

None
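
One way to picture quantile-based cleaning is shown below, using only the standard library. This is a sketch under assumptions, not the library's implementation: it computes both bounds for every channel on the full data first and then keeps the events that fall inside all of them, whereas the real method may filter channel by channel. The names `quantile` and `rows_within_quantiles` are hypothetical, and the demo uses wide cut-offs (0.125 and 0.875) only so the result is easy to check by hand.

```python
from typing import Dict, List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    below = int(position)
    above = min(below + 1, len(ordered) - 1)
    weight = position - below
    return ordered[below] + weight * (ordered[above] - ordered[below])


def rows_within_quantiles(
    columns: Dict[str, Sequence[float]],
    upper: float = 0.999,
    lower: float = 0.001,
) -> List[int]:
    """Return the row positions whose value lies within bounds in every channel."""
    n_rows = len(next(iter(columns.values())))
    # Lower and upper cut-off for each channel, computed on the full data
    bounds = {
        channel: (quantile(values, lower), quantile(values, upper))
        for channel, values in columns.items()
    }
    return [
        row
        for row in range(n_rows)
        if all(bounds[ch][0] <= columns[ch][row] <= bounds[ch][1] for ch in columns)
    ]


events = {
    "FSC-A": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "SSC-A": [8, 7, 6, 5, 4, 3, 2, 1, 0],
}
kept = rows_within_quantiles(events, upper=0.875, lower=0.125)
print(kept)  # [1, 2, 3, 4, 5, 6, 7]
```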

save(*args, **kwargs)

Save FileGroup and associated populations

Returns
Return type: None

subtract_populations(left: cytopy.data.population.Population, right: cytopy.data.population.Population, new_population_name: Optional[str] = None)

Subtract the right population from the left population. The right population must either have the same parent as the left population or be downstream of the left population. The new population will descend from the same parent as the left population. The new population will have a PolygonGeom geom. New population will be added to FileGroup.

Parameters

left (Population) –
right (Population) –
new_population_name (str (optional)) –

Returns

Return type

None

Raises

ValueError – If left and right population are not sourced from root or Gate
AssertionError – If left and right population do not share the same parent or the right population is not downstream of the left population
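
The parent/downstream requirement and the subtraction itself can be sketched on plain data structures. The helpers `is_valid_subtraction` and `subtract_indexes`, and the `parents` dict (population name to parent name), are hypothetical names for this illustration; the real method works on Population objects and also builds the new PolygonGeom, which is not modelled here.

```python
from typing import Dict, List, Optional, Sequence


def is_valid_subtraction(
    parents: Dict[str, Optional[str]], left: str, right: str
) -> bool:
    """True if `right` shares a parent with `left` or lies downstream of it."""
    if parents[left] == parents[right]:
        return True
    # Walk up from `right` looking for `left` among its ancestors
    node = parents[right]
    while node is not None:
        if node == left:
            return True
        node = parents[node]
    return False


def subtract_indexes(left_index: Sequence[int], right_index: Sequence[int]) -> List[int]:
    """Events of the left population that are not in the right population."""
    excluded = set(right_index)
    return [event for event in left_index if event not in excluded]


parents = {"root": None, "T cells": "root", "B cells": "root", "CD4+": "T cells"}
t_cells = [0, 1, 2, 3, 4]
cd4 = [1, 3]

assert is_valid_subtraction(parents, "T cells", "CD4+")
non_cd4 = subtract_indexes(t_cells, cd4)
print(non_cd4)  # [0, 2, 4]
```

In this example “CD4+” minus “B cells” would be rejected, since the two neither share a parent nor stand in an ancestor relationship.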

update_population(pop: cytopy.data.population.Population)

Replace an existing population. Population to replace identified using ‘population_name’ field. Note: this method does not allow you to edit the

Parameters: pop (Population) – New population object
Returns
Return type: None

cytopy.data.fcs.data_loaded(func: callable) → callable

Decorator that asserts the h5 file corresponding to the FileGroup exists.

Parameters: func (callable) – Function to wrap
Returns: Wrapper function
Return type: callable
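
A decorator of this kind usually follows a common pattern: check a precondition, then delegate to the wrapped method. The sketch below shows that pattern under stated assumptions; `file_must_exist`, the `DemoGroup` class and its `h5path` attribute are hypothetical stand-ins, not the library's code.

```python
import functools
import os
import tempfile


def file_must_exist(func):
    """Decorator: assert that `self.h5path` exists before calling `func`."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Fail early with a clear message rather than deep inside the method
        assert os.path.isfile(self.h5path), f"Could not locate {self.h5path}"
        return func(self, *args, **kwargs)

    return wrapper


class DemoGroup:
    """Hypothetical stand-in for a FileGroup; it only carries a file path."""

    def __init__(self, h5path: str):
        self.h5path = h5path

    @file_must_exist
    def describe(self) -> str:
        return f"data available at {os.path.basename(self.h5path)}"


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.hdf5")
    open(path, "w").close()  # empty placeholder file for the demo
    message = DemoGroup(path).describe()
print(message)  # data available at demo.hdf5
```

Using `functools.wraps` keeps the wrapped method's name and docstring intact, which matters for generated documentation such as this page.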

cytopy.data.fcs.overwrite_or_create(file: h5py._hl.files.File, data: numpy.ndarray, key: str)

Check if node exists in hdf5 file. If it does exist, overwrite it with the given array; otherwise, create a new dataset.

Parameters

file (h5py File object) –
data (Numpy Array) –
key (str) –

Returns

Return type

None

cytopy.data.fcs.population_in_file(func: callable)

Wrapper to test if requested population passed to the given function exists in the given h5 file object

Parameters: func (callable) – Function to wrap
Returns
Return type: callable

cytopy.data.fcs.population_stats(filegroup: cytopy.data.fcs.FileGroup) → pandas.core.frame.DataFrame

Given a FileGroup generate a DataFrame detailing the number of events, proportion of parent population, and proportion of total (root population) for each population in the FileGroup.

Parameters: filegroup (FileGroup) –
Returns
Return type: pandas.DataFrame

cytopy.data.fcs.set_column_names(df: pandas.core.frame.DataFrame, channels: list, markers: list, preference: str = 'markers')

Given a dataframe of fcs events and lists of channels and markers, set the column names according to the given preference.

Parameters

df (pd.DataFrame) –
channels (list) –
markers (list) –
preference (str) – Valid values are: ‘markers’ or ‘channels’

Returns

Return type

pandas.DataFrame

Raises

AssertionError – Preference must be either ‘markers’ or ‘channels’
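
The name-selection step can be sketched without pandas. In the example below, `resolve_column_names` is a hypothetical helper that returns the list of names one would assign to the DataFrame's columns. Falling back to the other name when the preferred one is empty is an assumption made for this sketch (scatter channels, for instance, often have no marker) and is not confirmed by the text above.

```python
from typing import List


def resolve_column_names(
    channels: List[str], markers: List[str], preference: str = "markers"
) -> List[str]:
    """Pick one name per column according to `preference`."""
    assert preference in ("markers", "channels"), (
        "preference must be either 'markers' or 'channels'"
    )
    names = []
    for channel, marker in zip(channels, markers):
        if preference == "markers":
            preferred, fallback = marker, channel
        else:
            preferred, fallback = channel, marker
        # Assumption: use the other name when the preferred one is missing
        names.append(preferred if preferred else fallback)
    return names


channels = ["FSC-A", "BV421-A", "PE-A"]
markers = ["", "CD3", "CD4"]
by_marker = resolve_column_names(channels, markers, preference="markers")
by_channel = resolve_column_names(channels, markers, preference="channels")
print(by_marker)   # ['FSC-A', 'CD3', 'CD4']
print(by_channel)  # ['FSC-A', 'BV421-A', 'PE-A']
```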