cytopy.data.fcs

The fcs module houses all functionality for the management and manipulation of data pertaining to a single biological specimen. This might include multiple cytometry files (primary staining and controls), all of which are housed within the FileGroup document. FileGroups should be generated and accessed through the Experiment class.

Copyright 2020 Ross Burton

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Classes:

FileGroup(*args, **kwargs)

Document representation of a file group; a selection of related fcs files (e.g. a sample and its associated controls).

Functions:

data_loaded(func)

Decorator that asserts the h5 file corresponding to the FileGroup exists.

overwrite_or_create(file, data, key)

Check if node exists in hdf5 file.

population_in_file(func)

Wrapper to test if requested population passed to the given function exists in the given h5 file object

population_stats(filegroup)

Given a FileGroup generate a DataFrame detailing the number of events, proportion of parent population, and proportion of total (root population) for each population in the FileGroup.

set_column_names(df, channels, markers[, …])

Given a dataframe of fcs events and lists of channels and markers, set the column names according to the given preference.

class cytopy.data.fcs.FileGroup(*args, **kwargs)

Document representation of a file group; a selection of related fcs files (e.g. a sample and its associated controls).

primary_id

Unique ID to associate to group

Type

str, required

files

List of File objects

Type

EmbeddedDocList

flags

Warnings associated to file group

Type

str, optional

notes

Additional free text

Type

str, optional

populations

Populations derived from this file group

Type

EmbeddedDocList

gates

Gate objects that have been applied to this file group

Type

EmbeddedDocList

collection_datetime

Date and time of sample collection

Type

DateTime, optional

processing_datetime

Date and time of sample processing

Type

DateTime, optional

valid

True if FileGroup is valid

Type

BooleanField (default=True)

subject

Reference to Subject. If Subject is deleted, this field is nullified but the FileGroup will persist

Type

ReferenceField

Miscellaneous:

DoesNotExist

MultipleObjectsReturned

Methods:

add_ctrl_file(ctrl_id, data, channels, markers)

Add a new control file to this FileGroup.

add_population(population)

Add a new Population to this FileGroup.

delete([delete_hdf5_file])

Delete FileGroup

delete_populations(populations)

Delete given populations.

get_population(population_name)

Given the name of a population associated to the FileGroup, returns the Population object, with index and control index ready loaded.

get_population_by_parent(parent)

Given the name of some parent population, return the Population objects whose parent matches

init_new_file(data, channels, markers)

Under the assumption that this FileGroup has not been previously defined, generate a HDF5 file and initialise the root Population

list_downstream_populations(population)

For a given population find all dependencies

list_populations()

List population names

load_ctrl_population_df(ctrl, population[, …])

Load a population from an associated control.

load_population_df(population[, transform, …])

Load the DataFrame for the events pertaining to a single population.

merge_gate_populations(left, right[, …])

Merge two populations present in the current population tree.

merge_non_geom_populations(populations, …)

Merge multiple populations that are sourced from either classification or clustering methods.

population_stats(population[, warn_missing])

Returns a dictionary of statistics (number of events, proportion of parent, and proportion of all events) for the requested population.

print_population_tree([image, path])

Print population tree to stdout or save as an image if ‘image’ is True.

quantile_clean([upper, lower])

Iterate over every channel in the flow data and cut events beyond the upper and lower quantiles.

save(*args, **kwargs)

Save FileGroup and associated populations

subtract_populations(left, right[, …])

Subtract the right population from the left population.

update_population(pop)

Replace an existing population.

exception DoesNotExist
exception MultipleObjectsReturned
add_ctrl_file(ctrl_id: str, data: numpy.array, channels: List[str], markers: List[str])

Add a new control file to this FileGroup.

Parameters
  • ctrl_id (str) – Name of the control (e.g. “CD45RA FMO” or “HLA-DR isotype control”)

  • data (numpy.ndarray) – Single cell events data obtained for this control

  • channels (list) – List of channel names

  • markers (list) – List of marker names

Returns

Return type

None

Raises

AssertionError – If control already exists

add_population(population: cytopy.data.population.Population)

Add a new Population to this FileGroup.

Parameters

population (Population) –

Returns

Return type

None

Raises
delete(delete_hdf5_file: bool = True, *args, **kwargs)

Delete FileGroup

Parameters

delete_hdf5_file (bool (default=True)) –

Returns

Return type

None

delete_populations(populations: list) → None

Delete given populations. Populations downstream from the deleted population(s) will also be removed.

Parameters

populations (list or str) – Either a list of populations (list of strings) to remove or a single population as a string. If a value of “all” is given, all populations are dropped.

Returns

Return type

None

Raises

AssertionError – If invalid value given for populations

get_population(population_name: str) → cytopy.data.population.Population

Given the name of a population associated to the FileGroup, returns the Population object, with index and control index ready loaded.

Parameters

population_name (str) – Name of population to retrieve from database

Returns

Return type

Population

Raises

MissingPopulationError – If population doesn’t exist

get_population_by_parent(parent: str) → Generator

Given the name of some parent population, return the Population objects whose parent matches

Parameters

parent (str) – Name of the parent population to search for

Returns

List of Populations

Return type

Generator
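
Because the documented return type is a Generator, matches are produced lazily and can be iterated only once. The following is a minimal sketch of that behaviour, not CytoPy's implementation: the `parent_of` mapping and the `children_of` helper are hypothetical stand-ins for the population tree, and names are yielded instead of Population objects.

```python
from typing import Dict, Iterator, Optional


def children_of(parent_of: Dict[str, Optional[str]], parent: str) -> Iterator[str]:
    """Lazily yield the names of populations whose parent matches `parent`.

    `parent_of` maps population name -> parent name (None for the root).
    It is an illustrative stand-in, not a CytoPy data structure.
    """
    for name, this_parent in parent_of.items():
        if this_parent == parent:
            yield name


tree = {"root": None, "T cells": "root", "CD4+": "T cells", "CD8+": "T cells"}

matches = children_of(tree, "T cells")
first_pass = list(matches)   # consumes the generator
second_pass = list(matches)  # nothing left to yield
```

If the result needs to be traversed more than once, wrapping the call in list() up front avoids the empty second pass.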

init_new_file(data: numpy.array, channels: List[str], markers: List[str])

Under the assumption that this FileGroup has not been previously defined, generate a HDF5 file and initialise the root Population

Parameters
  • data (numpy.ndarray) –

  • channels (list) –

  • markers (list) –

Returns

Return type

None

list_downstream_populations(population: str) → list

For a given population find all dependencies

Parameters

population (str) – population name

Returns

List of populations dependent on given population

Return type

list or None

Raises

AssertionError – If Population does not exist
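
Finding all dependencies of a population amounts to walking the population tree below it. The sketch below shows one way to do this with a breadth-first traversal. It is an illustration under assumptions: `parent_of` and `downstream` are hypothetical names, and the sketch returns an empty list where no dependencies exist, whereas the documented return type is "list or None".

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional


def downstream(parent_of: Dict[str, Optional[str]], population: str) -> List[str]:
    """Return every population that descends from `population`.

    `parent_of` maps population name -> parent name (None for the root);
    it is a stand-in for the FileGroup's population tree.
    """
    assert population in parent_of, f"{population} does not exist"

    # Invert the parent mapping so children can be looked up directly.
    children = defaultdict(list)
    for name, parent in parent_of.items():
        if parent is not None:
            children[parent].append(name)

    found = []
    queue = deque(children[population])
    while queue:
        current = queue.popleft()
        found.append(current)
        queue.extend(children[current])
    return found


tree = {
    "root": None,
    "T cells": "root",
    "CD4+": "T cells",
    "CD8+": "T cells",
    "CD4+CD25+": "CD4+",
}
dependents = downstream(tree, "T cells")
```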

list_populations() → list

List population names

Returns

Return type

List

load_ctrl_population_df(ctrl: str, population: str, classifier: str = 'XGBClassifier', classifier_params: Optional[dict] = None, scoring: str = 'balanced_accuracy', transform: str = 'logicle', transform_kwargs: Optional[dict] = None, verbose: bool = True, evaluate_classifier: bool = True, kfolds: int = 5, n_permutations: int = 25, sample_size: int = 10000) → pandas.core.frame.DataFrame

Load a population from an associated control. The assumption here is that control files have been collected at the same time as primary staining and differ by the absence or permutation of a marker/channel/stain. Therefore the population of interest in the primary staining will be used as training data to identify the equivalent population in the control.

The user should specify the control file, the population they want (which MUST already exist in the primary staining) and the type of classifier to use. Additional parameters can be passed to control the classifier, and stratified cross validation with permutation testing will be performed if evaluate_classifier is set to True.

Parameters
  • ctrl (str) – Control file to estimate population for

  • population (str) – Population of interest. MUST already exist in the primary staining.

  • classifier (str (default='XGBClassifier')) – Classifier to use. String value should correspond to a valid Scikit-Learn classifier class name or XGBClassifier for XGBoost.

  • classifier_params (dict, optional) – Additional keyword arguments passed when initiating the classifier

  • scoring (str (default='balanced_accuracy')) – Method used to evaluate the performance of the classifier if evaluate_classifier is True. String value should be one of the functions of Scikit-Learn’s classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html.

  • transform (str (default='logicle')) – Transformation to be applied to data prior to classification

  • transform_kwargs (dict, optional) – Additional keyword arguments applied to Transformer

  • verbose (bool (default=True)) – Whether to provide feedback

  • evaluate_classifier (bool (default=True)) – If True, stratified cross validation with permutation testing is applied prior to predicting the control population, feeding back to stdout the performance of the classifier across k folds and n permutations

  • kfolds (int (default=5)) – Number of cross validation rounds to perform if evaluate_classifier is True

  • n_permutations (int (default=25)) – Number of rounds of permutation testing to perform if evaluate_classifier is True

  • sample_size (int (default=10000)) – Number of events to sample from primary data for training

Returns

Return type

Pandas.DataFrame

Raises
  • AssertionError – If desired population is not found in the primary staining

  • MissingControlError – If the chosen control does not exist

load_population_df(population: str, transform: str = 'logicle', features_to_transform: Optional[list] = None, transform_kwargs: Optional[dict] = None, label_downstream_affiliations: bool = False) → pandas.core.frame.DataFrame

Load the DataFrame for the events pertaining to a single population.

Parameters
  • population (str) – Name of the desired population

  • transform (str or dict, optional (default="logicle")) – Transform to be applied; specify a value of None to not perform any transformation

  • features_to_transform (list, optional) – Features (columns) to be transformed. If not provided, all columns are transformed

  • transform_kwargs (dict, optional) – Additional keyword arguments passed to Transformer

  • label_downstream_affiliations (bool (default=False)) – If True, an additional column named “population_label” will be generated, containing the end node membership of each event. For example, if you choose the CD4+ population and there are subsequent populations belonging to it in a tree like “CD4+ -> CD4+CD25+ -> CD4+CD25+CD45RA+”, then the population label column will contain the name of the lowest possible “leaf” population that an event is assigned to.

Returns

Return type

Pandas.DataFrame

Raises

AssertionError – Invalid population, does not exist
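
The “population_label” behaviour described for label_downstream_affiliations can be pictured as assigning each event the deepest downstream population that contains it. The following is a minimal sketch of that idea using plain dictionaries and sets. It is not CytoPy's implementation: `parent_of`, `index_of` and `leaf_labels` are hypothetical names, and sibling populations are assumed not to share events.

```python
from typing import Dict, Optional, Set


def leaf_labels(parent_of: Dict[str, Optional[str]],
                index_of: Dict[str, Set[int]],
                population: str) -> Dict[int, str]:
    """Map each event in `population` to its deepest downstream population."""

    def depth(name: str) -> int:
        # Number of steps from `name` up to the root.
        steps = 0
        while parent_of[name] is not None:
            name = parent_of[name]
            steps += 1
        return steps

    def is_at_or_below(name: Optional[str]) -> bool:
        # True if `name` is `population` itself or one of its descendants.
        while name is not None:
            if name == population:
                return True
            name = parent_of[name]
        return False

    labels = {event: population for event in index_of[population]}
    # Visit shallow populations first so deeper ones overwrite their labels.
    for name in sorted((n for n in parent_of if is_at_or_below(n)), key=depth):
        for event in index_of[name]:
            if event in labels:
                labels[event] = name
    return labels


parent_of = {
    "root": None,
    "CD4+": "root",
    "CD4+CD25+": "CD4+",
    "CD4+CD25+CD45RA+": "CD4+CD25+",
}
index_of = {
    "root": {0, 1, 2, 3, 4},
    "CD4+": {0, 1, 2, 3},
    "CD4+CD25+": {1, 2, 3},
    "CD4+CD25+CD45RA+": {3},
}
labels = leaf_labels(parent_of, index_of, "CD4+")
```

In this toy tree, event 0 stays labelled “CD4+” because it belongs to no deeper population, while event 3 receives the name of the deepest leaf.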

merge_gate_populations(left: cytopy.data.population.Population, right: cytopy.data.population.Population, new_population_name: Optional[str] = None)

Merge two populations present in the current population tree. The merged population will have the combined index of both populations but will not inherit any clusters and will not be associated to any children downstream of either the left or right population. The population will be added to the tree as a descendant of the left population’s parent. The new population will be added to the FileGroup.

Parameters
Returns

Return type

None
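
The “combined index” can be read as the union of the two populations' event indexes, attached to the left population's parent. The sketch below illustrates that reading only. `Pop` is a hypothetical stand-in for cytopy.data.population.Population, and the default name used when new_population_name is omitted is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pop:
    """Hypothetical, simplified stand-in for a Population."""
    name: str
    parent: str
    index: List[int]


def merge(left: Pop, right: Pop, new_name: Optional[str] = None) -> Pop:
    """Union the event indexes; the result hangs off the left population's parent."""
    combined = sorted(set(left.index) | set(right.index))
    # Assumed default naming scheme, for illustration only.
    name = new_name or f"merge_{left.name}_{right.name}"
    return Pop(name=name, parent=left.parent, index=combined)


merged = merge(Pop("CD4+", "T cells", [1, 2, 3]), Pop("CD8+", "T cells", [3, 4]))
```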

merge_non_geom_populations(populations: list, new_population_name: str)

Merge multiple populations that are sourced from either classification or clustering methods. (Not supported for populations from autonomous gates)

Parameters
  • populations (list) – List of populations to merge

  • new_population_name (str) – Name of the new population

Returns

Return type

None

Raises

ValueError – If populations is invalid

population_stats(population: str, warn_missing: bool = False)

Returns a dictionary of statistics (number of events, proportion of parent, and proportion of all events) for the requested population.

Parameters
  • population (str) –

  • warn_missing (bool (default=False)) –

Returns

Return type

Dict
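
The three statistics are simple ratios of event counts. A minimal sketch of the arithmetic follows. The helper name and the dictionary keys are assumptions chosen for illustration and may not match the keys CytoPy returns.

```python
from typing import Dict, Union


def stats(n_population: int, n_parent: int, n_root: int) -> Dict[str, Union[int, float]]:
    """Number of events plus proportions of the parent and root populations."""
    return {
        "n": n_population,
        # Guard against empty parent/root populations to avoid dividing by zero.
        "prop_of_parent": n_population / n_parent if n_parent else 0.0,
        "prop_of_root": n_population / n_root if n_root else 0.0,
    }


example = stats(n_population=250, n_parent=1000, n_root=5000)
```

Here a population of 250 events within a parent of 1,000 events and a root of 5,000 events makes up a quarter of its parent and a twentieth of all events.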

print_population_tree(image: bool = False, path: Optional[str] = None)

Print population tree to stdout or save as an image if ‘image’ is True.

Parameters
  • image (bool (default=False)) – Save tree as a png image

  • path (str (optional)) – File path for image, ignored if ‘image’ is False. Defaults to working directory.

Returns

Return type

None

quantile_clean(upper: float = 0.999, lower: float = 0.001)

Iterate over every channel in the flow data and cut events beyond the upper and lower quantiles.

Parameters
  • upper (float (default=0.999)) –

  • lower (float (default=0.001)) –

Returns

Return type

None
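
To make the cleaning step concrete, the sketch below trims events per channel using only the standard library. It is an illustration under assumptions, not CytoPy's implementation: the helper names are hypothetical, thresholds are computed with linear interpolation, bounds are treated as inclusive, and an event is kept only if it falls within the bounds in every channel.

```python
from typing import Dict, List


def quantile(values: List[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty list, for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def rows_within_quantiles(events: Dict[str, List[float]],
                          upper: float = 0.999,
                          lower: float = 0.001) -> List[int]:
    """Return the row indexes that lie within [lower, upper] in every channel."""
    n_rows = len(next(iter(events.values())))
    keep = set(range(n_rows))
    for values in events.values():
        low_cut, high_cut = quantile(values, lower), quantile(values, upper)
        keep &= {i for i, v in enumerate(values) if low_cut <= v <= high_cut}
    return sorted(keep)


# One channel with an extreme value at each end, one channel that is constant.
events = {
    "FSC-A": [0.0] + [10.0] * 98 + [1000.0],
    "SSC-A": [5.0] * 100,
}
kept = rows_within_quantiles(events, upper=0.95, lower=0.05)
```

The example uses wider cut-offs than the documented defaults of 0.999 and 0.001 so that the effect is visible on only 100 events.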

save(*args, **kwargs)

Save FileGroup and associated populations

Returns

Return type

None

subtract_populations(left: cytopy.data.population.Population, right: cytopy.data.population.Population, new_population_name: Optional[str] = None)

Subtract the right population from the left population. The right population must either have the same parent as the left population or be downstream of the left population. The new population will descend from the same parent as the left population. The new population will have a PolygonGeom geom. New population will be added to FileGroup.

Parameters
Returns

Return type

None

Raises
  • ValueError – If left and right population are not sourced from root or Gate

  • AssertionError – If left and right population do not share the same parent or the right population is not downstream of the left population
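
The subtraction and its parent/downstream precondition can be sketched with plain sets. This is an illustrative sketch only: `Pop`, `parent_of` and `subtract` are hypothetical stand-ins, the default name is an assumption, and the geometry (PolygonGeom) and source checks described above are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pop:
    """Hypothetical, simplified stand-in for a Population."""
    name: str
    parent: str
    index: List[int]


def subtract(left: Pop, right: Pop,
             parent_of: Dict[str, Optional[str]],
             new_name: Optional[str] = None) -> Pop:
    """Remove the right population's events from the left population."""
    # Walk up from the right population to see whether left is an ancestor.
    ancestor = parent_of.get(right.name)
    is_downstream = False
    while ancestor is not None and not is_downstream:
        is_downstream = ancestor == left.name
        ancestor = parent_of.get(ancestor)

    assert left.parent == right.parent or is_downstream, (
        "right must share left's parent or be downstream of left"
    )
    remaining = sorted(set(left.index) - set(right.index))
    # Assumed default naming scheme, for illustration only.
    name = new_name or f"subtract_{left.name}_{right.name}"
    return Pop(name=name, parent=left.parent, index=remaining)


parent_of = {"root": None, "CD3+": "root", "CD4+": "CD3+", "B cells": "root"}
cd3 = Pop("CD3+", "root", [1, 2, 3, 4])
cd4 = Pop("CD4+", "CD3+", [2, 3])
result = subtract(cd3, cd4, parent_of, new_name="CD3+CD4-")
```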

update_population(pop: cytopy.data.population.Population)

Replace an existing population. Population to replace identified using ‘population_name’ field. Note: this method does not allow you to edit the

Parameters

pop (Population) – New population object

Returns

Return type

None

cytopy.data.fcs.data_loaded(func: callable) → callable

Decorator that asserts the h5 file corresponding to the FileGroup exists.

Parameters

func (callable) – Function to wrap

Returns

Wrapper function

Return type

callable
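
A decorator of this kind typically checks a precondition and then delegates to the wrapped function. The sketch below shows the general pattern. It is not the CytoPy source: the decorator name `file_must_exist`, the `h5path` attribute and the `Demo` class are assumptions made for illustration.

```python
import functools
import os
import tempfile


def file_must_exist(func):
    """Assert that the object's HDF5 file exists before calling `func`."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(self, *args, **kwargs):
        # `h5path` is an assumed attribute holding the file location.
        assert os.path.isfile(self.h5path), f"Could not locate {self.h5path}"
        return func(self, *args, **kwargs)

    return wrapper


class Demo:
    """Hypothetical object that owns a file path."""

    def __init__(self, h5path: str):
        self.h5path = h5path

    @file_must_exist
    def describe(self) -> str:
        return f"data at {self.h5path}"


# The call succeeds while the temporary file exists.
with tempfile.NamedTemporaryFile(suffix=".hdf5") as tmp:
    present = Demo(tmp.name).describe()
```

Calling describe() on an instance whose path does not exist raises AssertionError before the method body runs.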

cytopy.data.fcs.overwrite_or_create(file: h5py._hl.files.File, data: numpy.ndarray, key: str)

Check if node exists in hdf5 file. If it does exist, overwrite with the given array otherwise create a new dataset.

Parameters
  • file (h5py File object) –

  • data (Numpy Array) –

  • key (str) –

Returns

Return type

None

cytopy.data.fcs.population_in_file(func: callable)

Wrapper to test if requested population passed to the given function exists in the given h5 file object

Parameters

func (callable) – Function to wrap

Returns

Return type

callable

cytopy.data.fcs.population_stats(filegroup: cytopy.data.fcs.FileGroup) → pandas.core.frame.DataFrame

Given a FileGroup generate a DataFrame detailing the number of events, proportion of parent population, and proportion of total (root population) for each population in the FileGroup.

Parameters

filegroup (FileGroup) –

Returns

Return type

Pandas.DataFrame

cytopy.data.fcs.set_column_names(df: pandas.core.frame.DataFrame, channels: list, markers: list, preference: str = 'markers')

Given a dataframe of fcs events and lists of channels and markers, set the column names according to the given preference.

Parameters
  • df (pd.DataFrame) –

  • channels (list) –

  • markers (list) –

  • preference (str) – Valid values are: ‘markers’ or ‘channels’

Returns

Return type

Pandas.DataFrame

Raises

AssertionError – Preference must be either ‘markers’ or ‘channels’
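
The naming preference can be illustrated on plain lists, without a DataFrame. In the sketch below, `choose_names` is a hypothetical helper, and falling back to the other list when the preferred name is empty or missing is an assumption that the documentation above does not confirm.

```python
from typing import List, Optional


def choose_names(channels: List[Optional[str]],
                 markers: List[Optional[str]],
                 preference: str = "markers") -> List[Optional[str]]:
    """Pick one column name per channel/marker pair."""
    assert preference in ("markers", "channels"), (
        "Preference must be either 'markers' or 'channels'"
    )
    if preference == "markers":
        preferred, fallback = markers, channels
    else:
        preferred, fallback = channels, markers
    # Assumed behaviour: use the other name when the preferred one is empty.
    return [p if p else f for p, f in zip(preferred, fallback)]


channels = ["FSC-A", "PE-A", "APC-A"]
markers = ["", "CD3", None]
by_marker = choose_names(channels, markers, preference="markers")
by_channel = choose_names(channels, markers, preference="channels")
```

With a DataFrame, the resulting list could then be assigned to df.columns.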