Operations

We do not recommend interacting with these functions directly. The core pycytominer API uses these operations internally.

Module containing statistical operations for data processing.

pycytominer.operations.correlation_threshold

Module for correlation threshold operation.

The correlation threshold operation identifies features to exclude so that no two remaining features have a correlation greater than a specified threshold.

correlation_threshold(population_df, features='infer', samples='all', threshold=0.9, method='pearson')

Exclude features that have correlations above a certain threshold.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
threshold : float, default 0.9
    Exclude features with a pairwise correlation above this value. Must be between 0 and 1.
method : str, default "pearson"
    Correlation metric used to test the cutoff.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

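Example — a minimal usage sketch on a small, made-up profile table. The column names are hypothetical but follow the "Cells_"/"Nuclei_"/"Cytoplasm_" prefix convention that features="infer" expects.

import pandas as pd
from pycytominer.operations.correlation_threshold import correlation_threshold

df = pd.DataFrame({
    "Metadata_treatment": ["control", "control", "drug", "drug"],
    "Cells_a": [1.0, 2.0, 3.0, 4.0],
    "Cells_b": [1.1, 2.1, 3.0, 4.2],  # nearly identical to Cells_a
    "Nuclei_c": [4.0, 1.0, 3.0, 2.0],
})

# One member of the highly correlated (Cells_a, Cells_b) pair is returned for exclusion
print(correlation_threshold(df, features="infer", threshold=0.9))
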
Source code in pycytominer/operations/correlation_threshold.py
def correlation_threshold(
    population_df, features="infer", samples="all", threshold=0.9, method="pearson"
):
    """Exclude features that have correlations above a certain threshold.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    threshold : float, default 0.9
        Exclude features with a pairwise correlation above this value. Must be between 0 and 1.
    method : str, default "pearson"
        Correlation metric used to test the cutoff.

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.
    """
    # Check that the input method is supported
    method = check_correlation_method(method)

    if not 0 <= threshold <= 1:
        raise ValueError("threshold variable must be between (0 and 1)")

    # Subset dataframe and calculate correlation matrix across subset features
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    population_df = population_df.loc[:, features]

    # Get correlation matrix and lower triangle of pairwise correlations in long format
    data_cor_df, pairwise_df = get_pairwise_correlation(
        population_df=population_df, method=method
    )

    # Get absolute sum of correlation across features
    # The lower the index, the less correlation to the full data frame
    # We want to drop features with highest correlation, so drop higher index
    variable_cor_sum = data_cor_df.abs().sum().sort_values().index

    # And subset to only variable combinations that pass the threshold
    pairwise_df = pairwise_df.query("correlation > @threshold")

    # Return an empty list if nothing is over correlation threshold
    if pairwise_df.shape[0] == 0:
        return []

    # Output the excluded features
    excluded = pairwise_df.apply(
        lambda x: determine_high_cor_pair(x, variable_cor_sum), axis="columns"
    )

    return list(set(excluded.tolist()))

determine_high_cor_pair(correlation_row, sorted_correlation_pairs)

Select highest correlated variable given a correlation row.

From a row with columns: ["pair_a", "pair_b", "correlation"]. For use in a pandas.apply().

Parameters:

correlation_row : pandas.Series (required)
    Row of the pairwise_df with columns ["pair_a", "pair_b", "correlation"].
sorted_correlation_pairs : pandas.Index (required)
    Feature index sorted (ascending) by total correlation sum with all other features.

Returns:

str
    The feature in the pair with the higher total correlation sum with all other features (i.e., the feature to exclude).
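
Example — a sketch of the row-wise selection with hypothetical feature names. The index is sorted ascending by total correlation sum, so the pair member appearing later in the index is returned.

import pandas as pd
from pycytominer.operations.correlation_threshold import determine_high_cor_pair

sorted_pairs = pd.Index(["Nuclei_c", "Cells_a", "Cells_b"])  # Cells_b correlates most overall
row = pd.Series({"pair_a": "Cells_a", "pair_b": "Cells_b", "correlation": 0.95})

print(determine_high_cor_pair(row, sorted_pairs))  # "Cells_b"
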
Source code in pycytominer/operations/correlation_threshold.py
def determine_high_cor_pair(correlation_row, sorted_correlation_pairs):
    """Select highest correlated variable given a correlation row.

    From a row with columns: ["pair_a", "pair_b", "correlation"]. For use in a pandas.apply().

    Parameters
    ----------
    correlation_row : pandas.core.series.series
        Pandas series of the specific feature in the pairwise_df
    sorted_correlation_pairs : pandas.DataFrame.index
        A sorted object by total correlative sum to all other features

    Returns
    -------
    The feature that has a higher total correlation sum with all other features
    """
    pair_a = correlation_row["pair_a"]
    pair_b = correlation_row["pair_b"]

    if sorted_correlation_pairs.get_loc(pair_a) > sorted_correlation_pairs.get_loc(
        pair_b
    ):
        return pair_a
    else:
        return pair_b

pycytominer.operations.get_na_columns

Function to get columns with NA values above a certain threshold.

Remove variables with a proportion of NA values above a specified threshold. Note: this was called drop_na_columns in cytominer for R.

get_na_columns(population_df, features='infer', samples='all', cutoff=0.05)

Get features that have more NA values than cutoff defined.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
cutoff : float, default 0.05
    Exclude features whose proportion of missing values exceeds this cutoff.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

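Example — a minimal sketch on a toy dataframe with hypothetical columns. With cutoff=0.05, any feature with more than 5% missing values is flagged.

import numpy as np
import pandas as pd
from pycytominer.operations.get_na_columns import get_na_columns

df = pd.DataFrame({
    "Metadata_plate": ["p1", "p1", "p2", "p2"],
    "Cells_a": [1.0, np.nan, 3.0, np.nan],  # 50% missing
    "Nuclei_b": [0.5, 0.7, 0.2, 0.1],       # complete
})

print(get_na_columns(df, features="infer", cutoff=0.05))  # ["Cells_a"]
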
Source code in pycytominer/operations/get_na_columns.py
def get_na_columns(population_df, features="infer", samples="all", cutoff=0.05):
    """Get features that have more NA values than cutoff defined.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    cutoff : float
        Exclude features that have a certain proportion of missingness

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.
    """
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)
    else:
        population_df = population_df.loc[:, features]

    num_rows = population_df.shape[0]
    na_prop_df = population_df.isna().sum() / num_rows

    na_prop_df = na_prop_df[na_prop_df > cutoff]
    return list(set(na_prop_df.index.tolist()))

pycytominer.operations.noise_removal

Remove noisy features, as defined by features with excessive standard deviation within the same perturbation group.

noise_removal(population_df, noise_removal_perturb_groups, features='infer', samples='all', noise_removal_stdev_cutoff=0.8)

Remove features with excessive standard deviation within the same perturbation group.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
noise_removal_perturb_groups : list or array of str, or str (required)
    The perturbation group for each row in population_df (for example, perturb1_well1 and perturb1_well2 would both be "perturb1"), or the name of the metadata column that holds these group labels.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
noise_removal_stdev_cutoff : float, default 0.8
    Maximum mean standard deviation for a feature to be kept, with features grouped according to the perturbations in noise_removal_perturb_groups.

Returns:

to_remove : list
    A list of features to remove due to having too high a standard deviation within replicate groups.

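Example — a minimal sketch with two replicate wells per perturbation; the feature names and group labels are hypothetical. Features whose mean within-group standard deviation exceeds the cutoff are returned for removal.

import pandas as pd
from pycytominer.operations.noise_removal import noise_removal

df = pd.DataFrame({
    "Metadata_perturbation": ["pert1", "pert1", "pert2", "pert2"],
    "Cells_stable": [1.00, 1.05, 2.00, 2.05],  # replicates agree
    "Cells_noisy": [0.0, 5.0, 1.0, 6.0],       # replicates disagree
})

noisy = noise_removal(
    df,
    noise_removal_perturb_groups="Metadata_perturbation",
    features="infer",
    noise_removal_stdev_cutoff=0.8,
)
print(noisy)  # ["Cells_noisy"]
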
Source code in pycytominer/operations/noise_removal.py
def noise_removal(
    population_df,
    noise_removal_perturb_groups,
    features="infer",
    samples="all",
    noise_removal_stdev_cutoff=0.8,
):
    """Remove features with excessive standard deviation within the same perturbation group.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    noise_removal_perturb_groups : list or array of str
        The list of unique perturbations corresponding to the rows in population_df. For example,
        perturb1_well1 and perturb1_well2 would both be "perturb1".
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    noise_removal_stdev_cutoff : float
        Maximum mean stdev value for a feature to be kept, with features grouped according to the perturbations in
        noise_removal_perturb_groups.

    Returns
    -------
    to_remove : list
        A list of features to be removed, due to having too high standard deviation within replicate groups.

    """
    # Subset dataframe
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    # If a metadata column name is specified, use that as the perturb groups
    if isinstance(noise_removal_perturb_groups, str):
        if noise_removal_perturb_groups not in population_df.columns:
            raise ValueError(
                f"{noise_removal_perturb_groups} not found. Are you sure it is a metadata column?"
            )
        group_info = population_df[noise_removal_perturb_groups]

    # Otherwise, the user specifies a list of perturbs
    elif isinstance(noise_removal_perturb_groups, list):
        if not len(noise_removal_perturb_groups) == len(population_df):
            raise ValueError(
                f"The length of input list: {len(noise_removal_perturb_groups)} is not equivalent to your "
                f"data: {population_df.shape[0]}"
            )
        group_info = noise_removal_perturb_groups
    else:
        raise TypeError(
            "noise_removal_perturb_groups must be a list corresponding to row perturbations or a str \
                        specifying the name of the metadata column."
        )

    # Subset and df and assign each row with the identity of its perturbation group
    population_df = population_df.loc[:, features]
    population_df = population_df.assign(group_id=group_info)

    # Get the standard deviations of features within each group
    stdev_means_df = population_df.groupby("group_id").std(ddof=0).mean()

    # Identify noisy features with a greater mean stdev within perturbation group than the threshold
    to_remove = stdev_means_df[
        stdev_means_df > noise_removal_stdev_cutoff
    ].index.tolist()

    return to_remove

pycytominer.operations.transform

Transform observation variables by specified groups.

References

.. [1] Kessy et al. 2016 "Optimal Whitening and Decorrelation" arXiv: https://arxiv.org/abs/1512.00809

RobustMAD

Bases: BaseEstimator, TransformerMixin

Class to perform a "Robust" normalization with respect to median and mad.

scaled = (x - median) / mad

Attributes:

epsilon : float
    Fudge factor parameter.

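Example — a minimal sketch of the sklearn-style fit/transform workflow on a hypothetical dataframe. Each column is centered by its median and scaled by its MAD, so the outlier in Cells_a has limited influence on the scaling.

import pandas as pd
from pycytominer.operations.transform import RobustMAD

df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 100.0],  # contains an outlier
    "Nuclei_b": [0.1, 0.2, 0.3, 0.4],
})

scaler = RobustMAD()
print(scaler.fit(df).transform(df))
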
Source code in pycytominer/operations/transform.py
class RobustMAD(BaseEstimator, TransformerMixin):
    """Class to perform a "Robust" normalization with respect to median and mad.

        scaled = (x - median) / mad

    Attributes
    ----------
    epsilon : float
        fudge factor parameter
    """

    def __init__(self, epsilon=1e-18):
        self.epsilon = epsilon

    def fit(self, X, y=None):
        """Compute the median and mad to be used for later scaling.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit RobustMAD transform

        Returns
        -------
        self
            With computed median and mad attributes
        """
        # Compute the median of each feature (column)
        self.median = X.median()
        # The scale param is required to preserve previous behavior. More info at:
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html#scipy.stats.median_absolute_deviation
        self.mad = pd.Series(
            median_abs_deviation(X, nan_policy="omit", scale=1 / 1.4826),
            index=self.median.index,
        )
        return self

    def transform(self, X, copy=None):
        """Apply the RobustMAD calculation.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit RobustMAD transform

        Returns
        -------
        pandas.core.frame.DataFrame
            RobustMAD transformed dataframe
        """
        return (X - self.median) / (self.mad + self.epsilon)

fit(X, y=None)

Compute the median and mad to be used for later scaling.

Parameters:

X : DataFrame (required)
    DataFrame used to fit the RobustMAD transform.

Returns:

self
    With computed median and mad attributes.

Source code in pycytominer/operations/transform.py
def fit(self, X, y=None):
    """Compute the median and mad to be used for later scaling.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit RobustMAD transform

    Returns
    -------
    self
        With computed median and mad attributes
    """
    # Compute the median of each feature (column)
    self.median = X.median()
    # The scale param is required to preserve previous behavior. More info at:
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html#scipy.stats.median_absolute_deviation
    self.mad = pd.Series(
        median_abs_deviation(X, nan_policy="omit", scale=1 / 1.4826),
        index=self.median.index,
    )
    return self

transform(X, copy=None)

Apply the RobustMAD calculation.

Parameters:

X : DataFrame (required)
    DataFrame to apply the RobustMAD transform to.

Returns:

DataFrame
    RobustMAD-transformed dataframe.

Source code in pycytominer/operations/transform.py
def transform(self, X, copy=None):
    """Apply the RobustMAD calculation.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit RobustMAD transform

    Returns
    -------
    pandas.core.frame.DataFrame
        RobustMAD transformed dataframe
    """
    return (X - self.median) / (self.mad + self.epsilon)

Spherize

Bases: BaseEstimator, TransformerMixin

Class to apply a sphering transform (aka whitening) to data using the base sklearn transform API.

This implementation is modified/inspired from the following sources:

1) A custom function written by Juan C. Caicedo
2) A custom ZCA function at https://github.com/mwv/zca
3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4) The R package "whitening" written by Strimmer et al (http://strimmerlab.org/software/whitening/)
5) Kessy et al. 2016 "Optimal Whitening and Decorrelation" [1]_

Attributes:

epsilon : float
    Fudge factor parameter.
center : bool
    Option to center the input X matrix.
method : str
    A string indicating which class of sphering to perform.

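Example — a minimal sketch on random, hypothetical data. After sphering, the feature covariance of the transformed data is approximately the identity matrix.

import numpy as np
import pandas as pd
from pycytominer.operations.transform import Spherize

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["Cells_a", "Cells_b", "Nuclei_c"])

spherizer = Spherize(method="ZCA")  # also accepts "PCA", "PCA-cor", "ZCA-cor"
spherized = spherizer.fit(df).transform(df)

print(np.round(np.cov(spherized.values, rowvar=False), 2))  # approximately the identity matrix
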
Source code in pycytominer/operations/transform.py
class Spherize(BaseEstimator, TransformerMixin):
    """Class to apply a sphering transform (aka whitening) data in the base sklearn transform API.

    This implementation is modified/inspired from the following sources:
    1) A custom function written by Juan C. Caicedo
    2) A custom ZCA function at https://github.com/mwv/zca
    3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
    4) The R package "whitening" written by Strimmer et al (http://strimmerlab.org/software/whitening/)
    5) Kessy et al. 2016 "Optimal Whitening and Decorrelation" [1]_.

    Attributes
    ----------
    epsilon : float
        fudge factor parameter
    center : bool
        option to center the input X matrix
    method : str
        a string indicating which class of sphering to perform
    """

    def __init__(self, epsilon=1e-6, center=True, method="ZCA", return_numpy=False):
        """Construct a Spherize object.

        Parameters
        ----------
        epsilon : float, default 1e-6
            fudge factor parameter
        center : bool, default True
            option to center the input X matrix
        method : str, default "ZCA"
            a string indicating which class of sphering to perform
        return_numpy: bool, default False
            option to return ndarray, instead of dataframe
        """
        avail_methods = ["PCA", "ZCA", "PCA-cor", "ZCA-cor"]

        self.epsilon = epsilon
        self.center = center
        self.return_numpy = return_numpy

        if method not in avail_methods:
            raise ValueError(
                f"Error {method} not supported. Select one of {avail_methods}"
            )
        self.method = method

        # PCA-cor and ZCA-cor require center=True because we assumed we are
        # only ever interested in computing centered Pearson correlation
        # https://stackoverflow.com/questions/23891391/uncentered-pearson-correlation

        if self.method in ["PCA-cor", "ZCA-cor"] and not self.center:
            raise ValueError("PCA-cor and ZCA-cor require center=True")

    def fit(self, X, y=None):
        """Identify the sphering transform given self.X.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit sphering transform

        Returns
        -------
        self
            With computed weights attribute
        """
        # Get Numpy representation of the DataFrame
        X = X.values

        if self.method in ["PCA-cor", "ZCA-cor"]:
            # The projection matrix for PCA-cor and ZCA-cor is the same as the
            # projection matrix for PCA and ZCA, respectively, on the standardized
            # data. So, we first standardize the data, then compute the projection

            self.standard_scaler = StandardScaler().fit(X)
            variances = self.standard_scaler.var_
            if np.any(variances == 0):
                raise ValueError(
                    "Divide by zero error, make sure low variance columns are removed"
                )

            X = self.standard_scaler.transform(X)
        else:
            if self.center:
                self.mean_centerer = StandardScaler(with_mean=True, with_std=False).fit(
                    X
                )
                X = self.mean_centerer.transform(X)

        # Get the number of observations and variables
        n, d = X.shape

        # Checking the rank of matrix X considering the number of samples (n) and features (d).
        r = np.linalg.matrix_rank(X)

        # Case 1: More features than samples (n < d).
        # If centered (mean of each feature subtracted), one dimension becomes dependent, reducing rank to n - 1.
        # If not centered, the max rank is limited by n, as there can't be more independent vectors than samples.
        # Case 2: More samples than features or equal (n >= d).
        # Here, the max rank is limited by d (number of features), assuming each provides unique information.

        if not ((r == d) | (self.center & (r == n - 1)) | ((not self.center) & (r == n))):
            raise ValueError(
                f"The data matrix X is not full rank: n = {n}, d = {d}, r = {r}. "
                "Perfect linear dependencies are unusual in data matrices so something seems amiss. "
                "Check for linear dependencies in the data and remove them."
            )

        # Get the eigenvalues and eigenvectors of the covariance matrix
        # by computing the SVD on the data matrix
        # https://stats.stackexchange.com/q/134282/8494
        _, Sigma, Vt = np.linalg.svd(X, full_matrices=True)

        # if n <= d then Sigma has shape (n,) so it will need to be expanded to
        # d filled with the value r'th element of Sigma
        if n <= d:
            # Do some error checking
            if Sigma.shape[0] != n:
                error_detail = f"When n <= d, Sigma should have shape (n,) i.e. ({n}, ) but it is {Sigma.shape}"
                context = (
                    "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
                )
                raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

            if (r != n - 1) & (r != n):
                error_detail = f"When n <= d, the rank should be n - 1 or n i.e. {n - 1} or {n} but it is {r}"
                context = (
                    "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
                )
                raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

            Sigma = np.concatenate((Sigma[0:r], np.repeat(Sigma[r - 1], d - r)))

        Sigma = Sigma + self.epsilon

        # From https://arxiv.org/abs/1512.00809, the PCA whitening matrix is
        #
        # W = Lambda^(-1/2) E^T
        #
        # where
        # - E is the eigenvector matrix
        # - Lambda is the (diagonal) eigenvalue matrix
        # of cov(X), where X is the data matrix
        # (i.e., cov(X) = E Lambda E^T)
        #
        # However, W can also be computed using the Singular Value Decomposition
        # (SVD) of X. Assuming X is mean-centered (zero mean for each feature),
        # the SVD of X is given by:
        #
        # X = U S V^T
        #
        # The covariance matrix of X can be expressed using its SVD:
        #
        # cov(X) = (X^T X) / (n - 1)
        #        = (V S^2 V^T) / (n - 1)
        #
        # By comparing cov(X) = E * Lambda * E^T with the SVD form, we identify:
        #
        #   E = V (eigenvectors)
        #   Lambda = S^2 / (n - 1) (eigenvalues)
        #
        # Thus, Lambda^(-1/2) can be expressed as:
        #
        #   Lambda^(-1/2) = inv(S) * sqrt(n - 1)
        #
        # Therefore, the PCA Whitening matrix W becomes:
        #
        # W = Lambda^(-1/2) * E^T
        #   = (inv(S) * sqrt(n - 1)) * V^T
        #   = (inv(S) * V^T) * sqrt(n - 1)
        #
        # In NumPy, this is implemented as:
        #
        #   W = (np.linalg.inv(S) @ Vt) * np.sqrt(n - 1)
        #
        # But computing `np.linalg.inv(S)` is memory-intensive.
        #
        # A more memory-efficient alternative is:
        #
        #   W = (Vt / Sigma[:, np.newaxis]) * np.sqrt(n - 1)
        #
        # where Sigma contains the diagonal elements of S (singular values).

        self.W = (Vt / Sigma[:, np.newaxis]).transpose() * np.sqrt(n - 1)

        # If ZCA, perform additional rotation
        if self.method in ["ZCA", "ZCA-cor"]:
            # Note: There was previously a requirement r==d otherwise the
            # ZCA transform would not be well defined. However, it later appeared
            # that this requirement was not necessary.

            self.W = self.W @ Vt

        # number of columns of self.W should be equal to that of X
        if not (self.W.shape[1] == X.shape[1]):
            raise ValueError(
                f"Error: W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
            )

        if self.W.shape[1] != X.shape[1]:
            error_detail = (
                f"The number of columns of W should be equal to that of X."
                f"However, W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
            )
            context = "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        return self

    def transform(self, X, y=None):
        """Perform the sphering transform.

        Parameters
        ----------
        X : pd.core.frame.DataFrame
            Profile dataframe to be transformed using the precomputed weights
        y : None
            Has no effect; only used for consistency in sklearn transform API

        Returns
        -------
        pandas.core.frame.DataFrame
            Spherized dataframe
        """
        columns = X.columns

        # Get Numpy representation of the DataFrame
        X = X.values

        if self.method in ["PCA-cor", "ZCA-cor"]:
            X = self.standard_scaler.transform(X)
        else:
            if self.center:
                X = self.mean_centerer.transform(X)

        if self.method in ["PCA", "PCA-cor"]:
            columns = ["PC" + str(i) for i in range(1, X.shape[1] + 1)]

        self.columns = columns

        XW = X @ self.W

        if self.return_numpy:
            return XW
        else:
            return pd.DataFrame(XW, columns=columns)

__init__(epsilon=1e-06, center=True, method='ZCA', return_numpy=False)

Construct a Spherize object.

Parameters:

epsilon : float, default 1e-6
    Fudge factor parameter.
center : bool, default True
    Option to center the input X matrix.
method : str, default "ZCA"
    A string indicating which class of sphering to perform.
return_numpy : bool, default False
    Option to return an ndarray instead of a dataframe.
Source code in pycytominer/operations/transform.py
def __init__(self, epsilon=1e-6, center=True, method="ZCA", return_numpy=False):
    """Construct a Spherize object.

    Parameters
    ----------
    epsilon : float, default 1e-6
        fudge factor parameter
    center : bool, default True
        option to center the input X matrix
    method : str, default "ZCA"
        a string indicating which class of sphering to perform
    return_numpy: bool, default False
        option to return ndarray, instead of dataframe
    """
    avail_methods = ["PCA", "ZCA", "PCA-cor", "ZCA-cor"]

    self.epsilon = epsilon
    self.center = center
    self.return_numpy = return_numpy

    if method not in avail_methods:
        raise ValueError(
            f"Error {method} not supported. Select one of {avail_methods}"
        )
    self.method = method

    # PCA-cor and ZCA-cor require center=True because we assumed we are
    # only ever interested in computing centered Pearson correlation
    # https://stackoverflow.com/questions/23891391/uncentered-pearson-correlation

    if self.method in ["PCA-cor", "ZCA-cor"] and not self.center:
        raise ValueError("PCA-cor and ZCA-cor require center=True")

fit(X, y=None)

Identify the sphering transform given self.X.

Parameters:

X : DataFrame (required)
    DataFrame used to fit the sphering transform.

Returns:

self
    With computed weights attribute.

Source code in pycytominer/operations/transform.py
def fit(self, X, y=None):
    """Identify the sphering transform given self.X.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit sphering transform

    Returns
    -------
    self
        With computed weights attribute
    """
    # Get Numpy representation of the DataFrame
    X = X.values

    if self.method in ["PCA-cor", "ZCA-cor"]:
        # The projection matrix for PCA-cor and ZCA-cor is the same as the
        # projection matrix for PCA and ZCA, respectively, on the standardized
        # data. So, we first standardize the data, then compute the projection

        self.standard_scaler = StandardScaler().fit(X)
        variances = self.standard_scaler.var_
        if np.any(variances == 0):
            raise ValueError(
                "Divide by zero error, make sure low variance columns are removed"
            )

        X = self.standard_scaler.transform(X)
    else:
        if self.center:
            self.mean_centerer = StandardScaler(with_mean=True, with_std=False).fit(
                X
            )
            X = self.mean_centerer.transform(X)

    # Get the number of observations and variables
    n, d = X.shape

    # Checking the rank of matrix X considering the number of samples (n) and features (d).
    r = np.linalg.matrix_rank(X)

    # Case 1: More features than samples (n < d).
    # If centered (mean of each feature subtracted), one dimension becomes dependent, reducing rank to n - 1.
    # If not centered, the max rank is limited by n, as there can't be more independent vectors than samples.
    # Case 2: More samples than features or equal (n >= d).
    # Here, the max rank is limited by d (number of features), assuming each provides unique information.

    if not ((r == d) | (self.center & (r == n - 1)) | ((not self.center) & (r == n))):
        raise ValueError(
            f"The data matrix X is not full rank: n = {n}, d = {d}, r = {r}. "
            "Perfect linear dependencies are unusual in data matrices so something seems amiss. "
            "Check for linear dependencies in the data and remove them."
        )

    # Get the eigenvalues and eigenvectors of the covariance matrix
    # by computing the SVD on the data matrix
    # https://stats.stackexchange.com/q/134282/8494
    _, Sigma, Vt = np.linalg.svd(X, full_matrices=True)

    # if n <= d then Sigma has shape (n,) so it will need to be expanded to
    # d filled with the value r'th element of Sigma
    if n <= d:
        # Do some error checking
        if Sigma.shape[0] != n:
            error_detail = f"When n <= d, Sigma should have shape (n,) i.e. ({n}, ) but it is {Sigma.shape}"
            context = (
                "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            )
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        if (r != n - 1) & (r != n):
            error_detail = f"When n <= d, the rank should be n - 1 or n i.e. {n - 1} or {n} but it is {r}"
            context = (
                "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            )
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        Sigma = np.concatenate((Sigma[0:r], np.repeat(Sigma[r - 1], d - r)))

    Sigma = Sigma + self.epsilon

    # From https://arxiv.org/abs/1512.00809, the PCA whitening matrix is
    #
    # W = Lambda^(-1/2) E^T
    #
    # where
    # - E is the eigenvector matrix
    # - Lambda is the (diagonal) eigenvalue matrix
    # of cov(X), where X is the data matrix
    # (i.e., cov(X) = E Lambda E^T)
    #
    # However, W can also be computed using the Singular Value Decomposition
    # (SVD) of X. Assuming X is mean-centered (zero mean for each feature),
    # the SVD of X is given by:
    #
    # X = U S V^T
    #
    # The covariance matrix of X can be expressed using its SVD:
    #
    # cov(X) = (X^T X) / (n - 1)
    #        = (V S^2 V^T) / (n - 1)
    #
    # By comparing cov(X) = E * Lambda * E^T with the SVD form, we identify:
    #
    #   E = V (eigenvectors)
    #   Lambda = S^2 / (n - 1) (eigenvalues)
    #
    # Thus, Lambda^(-1/2) can be expressed as:
    #
    #   Lambda^(-1/2) = inv(S) * sqrt(n - 1)
    #
    # Therefore, the PCA Whitening matrix W becomes:
    #
    # W = Lambda^(-1/2) * E^T
    #   = (inv(S) * sqrt(n - 1)) * V^T
    #   = (inv(S) * V^T) * sqrt(n - 1)
    #
    # In NumPy, this is implemented as:
    #
    #   W = (np.linalg.inv(S) @ Vt) * np.sqrt(n - 1)
    #
    # But computing `np.linalg.inv(S)` is memory-intensive.
    #
    # A more memory-efficient alternative is:
    #
    #   W = (Vt / Sigma[:, np.newaxis]) * np.sqrt(n - 1)
    #
    # where Sigma contains the diagonal elements of S (singular values).

    self.W = (Vt / Sigma[:, np.newaxis]).transpose() * np.sqrt(n - 1)

    # If ZCA, perform additional rotation
    if self.method in ["ZCA", "ZCA-cor"]:
        # Note: There was previously a requirement r==d otherwise the
        # ZCA transform would not be well defined. However, it later appeared
        # that this requirement was not necessary.

        self.W = self.W @ Vt

    # number of columns of self.W should be equal to that of X
    if not (self.W.shape[1] == X.shape[1]):
        raise ValueError(
            f"Error: W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
        )

    if self.W.shape[1] != X.shape[1]:
        error_detail = (
            f"The number of columns of W should be equal to that of X."
            f"However, W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
        )
        context = "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
        raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

    return self

transform(X, y=None)

Perform the sphering transform.

Parameters:

X : DataFrame (required)
    Profile dataframe to be transformed using the precomputed weights.
y : None, default None
    Has no effect; only used for consistency in the sklearn transform API.

Returns:

DataFrame
    Spherized dataframe.

Source code in pycytominer/operations/transform.py
def transform(self, X, y=None):
    """Perform the sphering transform.

    Parameters
    ----------
    X : pd.core.frame.DataFrame
        Profile dataframe to be transformed using the precomputed weights
    y : None
        Has no effect; only used for consistency in sklearn transform API

    Returns
    -------
    pandas.core.frame.DataFrame
        Spherized dataframe
    """
    columns = X.columns

    # Get Numpy representation of the DataFrame
    X = X.values

    if self.method in ["PCA-cor", "ZCA-cor"]:
        X = self.standard_scaler.transform(X)
    else:
        if self.center:
            X = self.mean_centerer.transform(X)

    if self.method in ["PCA", "PCA-cor"]:
        columns = ["PC" + str(i) for i in range(1, X.shape[1] + 1)]

    self.columns = columns

    XW = X @ self.W

    if self.return_numpy:
        return XW
    else:
        return pd.DataFrame(XW, columns=columns)

pycytominer.operations.variance_threshold

Remove variables with near-zero variance.

Modified from caret::nearZeroVar().

calculate_frequency(feature_column, freq_cut)

Calculate the ratio of the second most common feature value to the most common feature value.

Parameters:

feature_column : pandas.Series (required)
    Pandas series of the specific feature in the population_df.
freq_cut : float, default 0.05
    Ratio of the second most common feature value to the most common value. Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut removes features with a large gap between the most common and second most common values (e.g. [1, 1, 1, 1, 0.01, 0.01, ...]).

Returns:

str or nan
    The feature name if it passes the threshold, np.nan otherwise.
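
Example — a sketch of the per-column check used by variance_threshold, with hypothetical series. In the first series the most common value (0) appears five times and the second most common (1) once, so the ratio 1/5 = 0.2 passes a freq_cut of 0.05; a constant column has no second value and returns np.nan.

import pandas as pd
from pycytominer.operations.variance_threshold import calculate_frequency

col = pd.Series([0, 0, 0, 0, 0, 1], name="Cells_a")
print(calculate_frequency(col, freq_cut=0.05))  # "Cells_a"

constant = pd.Series([7, 7, 7], name="Cells_b")
print(calculate_frequency(constant, freq_cut=0.05))  # nan
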
Source code in pycytominer/operations/variance_threshold.py
def calculate_frequency(feature_column, freq_cut):
    """Calculate frequency of second most common to most common feature.

    Parameters
    ----------
    feature_column : pandas.core.series.series
        Pandas series of the specific feature in the population_df
    freq_cut : float, default 0.05
        Ratio (2nd most common feature val / most common). Must range between 0 and 1.
        Remove features lower than freq_cut. A low freq_cut will remove features
        that have large difference between the most common feature and second most
        common feature. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, ...])

    Returns
    -------
    Feature name if it passes the threshold, np.nan otherwise

    """
    val_count = feature_column.value_counts()
    try:
        max_count = val_count.iloc[0]
    except IndexError:
        return np.nan
    try:
        second_max_count = val_count.iloc[1]
    except IndexError:
        return np.nan

    freq = second_max_count / max_count

    if freq < freq_cut:
        return np.nan
    else:
        return feature_column.name

variance_threshold(population_df, features='infer', samples='all', freq_cut=0.05, unique_cut=0.01)

Exclude features that have low variance (low information content).

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
freq_cut : float, default 0.05
    Ratio of the second most common feature value to the most common value. Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut removes features with a large gap between the most common and second most common values (e.g. [1, 1, 1, 1, 0.01, 0.01, ...]).
unique_cut : float, default 0.01
    Ratio of the number of unique feature values to the number of samples. Must range between 0 and 1. Remove features whose ratio is lower than unique_cut. A low unique_cut removes features that have very few distinct measurements relative to the number of samples.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

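Example — a minimal sketch on a toy dataframe with hypothetical columns. Cells_constant is dominated by a single value and is flagged at the freq_cut used here, while Cells_varied is kept.

import pandas as pd
from pycytominer.operations.variance_threshold import variance_threshold

df = pd.DataFrame({
    "Metadata_well": ["A01"] * 8,
    "Cells_constant": [1.0] * 7 + [100.0],
    "Cells_varied": [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6],
})

print(variance_threshold(df, features="infer", freq_cut=0.2, unique_cut=0.1))  # ["Cells_constant"]
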
Source code in pycytominer/operations/variance_threshold.py
def variance_threshold(
    population_df, features="infer", samples="all", freq_cut=0.05, unique_cut=0.01
):
    """Exclude features that have low variance (low information content).

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    freq_cut : float, default 0.05
        Ratio (2nd most common feature val / most common). Must range between 0 and 1.
        Remove features lower than freq_cut. A low freq_cut will remove features
        that have large difference between the most common feature value and second most
        common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, ...])
    unique_cut : float, default 0.01
        Ratio (num unique features / num samples). Must range between 0 and 1.
        Remove features less than unique cut. A low unique_cut will remove features
        that have very few different measurements compared to the number of samples.

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.

    """
    if not 0 <= freq_cut <= 1:
        raise ValueError("freq_cut variable must be between (0 and 1)")
    if not 0 <= unique_cut <= 1:
        raise ValueError("unique_cut variable must be between (0 and 1)")

    # Subset dataframe
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    population_df = population_df.loc[:, features]

    # Exclude features with extreme (defined by freq_cut ratio) common values
    excluded_features_freq = population_df.apply(
        lambda x: calculate_frequency(x, freq_cut), axis="rows"
    )

    excluded_features_freq = excluded_features_freq[
        excluded_features_freq.isna()
    ].index.tolist()

    # Exclude features with too many (defined by unique_ratio) values in common
    n = population_df.shape[0]
    num_unique_features = population_df.nunique()

    unique_ratio = num_unique_features / n
    unique_ratio = unique_ratio < unique_cut
    excluded_features_unique = unique_ratio[unique_ratio].index.tolist()

    excluded_features = list(set(excluded_features_freq + excluded_features_unique))
    return excluded_features