Operations

We do not recommend interacting with these functions directly. The core pycytominer API uses these operations internally.

Module containing statistical operations for data processing.

pycytominer.operations.correlation_threshold

Module for correlation threshold operation.

The correlation threshold operation identifies features to exclude so that no two remaining features have a correlation greater than a specified threshold.

correlation_threshold(population_df, features='infer', samples='all', threshold=0.9, method='pearson')

Exclude features that have correlations above a certain threshold.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
threshold : float, default 0.9
    Exclude features with a pairwise correlation above this value. Must be between 0 and 1.
method : str, default "pearson"
    Correlation metric used to test the cutoff.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

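Example — a minimal usage sketch on a small, made-up profile table. The column names are hypothetical but follow the "Cells_"/"Nuclei_"/"Cytoplasm_" prefix convention that features="infer" expects.

import pandas as pd
from pycytominer.operations.correlation_threshold import correlation_threshold

df = pd.DataFrame({
    "Metadata_treatment": ["control", "control", "drug", "drug"],
    "Cells_a": [1.0, 2.0, 3.0, 4.0],
    "Cells_b": [1.1, 2.1, 3.0, 4.2],  # nearly identical to Cells_a
    "Nuclei_c": [4.0, 1.0, 3.0, 2.0],
})

# One member of the highly correlated (Cells_a, Cells_b) pair is returned for exclusion
print(correlation_threshold(df, features="infer", threshold=0.9))
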
Source code in pycytominer/operations/correlation_threshold.py
def correlation_threshold(
    population_df, features="infer", samples="all", threshold=0.9, method="pearson"
):
    """Exclude features that have correlations above a certain threshold.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    threshold : float, default 0.9
        Exclude features with a pairwise correlation above this value. Must be between 0 and 1.
    method : str, default "pearson"
        Correlation metric used to test the cutoff.

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.
    """
    # Check that the input method is supported
    method = check_correlation_method(method)

    if not 0 <= threshold <= 1:
        raise ValueError("threshold variable must be between (0 and 1)")

    # Subset dataframe and calculate correlation matrix across subset features
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    population_df = population_df.loc[:, features]

    # Get correlation matrix and lower triangle of pairwise correlations in long format
    data_cor_df, pairwise_df = get_pairwise_correlation(
        population_df=population_df, method=method
    )

    # Get absolute sum of correlation across features
    # The lower the index, the less correlation to the full data frame
    # We want to drop features with highest correlation, so drop higher index
    variable_cor_sum = data_cor_df.abs().sum().sort_values().index

    # And subset to only variable combinations that pass the threshold
    pairwise_df = pairwise_df.query("correlation > @threshold")

    # Return an empty list if nothing is over correlation threshold
    if pairwise_df.shape[0] == 0:
        return []

    # Output the excluded features
    excluded = pairwise_df.apply(
        lambda x: determine_high_cor_pair(x, variable_cor_sum), axis="columns"
    )

    return list(set(excluded.tolist()))

determine_high_cor_pair(correlation_row, sorted_correlation_pairs)

Select highest correlated variable given a correlation row.

From a row with columns: ["pair_a", "pair_b", "correlation"]. For use in a pandas.apply().

Parameters:

correlation_row : pandas.Series (required)
    Row of the pairwise_df with columns ["pair_a", "pair_b", "correlation"].
sorted_correlation_pairs : pandas.Index (required)
    Feature index sorted (ascending) by total correlation sum with all other features.

Returns:

str
    The feature in the pair with the higher total correlation sum with all other features (i.e., the feature to exclude).
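
Example — a sketch of the row-wise selection with hypothetical feature names. The index is sorted ascending by total correlation sum, so the pair member appearing later in the index is returned.

import pandas as pd
from pycytominer.operations.correlation_threshold import determine_high_cor_pair

sorted_pairs = pd.Index(["Nuclei_c", "Cells_a", "Cells_b"])  # Cells_b correlates most overall
row = pd.Series({"pair_a": "Cells_a", "pair_b": "Cells_b", "correlation": 0.95})

print(determine_high_cor_pair(row, sorted_pairs))  # "Cells_b"
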
Source code in pycytominer/operations/correlation_threshold.py
def determine_high_cor_pair(correlation_row, sorted_correlation_pairs):
    """Select highest correlated variable given a correlation row.

    From a row with columns: ["pair_a", "pair_b", "correlation"]. For use in a pandas.apply().

    Parameters
    ----------
    correlation_row : pandas.core.series.series
        Pandas series of the specific feature in the pairwise_df
    sorted_correlation_pairs : pandas.DataFrame.index
        A sorted object by total correlative sum to all other features

    Returns
    -------
    The feature that has a higher total correlation sum with all other features
    """
    pair_a = correlation_row["pair_a"]
    pair_b = correlation_row["pair_b"]

    if sorted_correlation_pairs.get_loc(pair_a) > sorted_correlation_pairs.get_loc(
        pair_b
    ):
        return pair_a
    else:
        return pair_b

pycytominer.operations.get_na_columns

Function to get columns with NA values above a certain threshold.

Remove variables with a proportion of NA values above a specified threshold. Note: this was called drop_na_columns in cytominer for R.

get_na_columns(population_df, features='infer', samples='all', cutoff=0.05)

Get features that have more NA values than cutoff defined.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
cutoff : float, default 0.05
    Exclude features whose proportion of missing values exceeds this cutoff.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

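Example — a minimal sketch on a toy dataframe with hypothetical columns. With cutoff=0.05, any feature with more than 5% missing values is flagged.

import numpy as np
import pandas as pd
from pycytominer.operations.get_na_columns import get_na_columns

df = pd.DataFrame({
    "Metadata_plate": ["p1", "p1", "p2", "p2"],
    "Cells_a": [1.0, np.nan, 3.0, np.nan],  # 50% missing
    "Nuclei_b": [0.5, 0.7, 0.2, 0.1],       # complete
})

print(get_na_columns(df, features="infer", cutoff=0.05))  # ["Cells_a"]
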
Source code in pycytominer/operations/get_na_columns.py
def get_na_columns(population_df, features="infer", samples="all", cutoff=0.05):
    """Get features that have more NA values than cutoff defined.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    cutoff : float
        Exclude features that have a certain proportion of missingness

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.
    """
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)
    else:
        population_df = population_df.loc[:, features]

    num_rows = population_df.shape[0]
    na_prop_df = population_df.isna().sum() / num_rows

    na_prop_df = na_prop_df[na_prop_df > cutoff]
    return list(set(na_prop_df.index.tolist()))

pycytominer.operations.noise_removal

Remove noisy features, as defined by features with excessive standard deviation within the same perturbation group.

noise_removal(population_df, noise_removal_perturb_groups, features='infer', samples='all', noise_removal_stdev_cutoff=0.8)

Remove features with excessive standard deviation within the same perturbation group.

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
noise_removal_perturb_groups : list or array of str, or str (required)
    The perturbation group for each row in population_df (for example, perturb1_well1 and perturb1_well2 would both be "perturb1"), or the name of the metadata column that holds these group labels.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
noise_removal_stdev_cutoff : float, default 0.8
    Maximum mean standard deviation for a feature to be kept, with features grouped according to the perturbations in noise_removal_perturb_groups.

Returns:

to_remove : list
    A list of features to remove due to having too high a standard deviation within replicate groups.

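Example — a minimal sketch with two replicate wells per perturbation; the feature names and group labels are hypothetical. Features whose mean within-group standard deviation exceeds the cutoff are returned for removal.

import pandas as pd
from pycytominer.operations.noise_removal import noise_removal

df = pd.DataFrame({
    "Metadata_perturbation": ["pert1", "pert1", "pert2", "pert2"],
    "Cells_stable": [1.00, 1.05, 2.00, 2.05],  # replicates agree
    "Cells_noisy": [0.0, 5.0, 1.0, 6.0],       # replicates disagree
})

noisy = noise_removal(
    df,
    noise_removal_perturb_groups="Metadata_perturbation",
    features="infer",
    noise_removal_stdev_cutoff=0.8,
)
print(noisy)  # ["Cells_noisy"]
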
Source code in pycytominer/operations/noise_removal.py
def noise_removal(
    population_df,
    noise_removal_perturb_groups,
    features="infer",
    samples="all",
    noise_removal_stdev_cutoff=0.8,
):
    """Remove features with excessive standard deviation within the same perturbation group.

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    noise_removal_perturb_groups : list or array of str
        The list of unique perturbations corresponding to the rows in population_df. For example,
        perturb1_well1 and perturb1_well2 would both be "perturb1".
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    noise_removal_stdev_cutoff : float
        Maximum mean stdev value for a feature to be kept, with features grouped according to the perturbations in
        noise_removal_perturb_groups.

    Returns
    -------
    to_remove : list
        A list of features to be removed, due to having too high standard deviation within replicate groups.

    """
    # Subset dataframe
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    # If a metadata column name is specified, use that as the perturb groups
    if isinstance(noise_removal_perturb_groups, str):
        if noise_removal_perturb_groups not in population_df.columns:
            raise ValueError(
                f"{noise_removal_perturb_groups} not found. Are you sure it is a metadata column?"
            )
        group_info = population_df[noise_removal_perturb_groups]

    # Otherwise, the user specifies a list of perturbs
    elif isinstance(noise_removal_perturb_groups, list):
        if not len(noise_removal_perturb_groups) == len(population_df):
            raise ValueError(
                f"The length of input list: {len(noise_removal_perturb_groups)} is not equivalent to your "
                f"data: {population_df.shape[0]}"
            )
        group_info = noise_removal_perturb_groups
    else:
        raise TypeError(
            "noise_removal_perturb_groups must be a list corresponding to row perturbations or a str \
                        specifying the name of the metadata column."
        )

    # Subset and df and assign each row with the identity of its perturbation group
    population_df = population_df.loc[:, features]
    population_df = population_df.assign(group_id=group_info)

    # Get the standard deviations of features within each group
    stdev_means_df = population_df.groupby("group_id").std(ddof=0).mean()

    # Identify noisy features with a greater mean stdev within perturbation group than the threshold
    to_remove = stdev_means_df[
        stdev_means_df > noise_removal_stdev_cutoff
    ].index.tolist()

    return to_remove

pycytominer.operations.transform

Transform observation variables by specified groups.

References

.. [1] Kessy et al. 2016 "Optimal Whitening and Decorrelation" arXiv: https://arxiv.org/abs/1512.00809

RobustMAD

Bases: BaseEstimator, TransformerMixin

Class to perform a "Robust" normalization with respect to median and mad.

scaled = (x - median) / mad

Attributes:

epsilon : float
    Fudge factor parameter.

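Example — a minimal sketch of the sklearn-style fit/transform workflow on a hypothetical dataframe. Each column is centered by its median and scaled by its MAD, so the outlier in Cells_a has limited influence on the scaling.

import pandas as pd
from pycytominer.operations.transform import RobustMAD

df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 100.0],  # contains an outlier
    "Nuclei_b": [0.1, 0.2, 0.3, 0.4],
})

scaler = RobustMAD()
print(scaler.fit(df).transform(df))
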
Source code in pycytominer/operations/transform.py
class RobustMAD(BaseEstimator, TransformerMixin):
    """Class to perform a "Robust" normalization with respect to median and mad.

        scaled = (x - median) / mad

    Attributes
    ----------
    epsilon : float
        fudge factor parameter
    """

    def __init__(self, epsilon=1e-18):
        self.epsilon = epsilon

    def fit(self, X, y=None):
        """Compute the median and mad to be used for later scaling.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit RobustMAD transform

        Returns
        -------
        self
            With computed median and mad attributes
        """
        # Compute the median of each feature (column)
        self.median = X.median()
        # The scale param is required to preserve previous behavior. More info at:
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html#scipy.stats.median_absolute_deviation
        self.mad = pd.Series(
            median_abs_deviation(X, nan_policy="omit", scale=1 / 1.4826),
            index=self.median.index,
        )
        return self

    def transform(self, X, copy=None):
        """Apply the RobustMAD calculation.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit RobustMAD transform

        Returns
        -------
        pandas.core.frame.DataFrame
            RobustMAD transformed dataframe
        """
        return (X - self.median) / (self.mad + self.epsilon)

fit(X, y=None)

Compute the median and mad to be used for later scaling.

Parameters:

X : DataFrame (required)
    DataFrame used to fit the RobustMAD transform.

Returns:

self
    With computed median and mad attributes.

Source code in pycytominer/operations/transform.py
def fit(self, X, y=None):
    """Compute the median and mad to be used for later scaling.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit RobustMAD transform

    Returns
    -------
    self
        With computed median and mad attributes
    """
    # Compute the median of each feature (column)
    self.median = X.median()
    # The scale param is required to preserve previous behavior. More info at:
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html#scipy.stats.median_absolute_deviation
    self.mad = pd.Series(
        median_abs_deviation(X, nan_policy="omit", scale=1 / 1.4826),
        index=self.median.index,
    )
    return self

transform(X, copy=None)

Apply the RobustMAD calculation.

Parameters:

X : DataFrame (required)
    DataFrame to apply the RobustMAD transform to.

Returns:

DataFrame
    RobustMAD-transformed dataframe.

Source code in pycytominer/operations/transform.py
def transform(self, X, copy=None):
    """Apply the RobustMAD calculation.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit RobustMAD transform

    Returns
    -------
    pandas.core.frame.DataFrame
        RobustMAD transformed dataframe
    """
    return (X - self.median) / (self.mad + self.epsilon)

Spherize

Bases: BaseEstimator, TransformerMixin

Class to apply a sphering transform (aka whitening) to data using the base sklearn transform API.

This implementation is modified/inspired from the following sources:

1) A custom function written by Juan C. Caicedo
2) A custom ZCA function at https://github.com/mwv/zca
3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4) The R package "whitening" written by Strimmer et al (http://strimmerlab.org/software/whitening/)
5) Kessy et al. 2016 "Optimal Whitening and Decorrelation" [1]_

Attributes:

epsilon : float
    Fudge factor parameter.
center : bool
    Option to center the input X matrix.
method : str
    A string indicating which class of sphering to perform.

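Example — a minimal sketch on random, hypothetical data. After sphering, the feature covariance of the transformed data is approximately the identity matrix.

import numpy as np
import pandas as pd
from pycytominer.operations.transform import Spherize

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["Cells_a", "Cells_b", "Nuclei_c"])

spherizer = Spherize(method="ZCA")  # also accepts "PCA", "PCA-cor", "ZCA-cor"
spherized = spherizer.fit(df).transform(df)

print(np.round(np.cov(spherized.values, rowvar=False), 2))  # approximately the identity matrix
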
Source code in pycytominer/operations/transform.py
class Spherize(BaseEstimator, TransformerMixin):
    """Class to apply a sphering transform (aka whitening) data in the base sklearn transform API.

    This implementation is modified/inspired from the following sources:
    1) A custom function written by Juan C. Caicedo
    2) A custom ZCA function at https://github.com/mwv/zca
    3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
    4) The R package "whitening" written by Strimmer et al (http://strimmerlab.org/software/whitening/)
    5) Kessy et al. 2016 "Optimal Whitening and Decorrelation" [1]_.

    Attributes
    ----------
    epsilon : float
        fudge factor parameter
    center : bool
        option to center the input X matrix
    method : str
        a string indicating which class of sphering to perform
    """

    def __init__(self, epsilon=1e-6, center=True, method="ZCA", return_numpy=False):
        """Construct a Spherize object.

        Parameters
        ----------
        epsilon : float, default 1e-6
            fudge factor parameter
        center : bool, default True
            option to center the input X matrix
        method : str, default "ZCA"
            a string indicating which class of sphering to perform
        return_numpy: bool, default False
            option to return ndarray, instead of dataframe
        """
        avail_methods = ["PCA", "ZCA", "PCA-cor", "ZCA-cor"]

        self.epsilon = epsilon
        self.center = center
        self.return_numpy = return_numpy

        if method not in avail_methods:
            raise ValueError(
                f"Error {method} not supported. Select one of {avail_methods}"
            )
        self.method = method

        # PCA-cor and ZCA-cor require center=True because we assumed we are
        # only ever interested in computing centered Pearson correlation
        # https://stackoverflow.com/questions/23891391/uncentered-pearson-correlation

        if self.method in ["PCA-cor", "ZCA-cor"] and not self.center:
            raise ValueError("PCA-cor and ZCA-cor require center=True")

    def fit(self, X, y=None):
        """Identify the sphering transform given self.X.

        Parameters
        ----------
        X : pandas.core.frame.DataFrame
            dataframe to fit sphering transform

        Returns
        -------
        self
            With computed weights attribute
        """
        # Get Numpy representation of the DataFrame
        X = X.values

        if self.method in ["PCA-cor", "ZCA-cor"]:
            # The projection matrix for PCA-cor and ZCA-cor is the same as the
            # projection matrix for PCA and ZCA, respectively, on the standardized
            # data. So, we first standardize the data, then compute the projection

            self.standard_scaler = StandardScaler().fit(X)
            variances = self.standard_scaler.var_
            if np.any(variances == 0):
                raise ValueError(
                    "Divide by zero error, make sure low variance columns are removed"
                )

            X = self.standard_scaler.transform(X)
        else:
            if self.center:
                self.mean_centerer = StandardScaler(with_mean=True, with_std=False).fit(
                    X
                )
                X = self.mean_centerer.transform(X)

        # Get the number of observations and variables
        n, d = X.shape

        # Checking the rank of matrix X considering the number of samples (n) and features (d).
        r = np.linalg.matrix_rank(X)

        # Case 1: More features than samples (n < d).
        # If centered (mean of each feature subtracted), one dimension becomes dependent, reducing rank to n - 1.
        # If not centered, the max rank is limited by n, as there can't be more independent vectors than samples.
        # Case 2: More samples than features or equal (n >= d).
        # Here, the max rank is limited by d (number of features), assuming each provides unique information.

        if not ((r == d) | (self.center & (r == n - 1)) | ((not self.center) & (r == n))):
            raise ValueError(
                f"The data matrix X is not full rank: n = {n}, d = {d}, r = {r}. "
                "Perfect linear dependencies are unusual in data matrices so something seems amiss. "
                "Check for linear dependencies in the data and remove them."
            )

        # Get the eigenvalues and eigenvectors of the covariance matrix
        # by computing the SVD on the data matrix
        # https://stats.stackexchange.com/q/134282/8494
        _, Sigma, Vt = np.linalg.svd(X, full_matrices=True)

        # if n <= d then Sigma has shape (n,) so it will need to be expanded to
        # d filled with the value r'th element of Sigma
        if n <= d:
            # Do some error checking
            if Sigma.shape[0] != n:
                error_detail = f"When n <= d, Sigma should have shape (n,) i.e. ({n}, ) but it is {Sigma.shape}"
                context = (
                    "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
                )
                raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

            if (r != n - 1) & (r != n):
                error_detail = f"When n <= d, the rank should be n - 1 or n i.e. {n - 1} or {n} but it is {r}"
                context = (
                    "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
                )
                raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

            Sigma = np.concatenate((Sigma[0:r], np.repeat(Sigma[r - 1], d - r)))

        Sigma = Sigma + self.epsilon

        # From https://arxiv.org/abs/1512.00809, the PCA whitening matrix is
        #
        # W = Lambda^(-1/2) E^T
        #
        # where
        # - E is the eigenvector matrix
        # - Lambda is the (diagonal) eigenvalue matrix
        # of cov(X), where X is the data matrix
        # (i.e., cov(X) = E Lambda E^T)
        #
        # However, W can also be computed using the Singular Value Decomposition
        # (SVD) of X. Assuming X is mean-centered (zero mean for each feature),
        # the SVD of X is given by:
        #
        # X = U S V^T
        #
        # The covariance matrix of X can be expressed using its SVD:
        #
        # cov(X) = (X^T X) / (n - 1)
        #        = (V S^2 V^T) / (n - 1)
        #
        # By comparing cov(X) = E * Lambda * E^T with the SVD form, we identify:
        #
        #   E = V (eigenvectors)
        #   Lambda = S^2 / (n - 1) (eigenvalues)
        #
        # Thus, Lambda^(-1/2) can be expressed as:
        #
        #   Lambda^(-1/2) = inv(S) * sqrt(n - 1)
        #
        # Therefore, the PCA Whitening matrix W becomes:
        #
        # W = Lambda^(-1/2) * E^T
        #   = (inv(S) * sqrt(n - 1)) * V^T
        #   = (inv(S) * V^T) * sqrt(n - 1)
        #
        # In NumPy, this is implemented as:
        #
        #   W = (np.linalg.inv(S) @ Vt) * np.sqrt(n - 1)
        #
        # But computing `np.linalg.inv(S)` is memory-intensive.
        #
        # A more memory-efficient alternative is:
        #
        #   W = (Vt / Sigma[:, np.newaxis]) * np.sqrt(n - 1)
        #
        # where Sigma contains the diagonal elements of S (singular values).

        self.W = (Vt / Sigma[:, np.newaxis]).transpose() * np.sqrt(n - 1)

        # If ZCA, perform additional rotation
        if self.method in ["ZCA", "ZCA-cor"]:
            # Note: There was previously a requirement r==d otherwise the
            # ZCA transform would not be well defined. However, it later appeared
            # that this requirement was not necessary.

            self.W = self.W @ Vt

        # number of columns of self.W should be equal to that of X
        if not (self.W.shape[1] == X.shape[1]):
            raise ValueError(
                f"Error: W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
            )

        if self.W.shape[1] != X.shape[1]:
            error_detail = (
                f"The number of columns of W should be equal to that of X."
                f"However, W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
            )
            context = "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        return self

    def transform(self, X, y=None):
        """Perform the sphering transform.

        Parameters
        ----------
        X : pd.core.frame.DataFrame
            Profile dataframe to be transformed using the precomputed weights
        y : None
            Has no effect; only used for consistency in sklearn transform API

        Returns
        -------
        pandas.core.frame.DataFrame
            Spherized dataframe
        """
        columns = X.columns

        # Get Numpy representation of the DataFrame
        X = X.values

        if self.method in ["PCA-cor", "ZCA-cor"]:
            X = self.standard_scaler.transform(X)
        else:
            if self.center:
                X = self.mean_centerer.transform(X)

        if self.method in ["PCA", "PCA-cor"]:
            columns = ["PC" + str(i) for i in range(1, X.shape[1] + 1)]

        self.columns = columns

        XW = X @ self.W

        if self.return_numpy:
            return XW
        else:
            return pd.DataFrame(XW, columns=columns)

__init__(epsilon=1e-06, center=True, method='ZCA', return_numpy=False)

Construct a Spherize object.

Parameters:

epsilon : float, default 1e-6
    Fudge factor parameter.
center : bool, default True
    Option to center the input X matrix.
method : str, default "ZCA"
    A string indicating which class of sphering to perform.
return_numpy : bool, default False
    Option to return an ndarray instead of a dataframe.
Source code in pycytominer/operations/transform.py
def __init__(self, epsilon=1e-6, center=True, method="ZCA", return_numpy=False):
    """Construct a Spherize object.

    Parameters
    ----------
    epsilon : float, default 1e-6
        fudge factor parameter
    center : bool, default True
        option to center the input X matrix
    method : str, default "ZCA"
        a string indicating which class of sphering to perform
    return_numpy: bool, default False
        option to return ndarray, instead of dataframe
    """
    avail_methods = ["PCA", "ZCA", "PCA-cor", "ZCA-cor"]

    self.epsilon = epsilon
    self.center = center
    self.return_numpy = return_numpy

    if method not in avail_methods:
        raise ValueError(
            f"Error {method} not supported. Select one of {avail_methods}"
        )
    self.method = method

    # PCA-cor and ZCA-cor require center=True because we assumed we are
    # only ever interested in computing centered Pearson correlation
    # https://stackoverflow.com/questions/23891391/uncentered-pearson-correlation

    if self.method in ["PCA-cor", "ZCA-cor"] and not self.center:
        raise ValueError("PCA-cor and ZCA-cor require center=True")

fit(X, y=None)

Identify the sphering transform given self.X.

Parameters:

X : DataFrame (required)
    DataFrame used to fit the sphering transform.

Returns:

self
    With computed weights attribute.

Source code in pycytominer/operations/transform.py
def fit(self, X, y=None):
    """Identify the sphering transform given self.X.

    Parameters
    ----------
    X : pandas.core.frame.DataFrame
        dataframe to fit sphering transform

    Returns
    -------
    self
        With computed weights attribute
    """
    # Get Numpy representation of the DataFrame
    X = X.values

    if self.method in ["PCA-cor", "ZCA-cor"]:
        # The projection matrix for PCA-cor and ZCA-cor is the same as the
        # projection matrix for PCA and ZCA, respectively, on the standardized
        # data. So, we first standardize the data, then compute the projection

        self.standard_scaler = StandardScaler().fit(X)
        variances = self.standard_scaler.var_
        if np.any(variances == 0):
            raise ValueError(
                "Divide by zero error, make sure low variance columns are removed"
            )

        X = self.standard_scaler.transform(X)
    else:
        if self.center:
            self.mean_centerer = StandardScaler(with_mean=True, with_std=False).fit(
                X
            )
            X = self.mean_centerer.transform(X)

    # Get the number of observations and variables
    n, d = X.shape

    # Checking the rank of matrix X considering the number of samples (n) and features (d).
    r = np.linalg.matrix_rank(X)

    # Case 1: More features than samples (n < d).
    # If centered (mean of each feature subtracted), one dimension becomes dependent, reducing rank to n - 1.
    # If not centered, the max rank is limited by n, as there can't be more independent vectors than samples.
    # Case 2: More samples than features or equal (n >= d).
    # Here, the max rank is limited by d (number of features), assuming each provides unique information.

    if not ((r == d) | (self.center & (r == n - 1)) | ((not self.center) & (r == n))):
        raise ValueError(
            f"The data matrix X is not full rank: n = {n}, d = {d}, r = {r}. "
            "Perfect linear dependencies are unusual in data matrices so something seems amiss. "
            "Check for linear dependencies in the data and remove them."
        )

    # Get the eigenvalues and eigenvectors of the covariance matrix
    # by computing the SVD on the data matrix
    # https://stats.stackexchange.com/q/134282/8494
    _, Sigma, Vt = np.linalg.svd(X, full_matrices=True)

    # if n <= d then Sigma has shape (n,) so it will need to be expanded to
    # d filled with the value r'th element of Sigma
    if n <= d:
        # Do some error checking
        if Sigma.shape[0] != n:
            error_detail = f"When n <= d, Sigma should have shape (n,) i.e. ({n}, ) but it is {Sigma.shape}"
            context = (
                "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            )
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        if (r != n - 1) & (r != n):
            error_detail = f"When n <= d, the rank should be n - 1 or n i.e. {n - 1} or {n} but it is {r}"
            context = (
                "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
            )
            raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

        Sigma = np.concatenate((Sigma[0:r], np.repeat(Sigma[r - 1], d - r)))

    Sigma = Sigma + self.epsilon

    # From https://arxiv.org/abs/1512.00809, the PCA whitening matrix is
    #
    # W = Lambda^(-1/2) E^T
    #
    # where
    # - E is the eigenvector matrix
    # - Lambda is the (diagonal) eigenvalue matrix
    # of cov(X), where X is the data matrix
    # (i.e., cov(X) = E Lambda E^T)
    #
    # However, W can also be computed using the Singular Value Decomposition
    # (SVD) of X. Assuming X is mean-centered (zero mean for each feature),
    # the SVD of X is given by:
    #
    # X = U S V^T
    #
    # The covariance matrix of X can be expressed using its SVD:
    #
    # cov(X) = (X^T X) / (n - 1)
    #        = (V S^2 V^T) / (n - 1)
    #
    # By comparing cov(X) = E * Lambda * E^T with the SVD form, we identify:
    #
    #   E = V (eigenvectors)
    #   Lambda = S^2 / (n - 1) (eigenvalues)
    #
    # Thus, Lambda^(-1/2) can be expressed as:
    #
    #   Lambda^(-1/2) = inv(S) * sqrt(n - 1)
    #
    # Therefore, the PCA Whitening matrix W becomes:
    #
    # W = Lambda^(-1/2) * E^T
    #   = (inv(S) * sqrt(n - 1)) * V^T
    #   = (inv(S) * V^T) * sqrt(n - 1)
    #
    # In NumPy, this is implemented as:
    #
    #   W = (np.linalg.inv(S) @ Vt) * np.sqrt(n - 1)
    #
    # But computing `np.linalg.inv(S)` is memory-intensive.
    #
    # A more memory-efficient alternative is:
    #
    #   W = (Vt / Sigma[:, np.newaxis]) * np.sqrt(n - 1)
    #
    # where Sigma contains the diagonal elements of S (singular values).

    self.W = (Vt / Sigma[:, np.newaxis]).transpose() * np.sqrt(n - 1)

    # If ZCA, perform additional rotation
    if self.method in ["ZCA", "ZCA-cor"]:
        # Note: There was previously a requirement r==d otherwise the
        # ZCA transform would not be well defined. However, it later appeared
        # that this requirement was not necessary.

        self.W = self.W @ Vt

    # number of columns of self.W should be equal to that of X
    if not (self.W.shape[1] == X.shape[1]):
        raise ValueError(
            f"Error: W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
        )

    if self.W.shape[1] != X.shape[1]:
        error_detail = (
            f"The number of columns of W should be equal to that of X."
            f"However, W has {self.W.shape[1]} columns, X has {X.shape[1]} columns"
        )
        context = "the call to `np.linalg.svd` in `pycytominer.transform.Spherize`"
        raise ValueError(f"{error_detail}. This is likely a bug in {context}.")

    return self

transform(X, y=None)

Perform the sphering transform.

Parameters:

X : DataFrame (required)
    Profile dataframe to be transformed using the precomputed weights.
y : None, default None
    Has no effect; only used for consistency in the sklearn transform API.

Returns:

DataFrame
    Spherized dataframe.

Source code in pycytominer/operations/transform.py
def transform(self, X, y=None):
    """Perform the sphering transform.

    Parameters
    ----------
    X : pd.core.frame.DataFrame
        Profile dataframe to be transformed using the precomputed weights
    y : None
        Has no effect; only used for consistency in sklearn transform API

    Returns
    -------
    pandas.core.frame.DataFrame
        Spherized dataframe
    """
    columns = X.columns

    # Get Numpy representation of the DataFrame
    X = X.values

    if self.method in ["PCA-cor", "ZCA-cor"]:
        X = self.standard_scaler.transform(X)
    else:
        if self.center:
            X = self.mean_centerer.transform(X)

    if self.method in ["PCA", "PCA-cor"]:
        columns = ["PC" + str(i) for i in range(1, X.shape[1] + 1)]

    self.columns = columns

    XW = X @ self.W

    if self.return_numpy:
        return XW
    else:
        return pd.DataFrame(XW, columns=columns)

pycytominer.operations.variance_threshold

Remove variables with near-zero variance.

Modified from caret::nearZeroVar().

calculate_frequency(feature_column, freq_cut)

Calculate the ratio of the second most common feature value to the most common feature value.

Parameters:

feature_column : pandas.Series (required)
    Pandas series of the specific feature in the population_df.
freq_cut : float, default 0.05
    Ratio of the second most common feature value to the most common value. Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut removes features with a large gap between the most common and second most common values (e.g. [1, 1, 1, 1, 0.01, 0.01, ...]).

Returns:

str or nan
    The feature name if it passes the threshold, np.nan otherwise.
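
Example — a sketch of the per-column check used by variance_threshold, with hypothetical series. In the first series the most common value (0) appears five times and the second most common (1) once, so the ratio 1/5 = 0.2 passes a freq_cut of 0.05; a constant column has no second value and returns np.nan.

import pandas as pd
from pycytominer.operations.variance_threshold import calculate_frequency

col = pd.Series([0, 0, 0, 0, 0, 1], name="Cells_a")
print(calculate_frequency(col, freq_cut=0.05))  # "Cells_a"

constant = pd.Series([7, 7, 7], name="Cells_b")
print(calculate_frequency(constant, freq_cut=0.05))  # nan
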
Source code in pycytominer/operations/variance_threshold.py
def calculate_frequency(feature_column, freq_cut):
    """Calculate frequency of second most common to most common feature.

    Parameters
    ----------
    feature_column : pandas.core.series.series
        Pandas series of the specific feature in the population_df
    freq_cut : float, default 0.05
        Ratio (2nd most common feature val / most common). Must range between 0 and 1.
        Remove features lower than freq_cut. A low freq_cut will remove features
        that have large difference between the most common feature and second most
        common feature. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, ...])

    Returns
    -------
    Feature name if it passes the threshold, np.nan otherwise

    """
    val_count = feature_column.value_counts()
    try:
        max_count = val_count.iloc[0]
    except IndexError:
        return np.nan
    try:
        second_max_count = val_count.iloc[1]
    except IndexError:
        return np.nan

    freq = second_max_count / max_count

    if freq < freq_cut:
        return np.nan
    else:
        return feature_column.name

variance_threshold(population_df, features='infer', samples='all', freq_cut=0.05, unique_cut=0.01)

Exclude features that have low variance (low information content).

Parameters:

population_df : DataFrame (required)
    DataFrame that includes metadata and observation features.
features : list, default "infer"
    List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_".
samples : str, default "all"
    Query string defining which samples to use for the calculation. The function uses pd.DataFrame.query(), so structure the string accordingly, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples.
freq_cut : float, default 0.05
    Ratio of the second most common feature value to the most common value. Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut removes features with a large gap between the most common and second most common values (e.g. [1, 1, 1, 1, 0.01, 0.01, ...]).
unique_cut : float, default 0.01
    Ratio of the number of unique feature values to the number of samples. Must range between 0 and 1. Remove features whose ratio is lower than unique_cut. A low unique_cut removes features that have very few distinct measurements relative to the number of samples.

Returns:

excluded_features : list of str
    List of features to exclude from the population_df.

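Example — a minimal sketch on a toy dataframe with hypothetical columns. Cells_constant is dominated by a single value and is flagged at the freq_cut used here, while Cells_varied is kept.

import pandas as pd
from pycytominer.operations.variance_threshold import variance_threshold

df = pd.DataFrame({
    "Metadata_well": ["A01"] * 8,
    "Cells_constant": [1.0] * 7 + [100.0],
    "Cells_varied": [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6],
})

print(variance_threshold(df, features="infer", freq_cut=0.2, unique_cut=0.1))  # ["Cells_constant"]
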
Source code in pycytominer/operations/variance_threshold.py
def variance_threshold(
    population_df, features="infer", samples="all", freq_cut=0.05, unique_cut=0.01
):
    """Exclude features that have low variance (low information content).

    Parameters
    ----------
    population_df : pandas.core.frame.DataFrame
        DataFrame that includes metadata and observation features.
    features : list, default "infer"
         List of features present in the population dataframe [default: "infer"]
         if "infer", then assume cell painting features are those that start with
         "Cells_", "Nuclei_", or "Cytoplasm_".
    samples : str, default "all"
        List of samples to perform operation on. The function uses a pd.DataFrame.query()
        function, so you should  structure samples in this fashion. An example is
        "Metadata_treatment == 'control'" (include all quotes).
        If "all", use all samples to calculate.
    freq_cut : float, default 0.05
        Ratio (2nd most common feature val / most common). Must range between 0 and 1.
        Remove features lower than freq_cut. A low freq_cut will remove features
        that have large difference between the most common feature value and second most
        common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, ...])
    unique_cut : float, default 0.01
        Ratio (num unique features / num samples). Must range between 0 and 1.
        Remove features less than unique cut. A low unique_cut will remove features
        that have very few different measurements compared to the number of samples.

    Returns
    -------
    excluded_features : list of str
         List of features to exclude from the population_df.

    """
    if not 0 <= freq_cut <= 1:
        raise ValueError("freq_cut variable must be between (0 and 1)")
    if not 0 <= unique_cut <= 1:
        raise ValueError("unique_cut variable must be between (0 and 1)")

    # Subset dataframe
    if samples != "all":
        population_df.query(samples, inplace=True)

    if features == "infer":
        features = infer_cp_features(population_df)

    population_df = population_df.loc[:, features]

    # Exclude features with extreme (defined by freq_cut ratio) common values
    excluded_features_freq = population_df.apply(
        lambda x: calculate_frequency(x, freq_cut), axis="rows"
    )

    excluded_features_freq = excluded_features_freq[
        excluded_features_freq.isna()
    ].index.tolist()

    # Exclude features with too many (defined by unique_ratio) values in common
    n = population_df.shape[0]
    num_unique_features = population_df.nunique()

    unique_ratio = num_unique_features / n
    unique_ratio = unique_ratio < unique_cut
    excluded_features_unique = unique_ratio[unique_ratio].index.tolist()

    excluded_features = list(set(excluded_features_freq + excluded_features_unique))
    return excluded_features