Operations¶
We do not recommend interacting with these functions directly. The core pycytominer API uses these operations internally.
Module containing statistical operations for data processing.
pycytominer.operations.correlation_threshold
¶
Module for correlation threshold operation.
The correlation threshold operation returns a list of features to exclude such that no two remaining features have a correlation greater than a specified threshold.
correlation_threshold(population_df, features='infer', samples='all', threshold=0.9, method='pearson')
¶
Exclude features that have correlations above a certain threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| population_df | DataFrame | DataFrame that includes metadata and observation features. | required |
| features | list | List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_". | "infer" |
| samples | str | Samples to perform the operation on, given as a pd.DataFrame.query() expression, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples. | "all" |
| threshold | float | Exclude features with a pairwise correlation greater than this value. Must be between 0 and 1. | 0.9 |
| method | str | Correlation metric to use for the cutoff test. | 'pearson' |

Returns:

| Name | Type | Description |
|---|---|---|
| excluded_features | list of str | List of features to exclude from the population_df. |
Source code in pycytominer/operations/correlation_threshold.py
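As an illustrative sketch of the documented behavior (not the library's implementation), the pairwise-correlation exclusion logic looks roughly like this; the dataframe and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical profile data: "Cells_a" and "Cells_b" are nearly identical.
df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 4.0],
    "Cells_b": [1.1, 2.0, 3.1, 4.0],
    "Cells_c": [4.0, 1.0, 3.0, 2.0],
})

# Absolute pairwise Pearson correlations among features.
corr = df.corr(method="pearson").abs()

# Keep the upper triangle only, so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag one feature from each pair correlated above the threshold;
# the real operation uses determine_high_cor_pair to decide which one.
threshold = 0.9
excluded = [col for col in upper.columns if (upper[col] > threshold).any()]
print(excluded)  # ['Cells_b']
```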
determine_high_cor_pair(correlation_row, sorted_correlation_pairs)
¶
Select highest correlated variable given a correlation row.
Given a row with columns ["pair_a", "pair_b", "correlation"]; intended for use in a pandas.apply().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| correlation_row | series | Pandas series of the specific feature in the pairwise_df. | required |
| sorted_correlation_pairs | index | Pandas index sorted by total correlation sum with all other features. | required |

Returns:

| Type | Description |
|---|---|
| str | The feature that has a lower total correlation sum with all other features. |
Source code in pycytominer/operations/correlation_threshold.py
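A minimal sketch of that bookkeeping, assuming the sorted index is ascending by total correlation sum (the feature names are hypothetical; this is not the library's implementation):

```python
import pandas as pd

# A single pairwise-correlation row, shaped like the rows this helper sees.
row = pd.Series({"pair_a": "Cells_a", "pair_b": "Cells_b", "correlation": 0.95})

# Features sorted ascending by their total correlation with all other features.
sorted_pairs = pd.Index(["Cells_b", "Cells_c", "Cells_a"])

# Pick the member of the pair with the lower total correlation sum,
# i.e. the one appearing earlier in the sorted index.
if sorted_pairs.get_loc(row["pair_a"]) < sorted_pairs.get_loc(row["pair_b"]):
    result = row["pair_a"]
else:
    result = row["pair_b"]
print(result)  # 'Cells_b'
```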
pycytominer.operations.get_na_columns
¶
Module to get columns whose proportion of NA values exceeds a given threshold.
Note: This was called drop_na_columns in cytominer for R.
get_na_columns(population_df, features='infer', samples='all', cutoff=0.05)
¶
Get features that have more NA values than cutoff defined.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| population_df | DataFrame | DataFrame that includes metadata and observation features. | required |
| features | list | List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_". | "infer" |
| samples | str | Samples to perform the operation on, given as a pd.DataFrame.query() expression, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples. | "all" |
| cutoff | float | Exclude features with a proportion of missing values greater than this cutoff. | 0.05 |

Returns:

| Name | Type | Description |
|---|---|---|
| excluded_features | list of str | List of features to exclude from the population_df. |
Source code in pycytominer/operations/get_na_columns.py
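A sketch of the cutoff test with plain pandas (hypothetical data; not the library's implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cells_a": [1.0, np.nan, 3.0, 4.0],  # 25% missing
    "Cells_b": [1.0, 2.0, 3.0, 4.0],     # complete
})

cutoff = 0.05
na_frac = df.isna().mean()  # proportion of missing values per feature
excluded = na_frac[na_frac > cutoff].index.tolist()
print(excluded)  # ['Cells_a']
```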
pycytominer.operations.noise_removal
¶
Remove noisy features, as defined by features with excessive standard deviation within the same perturbation group.
noise_removal(population_df, noise_removal_perturb_groups, features='infer', samples='all', noise_removal_stdev_cutoff=0.8)
¶
Remove features with excessive standard deviation within the same perturbation group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| population_df | DataFrame | DataFrame that includes metadata and observation features. | required |
| noise_removal_perturb_groups | list or array of str | The list of unique perturbations corresponding to the rows in population_df. For example, perturb1_well1 and perturb1_well2 would both be "perturb1". | required |
| features | list | List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_". | "infer" |
| samples | str | Samples to perform the operation on, given as a pd.DataFrame.query() expression, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples. | "all" |
| noise_removal_stdev_cutoff | float | Maximum mean stdev value for a feature to be kept, with features grouped according to the perturbations in noise_removal_perturb_groups. | 0.8 |

Returns:

| Name | Type | Description |
|---|---|---|
| to_remove | list | A list of features to be removed, due to having too high standard deviation within replicate groups. |
Source code in pycytominer/operations/noise_removal.py
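A sketch of the within-group stdev test using plain pandas (hypothetical data and labels; not the library's implementation):

```python
import pandas as pd

# Hypothetical replicate-level profiles with one perturbation label per row.
df = pd.DataFrame({
    "perturb": ["p1", "p1", "p2", "p2"],
    "Cells_stable": [1.0, 1.1, 2.0, 2.1],
    "Cells_noisy": [0.0, 9.0, 1.0, 10.0],
})

cutoff = 0.8
# Standard deviation within each perturbation group, then averaged per feature.
mean_stdev = df.groupby("perturb").std().mean()
to_remove = mean_stdev[mean_stdev > cutoff].index.tolist()
print(to_remove)  # ['Cells_noisy']
```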
pycytominer.operations.transform
¶
Transform observation variables by specified groups.
References
.. [1] Kessy et al. 2016 "Optimal Whitening and Decorrelation" arXiv: https://arxiv.org/abs/1512.00809
RobustMAD
¶
Bases: BaseEstimator, TransformerMixin
Class to perform a "robust" normalization with respect to the median and MAD (median absolute deviation).
scaled = (x - median) / mad
Attributes:

| Name | Type | Description |
|---|---|---|
| epsilon | float | fudge factor parameter |
Source code in pycytominer/operations/transform.py
fit(X, y=None)
¶
Compute the median and mad to be used for later scaling.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | dataframe to fit RobustMAD transform | required |

Returns:

| Type | Description |
|---|---|
| self | With computed median and mad attributes |
Source code in pycytominer/operations/transform.py
transform(X, copy=None)
¶
Apply the RobustMAD calculation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | dataframe to apply the RobustMAD transform to | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | RobustMAD transformed dataframe |
Source code in pycytominer/operations/transform.py
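The scaled = (x - median) / mad formula quoted above can be sketched directly (hypothetical data; the library may additionally rescale the MAD, and the epsilon guard is an assumption here):

```python
import pandas as pd

df = pd.DataFrame({"Cells_a": [1.0, 2.0, 3.0, 4.0, 100.0]})

epsilon = 1e-18  # fudge factor guarding against a zero MAD (assumed usage)
median = df.median()
mad = (df - median).abs().median()  # raw median absolute deviation
scaled = (df - median) / (mad + epsilon)
print(scaled["Cells_a"].tolist())  # [-2.0, -1.0, 0.0, 1.0, 97.0]
```

Note how the outlier (100.0) barely shifts the median and MAD, which is the point of the "robust" scaling.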
Spherize
¶
Bases: BaseEstimator, TransformerMixin
Class to apply a sphering transform (aka whitening) to data, following the base sklearn transform API.
This implementation is modified/inspired from the following sources:

1. A custom function written by Juan C. Caicedo
2. A custom ZCA function at https://github.com/mwv/zca
3. Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4. The R package "whitening" written by Strimmer et al (http://strimmerlab.org/software/whitening/)
5. Kessy et al. 2016 "Optimal Whitening and Decorrelation" [1]_
Attributes:

| Name | Type | Description |
|---|---|---|
| epsilon | float | fudge factor parameter |
| center | bool | option to center the input X matrix |
| method | str | a string indicating which class of sphering to perform |
Source code in pycytominer/operations/transform.py
__init__(epsilon=1e-06, center=True, method='ZCA', return_numpy=False)
¶
Construct a Spherize object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| epsilon | float | fudge factor parameter | 1e-6 |
| center | bool | option to center the input X matrix | True |
| method | str | a string indicating which class of sphering to perform | "ZCA" |
| return_numpy | bool | option to return an ndarray instead of a dataframe | False |
Source code in pycytominer/operations/transform.py
fit(X, y=None)
¶
Identify the sphering transform given self.X.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | dataframe to fit sphering transform | required |

Returns:

| Type | Description |
|---|---|
| self | With computed weights attribute |
Source code in pycytominer/operations/transform.py
transform(X, y=None)
¶
Perform the sphering transform.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | Profile dataframe to be transformed using the precompiled weights | required |
| y | None | Has no effect; only used for consistency in sklearn transform API | None |

Returns:

| Type | Description |
|---|---|
| DataFrame | Spherized dataframe |
Source code in pycytominer/operations/transform.py
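A sketch of ZCA whitening with plain NumPy, showing the property the transform targets (this illustrates the method in general, not pycytominer's implementation; the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated data: mix independent columns through a triangular matrix.
X = rng.normal(size=(500, 3)) @ np.array([
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 0.3],
    [0.0, 0.0, 0.7],
])

epsilon = 1e-6
Xc = X - X.mean(axis=0)           # center
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)  # eigendecomposition of the covariance
W = vecs @ np.diag(1.0 / np.sqrt(vals + epsilon)) @ vecs.T  # ZCA weights
Z = Xc @ W

# After sphering, the feature covariance is approximately the identity.
print(np.round(np.cov(Z, rowvar=False), 2))
```

Among whitening variants, ZCA keeps the transformed data as close as possible to the original (in least-squares terms), which is why it is a common default.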
pycytominer.operations.variance_threshold
¶
Remove variables with near-zero variance.
Modified from caret::nearZeroVar().
calculate_frequency(feature_column, freq_cut)
¶
Calculate the ratio of the frequency of the second most common feature value to that of the most common.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| feature_column | series | Pandas series of the specific feature in the population_df. | required |
| freq_cut | float | Ratio (2nd most common feature value / most common). Must range between 0 and 1. Remove features with a ratio lower than freq_cut. A low freq_cut removes features with a large difference between the most common and second most common value (e.g., this would remove a feature like [1, 1, 1, 1, 0.01, 0.01, ...]). | 0.05 |

Returns:

| Type | Description |
|---|---|
| str | Feature name if it passes the threshold, "NA" otherwise |
Source code in pycytominer/operations/variance_threshold.py
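The frequency ratio itself is simple to sketch with pandas (hypothetical data; not the library's implementation):

```python
import pandas as pd

# Nearly constant feature: the 2nd most common value is rare.
feature = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 0.01, 0.01])

freq_cut = 0.05
counts = feature.value_counts()
ratio = counts.iloc[1] / counts.iloc[0]  # 2nd most common / most common
print(ratio)  # 0.25, which passes a freq_cut of 0.05
```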
variance_threshold(population_df, features='infer', samples='all', freq_cut=0.05, unique_cut=0.01)
¶
Exclude features that have low variance (low information content).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| population_df | DataFrame | DataFrame that includes metadata and observation features. | required |
| features | list | List of features present in the population dataframe. If "infer", assume Cell Painting features are those that start with "Cells_", "Nuclei_", or "Cytoplasm_". | "infer" |
| samples | str | Samples to perform the operation on, given as a pd.DataFrame.query() expression, e.g. "Metadata_treatment == 'control'" (include all quotes). If "all", use all samples. | "all" |
| freq_cut | float | Ratio (2nd most common feature value / most common). Must range between 0 and 1. Remove features with a ratio lower than freq_cut. A low freq_cut removes features with a large difference between the most common and second most common value (e.g., this would remove a feature like [1, 1, 1, 1, 0.01, 0.01, ...]). | 0.05 |
| unique_cut | float | Ratio (num unique feature values / num samples). Must range between 0 and 1. Remove features with a ratio less than unique_cut. A low unique_cut removes features with very few distinct measurements compared to the number of samples. | 0.01 |

Returns:

| Name | Type | Description |
|---|---|---|
| excluded_features | list of str | List of features to exclude from the population_df. |
Source code in pycytominer/operations/variance_threshold.py
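Combining the two cuts, a sketch of the documented exclusion logic (hypothetical data; not the library's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "Cells_varied": list(range(30)),  # 30 distinct values
    "Cells_flat": [1] * 29 + [2],     # almost constant
})

freq_cut, unique_cut = 0.05, 0.01
excluded = []
for col in df.columns:
    counts = df[col].value_counts()
    # freq_cut test: 2nd most common value count / most common value count
    freq_ratio = counts.iloc[1] / counts.iloc[0] if len(counts) > 1 else 0.0
    # unique_cut test: number of unique values / number of samples
    unique_ratio = df[col].nunique() / len(df)
    if freq_ratio < freq_cut or unique_ratio < unique_cut:
        excluded.append(col)
print(excluded)  # ['Cells_flat']
```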