Preprocessing
Low-level preprocessing functions available via sclab.preprocess.
Quality Control
qc
qc(adata: AnnData, counts_layer: str = 'counts', min_counts: int = 50, min_genes: int = 5, min_cells: int = 5, max_rank: int = 0)
Compute quality-control metrics and apply initial cell/gene filters.
Temporarily sets adata.X to the counts layer to calculate QC metrics,
then restores the original X. Adds a barcode_rank column to
adata.obs (rank by descending total counts).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. Modified in-place. | required |
| counts_layer | str | Layer containing raw counts. Created from | 'counts' |
| min_counts | int | Minimum total counts per cell. Cells below this threshold are removed before QC metrics are computed. Default is 50. | 50 |
| min_genes | int | Minimum number of genes detected per cell. Default is 5. | 5 |
| min_cells | int | Minimum number of cells a gene must be detected in. Default is 5. | 5 |
| max_rank | int | If > 0, keep only cells with | 0 |
Returns:
| Type | Description |
|---|---|
| None | Modifies |
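The filters and the barcode rank described above can be pictured with plain NumPy. This is a minimal sketch, not sclab's implementation: the helper name `barcode_rank_and_filter` and the toy matrix are hypothetical, the order in which the filters run is an assumption, and so is the choice of starting the rank at 0. The `max_rank` filter is left out.

```python
import numpy as np


def barcode_rank_and_filter(counts, min_counts=50, min_genes=5, min_cells=5):
    """Hypothetical helper: filter a cells x genes count matrix and rank cells."""
    counts = np.asarray(counts)
    # Keep cells whose total counts reach min_counts.
    counts = counts[counts.sum(axis=1) >= min_counts]
    # Keep cells with at least min_genes detected genes (count > 0).
    counts = counts[(counts > 0).sum(axis=1) >= min_genes]
    # Keep genes detected in at least min_cells of the remaining cells.
    counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]
    total = counts.sum(axis=1)
    # Rank by descending total counts; rank 0 (assumed) is the largest cell.
    order = np.argsort(-total, kind="stable")
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return counts, rank


toy = np.array([
    [5, 0, 3, 0],
    [1, 1, 1, 1],
    [9, 2, 0, 4],
    [0, 0, 1, 0],
])
filtered, barcode_rank = barcode_rank_and_filter(
    toy, min_counts=4, min_genes=2, min_cells=2
)
```

In this toy input the last cell has a single count and is dropped by `min_counts`; the three remaining cells are ranked by their totals.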
Normalization & Transformation
preprocess
preprocess(adata: AnnData, counts_layer: str = 'counts', group_by: str | None = None, min_cells: int = 5, min_genes: int = 5, compute_hvg: bool = True, regress_total_counts: bool = False, regress_n_genes: bool = False, normalization_method: Literal['library', 'weighted', 'none'] = 'library', target_scale: float = 10000.0, log1p: bool = True, scale: bool = True)
Normalize, transform, and scale single-cell RNA-seq count data.
Applies a configurable preprocessing pipeline: optional filtering,
highly-variable gene selection, normalization, log1p transformation,
optional covariate regression, and per-group scaling. The resulting
processed matrix is stored in a new named layer whose suffix encodes
the applied steps (e.g. counts_normt_log1p_scale).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. Modified in-place. | required |
| counts_layer | str | Layer containing raw counts. Default is | 'counts' |
| group_by | str or None | Column in | None |
| min_cells | int | Minimum number of cells a gene must be detected in to be retained. Default is 5. | 5 |
| min_genes | int | Minimum number of genes detected per cell to be retained. Default is 5. | 5 |
| compute_hvg | bool | If True, compute highly variable genes (union of Seurat and Seurat v3 selections) and store the result in | True |
| regress_total_counts | bool | If True, regress out total counts (or log1p total counts if | False |
| regress_n_genes | bool | If True, regress out the number of detected genes per cell. Default is False. | False |
| normalization_method | Literal['library', 'weighted', 'none'] | Normalization strategy. | 'library' |
| target_scale | float | Target sum for library-size normalization (counts per cell after normalization). Default is 1e4. | 10000.0 |
| log1p | bool | If True, apply log(x + 1) transformation after normalization. Default is True. | True |
| scale | bool | If True, scale each gene to unit variance (zero-center disabled). Applied per group when | True |
Returns:
| Type | Description |
|---|---|
| None | Modifies |
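The core numeric steps of the default pipeline (library-size normalization, log1p, and scaling without zero-centering) can be sketched as follows. This is an illustration under stated assumptions, not the code behind preprocess: the helper `normalize_log_scale` is hypothetical, it works on a dense cells x genes array, it uses the sample standard deviation (ddof=1), and it skips filtering, highly-variable gene selection, regression, and per-group handling.

```python
import numpy as np


def normalize_log_scale(counts, target_scale=1e4):
    """Hypothetical helper: library-size normalize, log1p, scale to unit variance."""
    counts = np.asarray(counts, dtype=float)
    # Library-size normalization: every cell sums to target_scale.
    library = counts.sum(axis=1, keepdims=True)
    normalized = counts / library * target_scale
    # log(x + 1) transformation.
    logged = np.log1p(normalized)
    # Scale each gene to unit variance without subtracting the mean,
    # so zeros stay zeros (zero-centering disabled).
    std = logged.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant genes untouched
    return normalized, logged / std


toy = np.array([
    [1, 0, 3],
    [2, 2, 0],
    [0, 5, 5],
])
normalized, scaled = normalize_log_scale(toy)
```

Because the mean is not subtracted, entries that were zero in the count matrix remain exactly zero after scaling, which keeps sparse matrices sparse.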
normalize_weighted
normalize_weighted(adata: AnnData, target_scale: float | None = None, batch_key: str | None = None) -> None
Normalize counts using entropy-weighted library-size normalization.
Each gene's contribution to each cell's library size is weighted by the
information-entropy of that gene's count distribution across cells. This
up-weights ubiquitously expressed genes in the library-size calculation,
so that normalization is driven primarily by housekeeping genes rather
than informative ones. When batch_key is provided, normalization is
applied independently within each batch so that cross-batch count
differences do not confound the weights.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. | required |
| target_scale | float or None | Target library size after normalization. If None, 1e4 is used. Default is None. | None |
| batch_key | str or None | Column in | None |
Returns:
| Type | Description |
|---|---|
| None | Updates |
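One way to read the entropy weighting described above is sketched below. The exact formula used by normalize_weighted is not given here, so every detail is an assumption: the helper name `entropy_weighted_normalize` is hypothetical, the weights are Shannon entropies rescaled to have mean 1, and cells are rescaled so that their weighted library size equals `target_scale`. Per-batch handling is omitted.

```python
import numpy as np


def entropy_weighted_normalize(counts, target_scale=1e4):
    """Hypothetical sketch of entropy-weighted library-size normalization."""
    counts = np.asarray(counts, dtype=float)
    # Per-gene distribution of counts across cells (each column sums to 1).
    gene_totals = counts.sum(axis=0, keepdims=True)
    p = np.divide(counts, gene_totals, out=np.zeros_like(counts),
                  where=gene_totals > 0)
    # Shannon entropy per gene; 0 * log(0) is treated as 0.
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    entropy = -(p * logp).sum(axis=0)
    # Assumed weighting: entropies rescaled to mean 1, so equal entropies
    # reduce to ordinary library-size normalization.
    weights = entropy / entropy.mean()
    # Weighted library size per cell, then rescale to target_scale.
    weighted_library = counts @ weights
    normalized = counts / weighted_library[:, None] * target_scale
    return weights, normalized


toy = np.array([
    [10, 0, 4],
    [10, 0, 1],
    [10, 8, 0],
    [10, 0, 3],
])
weights, normalized = entropy_weighted_normalize(toy)
```

In the toy matrix the first gene is expressed evenly in every cell and receives the largest weight, while the second gene is confined to one cell, has zero entropy, and does not contribute to the library size at all.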
Dimensionality Reduction
pca
pca(adata: AnnData, layer: str | None = None, n_comps: int = 30, mask_var: str | None = None, batch_key: str | None = None, reference_batch: str | None = None, zero_center: bool = False)
Compute principal components and project all cells onto the PCA space.
When reference_batch is provided, PCA is fitted on the reference
batch only and all cells are projected onto those principal components.
This prevents the PC axes from being dominated by batch effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. Modified in-place. | required |
| layer | str or None | Layer to use as input to PCA. Uses | None |
| n_comps | int | Number of principal components to compute. Default is 30. | 30 |
| mask_var | str or None | Boolean column in | None |
| batch_key | str or None | Column in | None |
| reference_batch | str or None | Batch value in | None |
| zero_center | bool | If True, subtract the mean of the PC coordinates so that the embedding is centred at the origin. Default is False. | False |
Returns:
| Type | Description |
|---|---|
| None | Modifies |
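Fitting on a reference batch and projecting every cell can be illustrated with an SVD in NumPy. This is a sketch of the general technique, not of sclab's pca: the helper `reference_pca` is hypothetical, and centring the input with the reference-batch mean is an assumption made for the example.

```python
import numpy as np


def reference_pca(X, batches, reference_batch, n_comps=2):
    """Hypothetical sketch: fit PCA on one batch, project every cell."""
    X = np.asarray(X, dtype=float)
    ref = X[np.asarray(batches) == reference_batch]
    # Fit on the reference batch: centre with its mean and take the top
    # right-singular vectors as principal axes.
    mean = ref.mean(axis=0)
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    components = vt[:n_comps]
    # Project all cells (reference and non-reference) onto those axes.
    return (X - mean) @ components.T, components


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
batches = np.array(["a"] * 10 + ["b"] * 10)
X[batches == "b"] += 3.0  # simulated batch shift
embedding, components = reference_pca(X, batches, reference_batch="a")
```

Since the axes are estimated from batch "a" alone, the shift added to batch "b" cannot become a principal axis; it only shows up as an offset in the projected coordinates of those cells.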
Batch Integration
cca_integrate
cca_integrate(adata: AnnData, key: str, *, basis: str = 'X', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, mask_var: str | None = None, n_components: int = 30, svd_solver: str = 'randomized', normalize: bool = True, random_state: int | None = None)
harmony_integrate
harmony_integrate(adata: AnnData, key: str | Sequence[str], *, basis: str = 'X_pca', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, **kwargs)
Integrate batch embeddings using Harmony.
Runs the Harmony algorithm on a cell embedding (default X_pca) to
remove batch effects. The corrected embedding is stored in a new
obsm key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. Modified in-place. | required |
| key | str or sequence of str | Column(s) in | required |
| basis | str | Key in | 'X_pca' |
| adjusted_basis | str or None | Key in | None |
| reference_batch | str or list of str or None | Batch value(s) to use as reference. Reference cells are kept fixed during Harmony correction. Default is None. | None |
| **kwargs | | Additional keyword arguments forwarded to :func: | {} |
Returns:
| Type | Description |
|---|---|
| None | Stores the corrected embedding in |
References
Korsunsky et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16, 1289–1296. https://doi.org/10.1038/s41592-019-0619-0
Filtering
filter_obs
filter_obs(adata: AnnData, *, layer: str | None = None, min_counts: int | None = None, min_genes: int | None = None, max_counts: int | None = None, max_cells: int | None = None) -> None
Filter observations (cells) based on count and gene-detection thresholds.
All filtering criteria are applied simultaneously; cells that fail any active criterion are removed. Only criteria that are not None are applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. Modified in-place. | required |
| layer | str or None | Layer to use for count computations. Uses | None |
| min_counts | int or None | Minimum total counts per cell. Cells with fewer counts are removed. | None |
| min_genes | int or None | Minimum number of genes detected (count > 0) per cell. | None |
| max_counts | int or None | Maximum total counts per cell. Cells with more counts are removed. | None |
| max_cells | int or None | Maximum number of cells to retain, keeping those with the highest total counts (i.e. keep the top max_cells cells by total counts). | None |
Returns:
| Type | Description |
|---|---|
| None | Modifies |
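The way the criteria combine can be sketched as a single boolean mask. The helper `filter_mask` below is hypothetical and operates on a dense cells x genes array. Computing the `max_cells` ranking over all cells, rather than over the cells that survive the other thresholds, is one reading of "applied simultaneously" and is an assumption.

```python
import numpy as np


def filter_mask(counts, min_counts=None, min_genes=None,
                max_counts=None, max_cells=None):
    """Hypothetical helper: mask of cells that pass every active criterion."""
    counts = np.asarray(counts)
    total = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)
    keep = np.ones(counts.shape[0], dtype=bool)
    # Criteria left as None are skipped.
    if min_counts is not None:
        keep &= total >= min_counts
    if min_genes is not None:
        keep &= n_genes >= min_genes
    if max_counts is not None:
        keep &= total <= max_counts
    if max_cells is not None:
        # Keep the max_cells cells with the highest total counts.
        top = np.argsort(-total, kind="stable")[:max_cells]
        in_top = np.zeros_like(keep)
        in_top[top] = True
        keep &= in_top
    return keep


toy = np.array([
    [5, 0, 3],
    [1, 1, 1],
    [9, 2, 4],
    [0, 0, 1],
])
by_counts = filter_mask(toy, min_counts=2, max_counts=10)
top_two = filter_mask(toy, max_cells=2)
```

A cell is kept only if it passes every criterion that was supplied; calling the helper with no criteria keeps all cells.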
subset_obs
Subset observations (rows) in an AnnData object.
This function modifies the AnnData object in-place by selecting a subset of observations based on the provided subset parameter. The subsetting can be done using observation names, integer indices, a boolean mask, a query string, or a pandas Index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | The annotated data matrix to subset. Will be modified in-place. | required |
| subset | Index \| Sequence[str \| int \| bool] \| str | The subset specification. Can be one of: a pandas Index containing observation names; a sequence of observation names (strings); a sequence of integer indices; a boolean mask of length | required |
Examples:
>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from sclab.preprocess import subset_obs
>>>
>>> obs = pd.DataFrame(
... index=['A', 'B', 'C'],
... data={'cell_type': ['type1', 'type2', 'type2']})
>>> adata_ = anndata.AnnData(obs=obs)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_obs(adata, pd.Index(['B', 'C']))
>>> adata.obs_names.tolist()
['B', 'C']
>>>
>>> # Subset using observation names
>>> adata = adata_.copy()
>>> subset_obs(adata, ['A', 'B'])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_obs(adata, [0, 1])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_obs(adata, [True, False, True])
>>> adata.obs_names.tolist()
['A', 'C']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_obs(adata, 'cell_type == "type2"')
>>> adata.obs_names.tolist()
['B', 'C']
Notes
- The function modifies the AnnData object in-place
- When using a boolean mask, its length must match the number of observations
- When using integer indices, they must be valid indices for the observations
- Invalid observation names or indices will raise KeyError or IndexError respectively
- The order of observations in the output will match the order in the subset parameter
subset_var
Subset variables (columns) in an AnnData object.
This function modifies the AnnData object in-place by selecting a subset of variables based on the provided subset parameter. The subsetting can be done using variable names, integer indices, a boolean mask, a query string, or a pandas Index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | The annotated data matrix to subset. Will be modified in-place. | required |
| subset | Index \| Sequence[str \| int \| bool] \| str | The subset specification. Can be one of: a pandas Index containing variable names; a sequence of variable names (strings); a sequence of integer indices; a boolean mask of length | required |
Examples:
>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from sclab.preprocess import subset_var
>>>
>>> var = pd.DataFrame(
... index=['gene1', 'gene2', 'gene3'],
... data={'gene_type': ['type1', 'type2', 'type1']})
>>> adata_ = anndata.AnnData(var=var)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_var(adata, pd.Index(['gene2', 'gene3']))
>>> adata.var_names.tolist()
['gene2', 'gene3']
>>>
>>> # Subset using variable names
>>> adata = adata_.copy()
>>> subset_var(adata, ['gene1', 'gene2'])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_var(adata, [0, 1])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_var(adata, [True, False, True])
>>> adata.var_names.tolist()
['gene1', 'gene3']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_var(adata, 'gene_type == "type1"')
>>> adata.var_names.tolist()
['gene1', 'gene3']
Notes
- The function modifies the AnnData object in-place
- When using a boolean mask, its length must match the number of variables
- When using integer indices, they must be valid indices for the variables
- Invalid variable names or indices will raise KeyError or IndexError respectively
- The order of variables in the output will match the order in the subset parameter
Metadata
transfer_metadata
transfer_metadata(adata: AnnData, group_key: str, source_group: str, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')
Transfer a metadata column from a source group to the rest of the cells.
Uses the k-nearest-neighbor graph (adata.obsp["connectivities"] and
adata.obsp["distances"]) to propagate values from labeled cells
(source_group) to unlabeled cells. Results are stored as new columns
with the transferred_ prefix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix with a computed neighbor graph. Modified in-place. | required |
| group_key | str | Column in | required |
| source_group | str | Value in | required |
| column | str | Column in | required |
| periodic | bool | If True, treat | False |
| vmin | float | Minimum value for periodic wrapping. Default is 0. | 0 |
| vmax | float | Maximum value for periodic wrapping. Default is 1. | 1 |
| min_neighs | int | Minimum number of labeled neighbors required to assign a value. Cells with fewer labeled neighbors are left as NaN. Default is 5. | 5 |
| weight_by | Literal['connectivity', 'distance', 'constant'] | How to weight neighbors when aggregating values. | 'connectivity' |
Returns:
| Type | Description |
|---|---|
| None | Adds |
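For a non-periodic numeric column, the transfer can be pictured as a connectivity-weighted average over labeled neighbors. The sketch below is a hypothetical illustration on a dense matrix, not sclab's implementation: the helper `transfer_values` is invented for the example, source cells are assumed to keep their own values, and only the 'connectivity' weighting is shown.

```python
import numpy as np


def transfer_values(connectivities, values, is_source, min_neighs=5):
    """Hypothetical sketch: connectivity-weighted average of source neighbors."""
    W = np.asarray(connectivities, dtype=float)
    values = np.asarray(values, dtype=float)
    is_source = np.asarray(is_source, dtype=bool)
    # Only edges that point at labeled (source) cells contribute.
    W_src = W * is_source[None, :]
    n_labeled = (W_src > 0).sum(axis=1)
    weight_sum = W_src.sum(axis=1)
    weighted = W_src @ np.where(is_source, values, 0.0)
    out = np.full(values.shape, np.nan)
    # Assumption: source cells keep their own values.
    out[is_source] = values[is_source]
    # Cells with too few labeled neighbors stay NaN.
    ok = (n_labeled >= min_neighs) & ~is_source
    out[ok] = weighted[ok] / weight_sum[ok]
    return out


# Row i holds the connectivities from cell i to its neighbors.
W = np.array([
    [0.00, 1.00, 0.0, 0.0],
    [1.00, 0.00, 0.0, 0.0],
    [0.75, 0.25, 0.0, 0.0],
    [0.50, 0.00, 0.5, 0.0],
])
values = np.array([1.0, 3.0, np.nan, np.nan])
is_source = np.array([True, True, False, False])
transferred = transfer_values(W, values, is_source, min_neighs=2)
```

The third cell has two labeled neighbors and receives their weighted average; the fourth has only one labeled neighbor, falls below `min_neighs=2`, and is left as NaN.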
propagate_metadata
propagate_metadata(adata: AnnData, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')
Fill missing values in a metadata column by propagation through the neighbor graph.
Cells that already have a value in column are used as anchors; NaN
cells receive an estimated value from their labeled neighbors. Useful
for imputing partially annotated metadata (e.g. pseudotime or cell-type
labels) based on the k-NN graph structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix with a computed neighbor graph. Modified in-place. | required |
| column | str | Column in | required |
| periodic | bool | If True, treat the variable as periodic (circular). Default is False. | False |
| vmin | float | Minimum value for periodic wrapping. Default is 0. | 0 |
| vmax | float | Maximum value for periodic wrapping. Default is 1. | 1 |
| min_neighs | int | Minimum number of labeled neighbors required to assign a value. Default is 5. | 5 |
| weight_by | Literal['connectivity', 'distance', 'constant'] | Neighbor weighting scheme. Default is | 'connectivity' |
Returns:
| Type | Description |
|---|---|
| None | Fills NaN entries in |
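When a variable is periodic, a plain average of neighbor values is misleading near the wrap-around point: on the interval from vmin=0 to vmax=1, the values 0.95 and 0.05 are close together, yet their arithmetic mean is 0.5. A common remedy is the circular mean, sketched below. Whether propagate_metadata uses exactly this formula is an assumption, and the helper name `circular_mean` is hypothetical.

```python
import numpy as np


def circular_mean(values, weights, vmin=0.0, vmax=1.0):
    """Weighted mean of periodic values on the interval [vmin, vmax)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    period = vmax - vmin
    # Map values to angles on the unit circle.
    angles = (values - vmin) / period * 2.0 * np.pi
    # Average the weighted unit vectors, then convert the mean angle back.
    s = np.sum(weights * np.sin(angles))
    c = np.sum(weights * np.cos(angles))
    mean_angle = np.arctan2(s, c) % (2.0 * np.pi)
    return vmin + mean_angle / (2.0 * np.pi) * period


near_wrap = circular_mean([0.95, 0.05], [1.0, 1.0])
interior = circular_mean([0.2, 0.4], [1.0, 1.0])
```

Away from the boundary the circular mean agrees with the ordinary mean (0.3 for the second call), while near the boundary it lands at the wrap-around point instead of the middle of the interval.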
Utilities
pool_neighbors
pool_neighbors(adata: AnnData, *, key: str | None = None, key_periodic: bool = False, key_min: float | None = None, key_max: float | None = None, n_neighbors: Optional[int] = None, neighbors_key: str = 'neighbors', weighted: bool = False, directed: bool = True, key_added: Optional[str] = None, copy: bool = False) -> csr_matrix | ndarray | None
Given an adjacency matrix, pool cell features using a weighted sum of feature counts from neighboring cells. The weights are the normalized connectivities from the adjacency matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Annotated data matrix. | required |
| key | str or None | Key in the AnnData object to use for pooling. It can be a key in adata.obs, adata.layers, or adata.obsm. Defaults to None. | None |
| key_periodic | bool | Whether to use periodic boundary conditions for pooling. Only used if key is a key in adata.obs. Defaults to False. | False |
| key_min | float or None | Minimum value of the adata.obs column used for pooling. Only used if key is a key in adata.obs. Defaults to None. Must be provided if | None |
| key_max | float or None | Maximum value of the adata.obs column used for pooling. Only used if key is a key in adata.obs. Defaults to None. Must be provided if | None |
| n_neighbors | int or None | Number of neighbors to consider. Defaults to None. | None |
| neighbors_key | str | Key in the AnnData object to use for neighbors. Defaults to 'neighbors'. | 'neighbors' |
| weighted | bool | Whether to weight neighbors by their connectivities in the adjacency matrix. Defaults to False. | False |
| directed | bool | Whether to use directed or undirected neighbors. Defaults to True. | True |
| key_added | str or None | Key to use in the AnnData object for the pooled features. Defaults to None. | None |
| copy | bool | Whether to return a copy of the pooled features instead of modifying the original AnnData object. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| csr_matrix \| ndarray \| None | The pooled features if copy is True, otherwise None. |
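Pooling with normalized weights amounts to multiplying a row-normalized adjacency matrix by the feature matrix. The sketch below shows that idea on dense arrays. It is not the pool_neighbors implementation: the helper `pool_features` is hypothetical, and whether a cell's own features are included depends on the adjacency matrix that is passed in.

```python
import numpy as np


def pool_features(adjacency, features, weighted=False):
    """Hypothetical sketch: pool features over each cell's neighbors."""
    A = np.asarray(adjacency, dtype=float)
    X = np.asarray(features, dtype=float)
    if not weighted:
        # Unweighted: every neighbor counts equally.
        A = (A > 0).astype(float)
    # Normalize each row so the weights of a cell's neighbors sum to 1.
    row_sums = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    return A @ X


adjacency = np.array([
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
])
features = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [4.0, 4.0],
])
pooled_weighted = pool_features(adjacency, features, weighted=True)
pooled_uniform = pool_features(adjacency, features, weighted=False)
```

With `weighted=True` the first cell leans towards its strongly connected neighbor; with `weighted=False` both of its neighbors contribute equally.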