Preprocessing¶

Low-level preprocessing functions available via sclab.preprocess.

Quality Control¶

qc ¶

qc(adata: AnnData, counts_layer: str = 'counts', min_counts: int = 50, min_genes: int = 5, min_cells: int = 5, max_rank: int = 0)

Compute quality-control metrics and apply initial cell/gene filters.

Temporarily sets adata.X to the counts layer to calculate QC metrics, then restores the original X. Adds a barcode_rank column to adata.obs (rank by descending total counts).

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`counts_layer`	`str`	Layer containing raw counts. Created from `adata.X` if absent. Default is `"counts"`.	`'counts'`
`min_counts`	`int`	Minimum total counts per cell. Cells below this threshold are removed before QC metrics are computed. Default is 50.	`50`
`min_genes`	`int`	Minimum number of genes detected per cell. Default is 5.	`5`
`min_cells`	`int`	Minimum number of cells a gene must be detected in. Default is 5.	`5`
`max_rank`	`int`	If > 0, keep only cells with `barcode_rank < max_rank` (i.e. the top max_rank cells by total counts). Default is 0 (disabled).	`0`

Returns:

Type	Description
`None`	Modifies `adata` in-place. Adds QC columns to `adata.obs` and `adata.var` via `scanpy.pp.calculate_qc_metrics`.
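The `barcode_rank` column and the `max_rank` filter described above can be illustrated with plain NumPy. This is a minimal sketch rather than the sclab implementation: the toy matrix is invented, and starting ranks at 0 is an assumption.

```python
import numpy as np

# Toy counts matrix: 4 cells (rows) x 3 genes (columns). Values are made up.
counts = np.array([
    [5, 0, 1],    # total 6
    [20, 3, 7],   # total 30
    [0, 1, 0],    # total 1
    [9, 9, 0],    # total 18
])

total_counts = counts.sum(axis=1)

# Rank cells by descending total counts; the largest library gets rank 0.
# (Zero-based ranks are an assumption of this sketch.)
order = np.argsort(-total_counts, kind="stable")
barcode_rank = np.empty_like(order)
barcode_rank[order] = np.arange(order.size)

# Emulate max_rank=2: keep only cells with barcode_rank < max_rank.
max_rank = 2
keep = barcode_rank < max_rank
print(barcode_rank.tolist(), keep.tolist())
```

With zero-based ranks, `barcode_rank < max_rank` keeps exactly the top `max_rank` cells, which matches the parameter description.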

Normalization & Transformation¶

preprocess ¶

preprocess(adata: AnnData, counts_layer: str = 'counts', group_by: str | None = None, min_cells: int = 5, min_genes: int = 5, compute_hvg: bool = True, regress_total_counts: bool = False, regress_n_genes: bool = False, normalization_method: Literal['library', 'weighted', 'none'] = 'library', target_scale: float = 10000.0, log1p: bool = True, scale: bool = True)

Normalize, transform, and scale single-cell RNA-seq count data.

Applies a configurable preprocessing pipeline: optional filtering, highly-variable gene selection, normalization, log1p transformation, optional covariate regression, and per-group scaling. The resulting processed matrix is stored in a new named layer whose suffix encodes the applied steps (e.g. counts_normt_log1p_scale).

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`counts_layer`	`str`	Layer containing raw counts. Default is `"counts"`.	`'counts'`
`group_by`	`str or None`	Column in `adata.obs` for per-group HVG selection, normalization, and scaling. When set, batch-aware processing is applied. Default is None.	`None`
`min_cells`	`int`	Minimum number of cells a gene must be detected in to be retained. Default is 5.	`5`
`min_genes`	`int`	Minimum number of genes detected per cell to be retained. Default is 5.	`5`
`compute_hvg`	`bool`	If True, compute highly variable genes (union of Seurat and Seurat v3 selections) and store the result in `adata.var["highly_variable"]`. Default is True.	`True`
`regress_total_counts`	`bool`	If True, regress out total counts (or log1p total counts if `log1p=True`) per cell. Default is False.	`False`
`regress_n_genes`	`bool`	If True, regress out the number of detected genes per cell. Default is False.	`False`
`normalization_method`	`Literal['library', 'weighted', 'none']`	Normalization strategy. `"library"` applies library-size normalization to `target_scale` counts; `"weighted"` applies entropy-weighted normalization; `"none"` skips normalization. Default is `"library"`.	`'library'`
`target_scale`	`float`	Target sum for library-size normalization (counts per cell after normalization). Default is 1e4.	`10000.0`
`log1p`	`bool`	If True, apply log(x + 1) transformation after normalization. Default is True.	`True`
`scale`	`bool`	If True, scale each gene to unit variance (zero-center disabled). Applied per group when `group_by` is set. Default is True.	`True`

Returns:

Type	Description
`None`	Modifies `adata` in-place. Stores the processed matrix in a new layer and updates `adata.X`.
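The core numeric steps of the default pipeline (library-size normalization, log1p, and scaling to unit variance without centering) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the sclab code: the toy matrix is invented, filtering, HVG selection, and regression are omitted, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

# Toy counts: 3 cells x 3 genes. Values are made up.
counts = np.array([
    [1.0, 3.0, 0.0],
    [2.0, 2.0, 4.0],
    [0.0, 5.0, 5.0],
])
target_scale = 1e4

# Library-size normalization: rescale each cell to sum to target_scale.
lib_size = counts.sum(axis=1, keepdims=True)
normalized = counts / lib_size * target_scale

# log(x + 1) transformation.
logged = np.log1p(normalized)

# Scale each gene to unit variance without subtracting the mean,
# so zeros stay zero. ddof=1 is an assumption of this sketch.
std = logged.std(axis=0, ddof=1)
scaled = logged / std
```

Skipping the centering step keeps zero entries at zero, which preserves sparsity in the scaled matrix.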

normalize_weighted ¶

normalize_weighted(adata: AnnData, target_scale: float | None = None, batch_key: str | None = None) -> None

Normalize counts using entropy-weighted library-size normalization.

Each gene's contribution to a cell's library size is weighted by the information entropy of that gene's count distribution across cells. This up-weights ubiquitously expressed genes in the library-size calculation, so that normalization is driven primarily by housekeeping genes rather than by informative ones. When batch_key is provided, normalization is applied independently within each batch so that cross-batch count differences do not confound the weights.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. `adata.X` must be a sparse count matrix. Modified in-place.	required
`target_scale`	`float or None`	Target library size after normalization. If None, 1e4 is used. Default is None.	`None`
`batch_key`	`str or None`	Column in `adata.obs` identifying batches. Normalization is applied independently per batch when set. Default is None.	`None`

Returns:

Type	Description
`None`	Updates `adata.X` in-place with the normalized count matrix.
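The idea of entropy weighting can be sketched as below. The exact formula used by `normalize_weighted` is not given in this reference, so everything here is an assumption made for illustration: weights are the Shannon entropy of each gene's count distribution across cells, divided by `log(n_cells)`, and the weighted library size is the weighted sum of a cell's counts.

```python
import numpy as np

# Toy counts: 4 cells x 3 genes. Gene 0 is uniform across cells,
# genes 1 and 2 are each expressed in a single cell.
counts = np.array([
    [4.0, 0.0, 10.0],
    [4.0, 0.0, 0.0],
    [4.0, 12.0, 0.0],
    [4.0, 0.0, 0.0],
])
n_cells = counts.shape[0]
target_scale = 1e4

# Distribution of each gene's counts across cells (columns sum to 1).
p = counts / counts.sum(axis=0, keepdims=True)

# Shannon entropy per gene, normalized to [0, 1] (assumed weighting).
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
weights = -plogp.sum(axis=0) / np.log(n_cells)

# Weighted library size: ubiquitous genes dominate the sum.
weighted_lib_size = counts @ weights
normalized = counts / weighted_lib_size[:, None] * target_scale
```

In this toy case the uniformly expressed gene receives weight 1 and the two cell-specific genes receive weight 0, so the highly expressed marker genes do not inflate the library size of the cells that carry them.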

Dimensionality Reduction¶

pca ¶

pca(adata: AnnData, layer: str | None = None, n_comps: int = 30, mask_var: str | None = None, batch_key: str | None = None, reference_batch: str | None = None, zero_center: bool = False)

Compute principal components and project all cells onto the PCA space.

When reference_batch is provided, PCA is fitted on the reference batch only and all cells are projected onto those principal components. This prevents the PC axes from being dominated by batch effects.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`layer`	`str or None`	Layer to use as input to PCA. Uses `adata.X` if None. Default is None.	`None`
`n_comps`	`int`	Number of principal components to compute. Default is 30.	`30`
`mask_var`	`str or None`	Boolean column in `adata.var` used to select a gene subset for PCA (e.g. `"highly_variable"`). Default is None (use all genes).	`None`
`batch_key`	`str or None`	Column in `adata.obs` identifying batches. Required when `reference_batch` is set. Default is None.	`None`
`reference_batch`	`str or None`	Batch value in `adata.obs[batch_key]` to use for fitting the PCA model. All cells are then projected onto the reference PCs. Default is None (fit PCA on all cells).	`None`
`zero_center`	`bool`	If True, subtract the mean of the PC coordinates so that the embedding is centred at the origin. Default is False.	`False`

Returns:

Type	Description
`None`	Modifies `adata` in-place, storing results in `adata.obsm["X_pca"]` (PC coordinates for all cells), `adata.varm["PCs"]` (loadings matrix), and `adata.uns["pca"]` (variance and variance-ratio arrays, plus `"reference_batch"` when fitted on a reference batch).
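Fitting on a reference batch and projecting all cells, as described above, can be sketched with an SVD in NumPy. This is a minimal illustration, not the sclab implementation: the random data and batch labels are invented, and centering every cell on the reference-batch mean is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                      # 60 cells x 8 genes (synthetic)
batch = np.array(["ref"] * 40 + ["query"] * 20)   # hypothetical batch labels
n_comps = 3

# Fit the principal axes on the reference batch only.
X_ref = X[batch == "ref"]
mean_ref = X_ref.mean(axis=0)
_, s, Vt = np.linalg.svd(X_ref - mean_ref, full_matrices=False)
components = Vt[:n_comps]                         # shape (n_comps, n_genes)

# Project every cell, reference and query alike, onto those axes.
X_pca = (X - mean_ref) @ components.T

# Variance explained by each component within the reference batch.
variance = s[:n_comps] ** 2 / (X_ref.shape[0] - 1)
```

Because the axes come from the reference cells alone, variation that exists only between batches cannot define a principal component.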

Batch Integration¶

cca_integrate ¶

cca_integrate(adata: AnnData, key: str, *, basis: str = 'X', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, mask_var: str | None = None, n_components: int = 30, svd_solver: str = 'randomized', normalize: bool = True, random_state: int | None = None)

harmony_integrate ¶

harmony_integrate(adata: AnnData, key: str | Sequence[str], *, basis: str = 'X_pca', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, **kwargs)

Integrate batch embeddings using Harmony.

Runs the Harmony algorithm on a cell embedding (default X_pca) to remove batch effects. The corrected embedding is stored in a new obsm key.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`key`	`str or sequence of str`	Column(s) in `adata.obs` identifying batches to correct for.	required
`basis`	`str`	Key in `adata.obsm` containing the input embedding. Default is `"X_pca"`.	`'X_pca'`
`adjusted_basis`	`str or None`	Key in `adata.obsm` where the corrected embedding is stored. If None, defaults to `"{basis}_harmony"`. Default is None.	`None`
`reference_batch`	`str or list of str or None`	Batch value(s) to use as reference. Reference cells are kept fixed during Harmony correction. Default is None.	`None`
`**kwargs`		Additional keyword arguments forwarded to `sclab.preprocess._harmony.run_harmony`.	`{}`

Returns:

Type	Description
`None`	Stores the corrected embedding in `adata.obsm[adjusted_basis]`.

References

Korsunsky et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16, 1289–1296. https://doi.org/10.1038/s41592-019-0619-0

Filtering¶

filter_obs ¶

filter_obs(adata: AnnData, *, layer: str | None = None, min_counts: int | None = None, min_genes: int | None = None, max_counts: int | None = None, max_cells: int | None = None) -> None

Filter observations (cells) based on count and gene-detection thresholds.

All filtering criteria are applied simultaneously; cells that fail any active criterion are removed. Only criteria that are not None are applied.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`layer`	`str or None`	Layer to use for count computations. Uses `adata.X` if None. Default is None.	`None`
`min_counts`	`int or None`	Minimum total counts per cell. Cells with fewer counts are removed.	`None`
`min_genes`	`int or None`	Minimum number of genes detected (count > 0) per cell.	`None`
`max_counts`	`int or None`	Maximum total counts per cell. Cells with more counts are removed.	`None`
`max_cells`	`int or None`	Maximum number of cells to retain. When set, only the top `max_cells` cells by total counts are kept.	`None`

Returns:

Type	Description
`None`	Modifies `adata` in-place by subsetting observations.
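The simultaneous application of the criteria can be pictured as a single boolean mask. The sketch below uses NumPy on an invented matrix; it assumes that the `max_cells` rank is computed over all cells before any other criterion is applied, which this reference does not confirm.

```python
import numpy as np

# Toy counts: 5 cells x 3 genes. Values are made up.
counts = np.array([
    [5, 0, 1],      # total 6,   2 genes detected
    [20, 3, 7],     # total 30,  3 genes detected
    [0, 1, 0],      # total 1,   1 gene detected
    [9, 9, 0],      # total 18,  2 genes detected
    [50, 40, 30],   # total 120, 3 genes detected
])
min_counts, min_genes, max_counts, max_cells = 5, 2, 100, 3

total_counts = counts.sum(axis=1)
n_genes = (counts > 0).sum(axis=1)

# Rank 0 is the cell with the most counts (rank over all cells: assumption).
order = np.argsort(-total_counts, kind="stable")
rank = np.empty_like(order)
rank[order] = np.arange(order.size)

# A cell is kept only if it passes every active criterion.
keep = (
    (total_counts >= min_counts)
    & (n_genes >= min_genes)
    & (total_counts <= max_counts)
    & (rank < max_cells)
)
filtered = counts[keep]
```

Note that under this assumption fewer than `max_cells` cells can remain, because a top-ranked cell may still fail another threshold (here the cell with 120 counts exceeds `max_counts`).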

subset_obs ¶

subset_obs(adata: AnnData, subset: Index | Sequence[str | int | bool] | str) -> None

Subset observations (rows) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of observations based on the provided subset parameter. The subsetting can be done using observation names, integer indices, a boolean mask, a query string, or a pandas Index.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	The annotated data matrix to subset. Will be modified in-place.	required
`subset`	`Index \| Sequence[str \| int \| bool] \| str`	The subset specification. One of: a pandas Index containing observation names; a sequence of observation names (strings); a sequence of integer indices; a boolean mask of length `adata.n_obs`; or a query string that matches observations by their metadata columns.	required

Examples:

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> from sclab.preprocess import subset_obs
>>>
>>> obs = pd.DataFrame(
...     index=['A', 'B', 'C'],
...     data={'cell_type': ['type1', 'type2', 'type2']})
>>> adata_ = anndata.AnnData(obs=obs)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_obs(adata, pd.Index(['B', 'C']))
>>> adata.obs_names.tolist()
['B', 'C']
>>>
>>> # Subset using observation names
>>> adata = adata_.copy()
>>> subset_obs(adata, ['A', 'B'])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_obs(adata, [0, 1])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_obs(adata, [True, False, True])
>>> adata.obs_names.tolist()
['A', 'C']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_obs(adata, 'cell_type == "type2"')
>>> adata.obs_names.tolist()
['B', 'C']

Notes

The function modifies the AnnData object in-place
When using a boolean mask, its length must match the number of observations
When using integer indices, they must be valid indices for the observations
Invalid observation names or indices will raise KeyError or IndexError respectively
The order of observations in the output will match the order in the subset parameter

subset_var ¶

subset_var(adata: AnnData, subset: Index | Sequence[str | int | bool] | str) -> None

Subset variables (columns) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of variables based on the provided subset parameter. The subsetting can be done using variable names, integer indices, a boolean mask, a query string, or a pandas Index.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	The annotated data matrix to subset. Will be modified in-place.	required
`subset`	`Index \| Sequence[str \| int \| bool] \| str`	The subset specification. One of: a pandas Index containing variable names; a sequence of variable names (strings); a sequence of integer indices; a boolean mask of length `adata.n_vars`; or a query string that matches variables by their metadata columns.	required

Examples:

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> from sclab.preprocess import subset_var
>>>
>>> var = pd.DataFrame(
...     index=['gene1', 'gene2', 'gene3'],
...     data={'gene_type': ['type1', 'type2', 'type1']})
>>> adata_ = anndata.AnnData(var=var)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_var(adata, pd.Index(['gene2', 'gene3']))
>>> adata.var_names.tolist()
['gene2', 'gene3']
>>>
>>> # Subset using variable names
>>> adata = adata_.copy()
>>> subset_var(adata, ['gene1', 'gene2'])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_var(adata, [0, 1])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_var(adata, [True, False, True])
>>> adata.var_names.tolist()
['gene1', 'gene3']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_var(adata, 'gene_type == "type1"')
>>> adata.var_names.tolist()
['gene1', 'gene3']

Notes

The function modifies the AnnData object in-place
When using a boolean mask, its length must match the number of variables
When using integer indices, they must be valid indices for the variables
Invalid variable names or indices will raise KeyError or IndexError respectively
The order of variables in the output will match the order in the subset parameter

Metadata¶

transfer_metadata ¶

transfer_metadata(adata: AnnData, group_key: str, source_group: str, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')

Transfer a metadata column from a source group to the rest of the cells.

Uses the k-nearest-neighbor graph (adata.obsp["connectivities"] and adata.obsp["distances"]) to propagate values from labeled cells (source_group) to unlabeled cells. Results are stored as new columns with the transferred_ prefix.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix with a computed neighbor graph. Modified in-place.	required
`group_key`	`str`	Column in `adata.obs` identifying the groups (e.g. `"batch"`).	required
`source_group`	`str`	Value in `adata.obs[group_key]` whose cells serve as the labeled source. Cells in all other groups receive transferred values.	required
`column`	`str`	Column in `adata.obs` containing the values to transfer (numeric, categorical, or boolean).	required
`periodic`	`bool`	If True, treat `column` as a periodic variable (e.g. cell-cycle phase in [vmin, vmax]). Default is False.	`False`
`vmin`	`float`	Minimum value for periodic wrapping. Default is 0.	`0`
`vmax`	`float`	Maximum value for periodic wrapping. Default is 1.	`1`
`min_neighs`	`int`	Minimum number of labeled neighbors required to assign a value. Cells with fewer labeled neighbors are left as NaN. Default is 5.	`5`
`weight_by`	`Literal['connectivity', 'distance', 'constant']`	How to weight neighbors when aggregating values. `"connectivity"` uses the connectivity matrix; `"distance"` uses inverse distances; `"constant"` gives equal weight to all neighbors. Default is `"connectivity"`.	`'connectivity'`

Returns:

Type	Description
`None`	Adds `transferred_{column}` and `transferred_{column}_error` (or `transferred_{column}_proportion` for categorical columns) to `adata.obs`.
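For a numeric, non-periodic column, the transfer step amounts to a weighted average over labeled neighbors, gated by `min_neighs`. The NumPy sketch below illustrates that idea on an invented five-cell graph; it is not the sclab implementation, and the handling of source cells in the output (left as NaN here) is an assumption.

```python
import numpy as np

# Symmetric connectivity matrix for 5 cells (0 means "not neighbors").
conn = np.array([
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
])
# Cells 0 and 1 are the labeled source group; the rest are unlabeled.
values = np.array([1.0, 3.0, np.nan, np.nan, np.nan])
is_source = ~np.isnan(values)
min_neighs = 2

# Keep only edges that point at labeled cells.
w = conn * is_source[None, :]
n_labeled = (w > 0).sum(axis=1)

# Connectivity-weighted mean of the labeled neighbors' values.
source_values = np.where(is_source, values, 0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    estimate = (w @ source_values) / w.sum(axis=1)

# Assign only to unlabeled cells with enough labeled neighbors.
assign = ~is_source & (n_labeled >= min_neighs)
transferred = np.where(assign, estimate, np.nan)
```

Cell 2 has two labeled neighbors with equal connectivity, so it receives their average; cells 3 and 4 have no labeled neighbors and stay NaN.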

propagate_metadata ¶

propagate_metadata(adata: AnnData, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')

Fill missing values in a metadata column by propagation through the neighbor graph.

Cells that already have a value in column are used as anchors; NaN cells receive an estimated value from their labeled neighbors. Useful for imputing partially annotated metadata (e.g. pseudotime or cell-type labels) based on the k-NN graph structure.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix with a computed neighbor graph. Modified in-place.	required
`column`	`str`	Column in `adata.obs` with partial values (NaNs to be filled).	required
`periodic`	`bool`	If True, treat the variable as periodic (circular). Default is False.	`False`
`vmin`	`float`	Minimum value for periodic wrapping. Default is 0.	`0`
`vmax`	`float`	Maximum value for periodic wrapping. Default is 1.	`1`
`min_neighs`	`int`	Minimum number of labeled neighbors required to assign a value. Default is 5.	`5`
`weight_by`	`Literal['connectivity', 'distance', 'constant']`	Neighbor weighting scheme. Default is `"connectivity"`.	`'connectivity'`

Returns:

Type	Description
`None`	Fills NaN entries in `adata.obs[column]` in-place and adds an error/proportion column (`{column}_error` or `{column}_proportion`).
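When `periodic=True`, a plain average of neighbor values is misleading near the wrap-around point. One common way to average periodic values is the circular mean, sketched below; whether sclab uses this exact estimator is an assumption, and `circular_weighted_mean` is a hypothetical helper written for this illustration.

```python
import numpy as np

def circular_weighted_mean(values, weights, vmin=0.0, vmax=1.0):
    """Weighted mean of values that wrap around on [vmin, vmax)."""
    period = vmax - vmin
    # Map values to angles on the unit circle.
    angles = (np.asarray(values, dtype=float) - vmin) / period * 2 * np.pi
    weights = np.asarray(weights, dtype=float)
    s = np.sum(weights * np.sin(angles))
    c = np.sum(weights * np.cos(angles))
    # Angle of the weighted resultant vector, mapped back to [vmin, vmax).
    mean_angle = np.arctan2(s, c) % (2 * np.pi)
    return vmin + mean_angle / (2 * np.pi) * period

# Two neighbors on either side of the wrap-around point.
neighbor_values = [0.9, 0.0]
naive = np.mean(neighbor_values)
wrapped = circular_weighted_mean(neighbor_values, [1.0, 1.0])
print(naive, wrapped)
```

The naive mean lands at 0.45, on the opposite side of the cycle, while the circular mean lands at 0.95, between the two neighbors.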

Utilities¶

pool_neighbors ¶

pool_neighbors(adata: AnnData, *, key: str | None = None, key_periodic: bool = False, key_min: float | None = None, key_max: float | None = None, n_neighbors: Optional[int] = None, neighbors_key: str = 'neighbors', weighted: bool = False, directed: bool = True, key_added: Optional[str] = None, copy: bool = False) -> csr_matrix | ndarray | None

Given an adjacency matrix, pool cell features using a weighted sum of feature counts from neighboring cells. The weights are the normalized connectivities from the adjacency matrix.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix.	required
`key`	`str`	Key in AnnData object to use for pooling. It can be a key in adata.obs, adata.layers, or adata.obsm. Defaults to None.	`None`
`key_periodic`	`bool`	Whether to use periodic boundary conditions for pooling. It is only used if key is a key in adata.obs. Defaults to False.	`False`
`key_min`	`float`	Minimum value for column in adata.obs to use for pooling. It is only used if key is a key in adata.obs. Defaults to None. Must be provided if `key_periodic` is True.	`None`
`key_max`	`float`	Maximum value for column in adata.obs to use for pooling. It is only used if key is a key in adata.obs. Defaults to None. Must be provided if `key_periodic` is True.	`None`
`n_neighbors`	`int`	Number of neighbors to consider. Defaults to None.	`None`
`neighbors_key`	`str`	Key in AnnData object to use for neighbors. Defaults to `"neighbors"`.	`'neighbors'`
`weighted`	`bool`	Whether to weight neighbors by their connectivities in the adjacency matrix. Defaults to False.	`False`
`directed`	`bool`	Whether to use directed or undirected neighbors. Defaults to True.	`True`
`key_added`	`str`	Key to use in AnnData object for the pooled features. Defaults to None.	`None`
`copy`	`bool`	Whether to return a copy of the pooled features instead of modifying the original AnnData object. Defaults to False.	`False`

Returns:

Type	Description
`csr_matrix \| ndarray \| None`	The pooled features if copy is True, otherwise None.
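The pooling operation can be pictured as a matrix product between a normalized adjacency matrix and the feature matrix. The NumPy sketch below illustrates this on an invented three-cell graph. It assumes that "normalized" means each row of the adjacency matrix sums to 1 and that a cell's own features are not included; neither detail is confirmed by this reference.

```python
import numpy as np

# Adjacency (connectivities) for 3 cells: cell 0 neighbors cells 1 and 2.
adjacency = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
# Feature matrix: 3 cells x 2 features. Values are made up.
features = np.array([
    [10.0, 0.0],
    [0.0, 4.0],
    [2.0, 2.0],
])

# Row-normalize so each cell's neighbor weights sum to 1 (assumption).
weights = adjacency / adjacency.sum(axis=1, keepdims=True)

# Each pooled row is the weighted sum of the neighbors' feature rows.
pooled = weights @ features
```

With sparse connectivities, the same product can be computed on a `csr_matrix` without densifying the graph.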