
Preprocessing

Low-level preprocessing functions available via sclab.preprocess.

Quality Control

qc

qc(adata: AnnData, counts_layer: str = 'counts', min_counts: int = 50, min_genes: int = 5, min_cells: int = 5, max_rank: int = 0)

Compute quality-control metrics and apply initial cell/gene filters.

Temporarily sets adata.X to the counts layer to calculate QC metrics, then restores the original X. Adds a barcode_rank column to adata.obs (rank by descending total counts).

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Modified in-place.

required
counts_layer str

Layer containing raw counts. Created from adata.X if absent. Default is "counts".

'counts'
min_counts int

Minimum total counts per cell. Cells below this threshold are removed before QC metrics are computed. Default is 50.

50
min_genes int

Minimum number of genes detected per cell. Default is 5.

5
min_cells int

Minimum number of cells a gene must be detected in. Default is 5.

5
max_rank int

If > 0, keep only cells with barcode_rank < max_rank (i.e. the top max_rank cells by total counts). Default is 0 (disabled).

0

Returns:

Type Description
None

Modifies adata in-place. Adds QC columns to adata.obs and adata.var via scanpy.pp.calculate_qc_metrics.
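
The barcode_rank column and the max_rank filter described above can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: the toy matrix is made up, and the zero-based rank is an assumption inferred from the barcode_rank < max_rank rule.

```python
import numpy as np

# Toy count matrix: 5 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 1],   # total 16
    [ 0, 0, 1, 0],   # total 1
    [30, 2, 8, 4],   # total 44
    [ 3, 3, 3, 3],   # total 12
    [ 7, 0, 0, 0],   # total 7
])

total_counts = counts.sum(axis=1)

# Rank cells by descending total counts (assumed zero-based: largest cell gets rank 0).
order = np.argsort(-total_counts, kind="stable")
barcode_rank = np.empty_like(order)
barcode_rank[order] = np.arange(order.size)

# Keep the top `max_rank` cells, mirroring the `barcode_rank < max_rank` rule.
max_rank = 3
keep = barcode_rank < max_rank
```

With a zero-based rank, max_rank = 3 keeps exactly the three cells with the highest total counts.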


Normalization & Transformation

preprocess

preprocess(adata: AnnData, counts_layer: str = 'counts', group_by: str | None = None, min_cells: int = 5, min_genes: int = 5, compute_hvg: bool = True, regress_total_counts: bool = False, regress_n_genes: bool = False, normalization_method: Literal['library', 'weighted', 'none'] = 'library', target_scale: float = 10000.0, log1p: bool = True, scale: bool = True)

Normalize, transform, and scale single-cell RNA-seq count data.

Applies a configurable preprocessing pipeline: optional filtering, highly-variable gene selection, normalization, log1p transformation, optional covariate regression, and per-group scaling. The resulting processed matrix is stored in a new named layer whose suffix encodes the applied steps (e.g. counts_normt_log1p_scale).

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Modified in-place.

required
counts_layer str

Layer containing raw counts. Default is "counts".

'counts'
group_by str or None

Column in adata.obs for per-group HVG selection, normalization, and scaling. When set, batch-aware processing is applied. Default is None.

None
min_cells int

Minimum number of cells a gene must be detected in to be retained. Default is 5.

5
min_genes int

Minimum number of genes detected per cell to be retained. Default is 5.

5
compute_hvg bool

If True, compute highly variable genes (union of Seurat and Seurat v3 selections) and store the result in adata.var["highly_variable"]. Default is True.

True
regress_total_counts bool

If True, regress out total counts (or log1p total counts if log1p=True) per cell. Default is False.

False
regress_n_genes bool

If True, regress out the number of detected genes per cell. Default is False.

False
normalization_method Literal['library', 'weighted', 'none']

Normalization strategy. "library" applies library-size normalization to target_scale counts; "weighted" applies entropy-weighted normalization; "none" skips normalization. Default is "library".

"library"
target_scale float

Target sum for library-size normalization (counts per cell after normalization). Default is 1e4.

10000.0
log1p bool

If True, apply log(x + 1) transformation after normalization. Default is True.

True
scale bool

If True, scale each gene to unit variance (zero-center disabled). Applied per group when group_by is set. Default is True.

True

Returns:

Type Description
None

Modifies adata in-place. Stores the processed matrix in a new layer and updates adata.X.
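
To make the default pipeline concrete, the sketch below applies library-size normalization, log1p, and unit-variance scaling without centering to a small dense matrix. It uses NumPy only and is an illustration under stated assumptions: the toy data is invented, and the use of the sample standard deviation (ddof=1) is a guess rather than something the documentation specifies.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 3 genes (columns).
counts = np.array([
    [4.0, 0.0, 6.0],
    [1.0, 1.0, 2.0],
    [0.0, 5.0, 5.0],
])
target_scale = 1e4

# Library-size normalization: rescale each cell so its counts sum to target_scale.
library_size = counts.sum(axis=1, keepdims=True)
normalized = counts / library_size * target_scale

# log(x + 1) transformation.
logged = np.log1p(normalized)

# Scale each gene to unit variance without subtracting the mean,
# so zero counts remain exactly zero (assumption: sample std, ddof=1).
gene_std = logged.std(axis=0, ddof=1)
scaled = logged / gene_std
```

Skipping the centering step is what keeps zeros at zero, which matters when the underlying matrix is sparse.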


normalize_weighted

normalize_weighted(adata: AnnData, target_scale: float | None = None, batch_key: str | None = None) -> None

Normalize counts using entropy-weighted library-size normalization.

Each gene's contribution to each cell's library size is weighted by the information entropy of that gene's count distribution across cells. This up-weights ubiquitously expressed genes in the library-size calculation, so that normalization is driven primarily by housekeeping genes rather than informative ones. When batch_key is provided, normalization is applied independently within each batch so that cross-batch count differences do not confound the weights.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. adata.X must be a sparse count matrix. Modified in-place.

required
target_scale float or None

Target library size after normalization. If None, 1e4 is used. Default is None.

None
batch_key str or None

Column in adata.obs identifying batches. Normalization is applied independently per batch when set. Default is None.

None

Returns:

Type Description
None

Updates adata.X in-place with the normalized count matrix.
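
One possible reading of entropy weighting is sketched below with NumPy. The exact formula used by normalize_weighted is not given in this reference, so everything here is an assumption: each gene's weight is taken to be the Shannon entropy of its count distribution across cells, divided by log(n_cells) so that it lies in [0, 1], and the weighted library size is the weighted sum of counts per cell.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 3 genes (columns).
# Gene 0 is uniform across cells, gene 1 is confined to one cell, gene 2 is skewed.
counts = np.array([
    [5.0, 0.0, 9.0],
    [5.0, 0.0, 1.0],
    [5.0, 8.0, 0.0],
    [5.0, 0.0, 0.0],
])
n_cells = counts.shape[0]
target_scale = 1e4

# Per-gene distribution of counts across cells.
p = counts / counts.sum(axis=0, keepdims=True)

# Shannon entropy per gene, treating 0 * log(0) as 0.
logp = np.log(p, where=p > 0, out=np.zeros_like(p))
entropy = -(p * logp).sum(axis=0)

# Assumed weighting: entropy normalized by its maximum, log(n_cells).
weights = entropy / np.log(n_cells)

# Weighted library size per cell, then rescale to target_scale.
weighted_library = counts @ weights
normalized = counts / weighted_library[:, None] * target_scale
```

Under this assumed weighting, the uniformly expressed gene gets weight 1 and the gene seen in a single cell gets weight 0, so the latter does not influence the scaling factor at all.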


Dimensionality Reduction

pca

pca(adata: AnnData, layer: str | None = None, n_comps: int = 30, mask_var: str | None = None, batch_key: str | None = None, reference_batch: str | None = None, zero_center: bool = False)

Compute principal components and project all cells onto the PCA space.

When reference_batch is provided, PCA is fitted on the reference batch only and all cells are projected onto those principal components. This prevents the PC axes from being dominated by batch effects.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Modified in-place.

required
layer str or None

Layer to use as input to PCA. Uses adata.X if None. Default is None.

None
n_comps int

Number of principal components to compute. Default is 30.

30
mask_var str or None

Boolean column in adata.var used to select a gene subset for PCA (e.g. "highly_variable"). Default is None (use all genes).

None
batch_key str or None

Column in adata.obs identifying batches. Required when reference_batch is set. Default is None.

None
reference_batch str or None

Batch value in adata.obs[batch_key] to use for fitting the PCA model. All cells are then projected onto the reference PCs. Default is None (fit PCA on all cells).

None
zero_center bool

If True, subtract the mean of the PC coordinates so that the embedding is centred at the origin. Default is False.

False

Returns:

Type Description
None

Modifies adata in-place, storing results in:

  • adata.obsm["X_pca"] — PC coordinates for all cells.
  • adata.varm["PCs"] — loadings matrix.
  • adata.uns["pca"] — variance and variance-ratio arrays; also stores "reference_batch" when fitted on a reference batch.
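
The reference-batch behaviour can be sketched with an SVD in NumPy: fit the components on reference cells only, then project every cell onto them. This is an illustration rather than the library's code. The random data, the batch labels, and the choice to center on the reference mean are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                      # 60 cells x 8 genes
batch = np.array(["ref"] * 40 + ["query"] * 20)   # hypothetical batch labels
n_comps = 3

# Fit on the reference batch only.
X_ref = X[batch == "ref"]
ref_mean = X_ref.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_ref - ref_mean, full_matrices=False)
components = Vt[:n_comps]                         # shape (n_comps, n_genes)

# Project every cell, reference and query alike, onto the reference PCs.
X_pca = (X - ref_mean) @ components.T

# Variance captured by each retained PC within the reference batch.
variance = singular_values[:n_comps] ** 2 / (X_ref.shape[0] - 1)
```

Because the axes are estimated from one batch, variation that exists only between batches cannot define a principal component on its own.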

Batch Integration

cca_integrate

cca_integrate(adata: AnnData, key: str, *, basis: str = 'X', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, mask_var: str | None = None, n_components: int = 30, svd_solver: str = 'randomized', normalize: bool = True, random_state: int | None = None)

harmony_integrate

harmony_integrate(adata: AnnData, key: str | Sequence[str], *, basis: str = 'X_pca', adjusted_basis: str | None = None, reference_batch: str | list[str] | None = None, **kwargs)

Integrate batch embeddings using Harmony.

Runs the Harmony algorithm on a cell embedding (default X_pca) to remove batch effects. The corrected embedding is stored in a new obsm key.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Modified in-place.

required
key str or sequence of str

Column(s) in adata.obs identifying batches to correct for.

required
basis str

Key in adata.obsm containing the input embedding. Default is "X_pca".

'X_pca'
adjusted_basis str or None

Key in adata.obsm where the corrected embedding is stored. If None, defaults to "{basis}_harmony". Default is None.

None
reference_batch str or list of str or None

Batch value(s) to use as reference. Reference cells are kept fixed during Harmony correction. Default is None.

None
**kwargs

Additional keyword arguments forwarded to sclab.preprocess._harmony.run_harmony.

{}

Returns:

Type Description
None

Stores the corrected embedding in adata.obsm[adjusted_basis].

References

Korsunsky et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16, 1289–1296. https://doi.org/10.1038/s41592-019-0619-0


Filtering

filter_obs

filter_obs(adata: AnnData, *, layer: str | None = None, min_counts: int | None = None, min_genes: int | None = None, max_counts: int | None = None, max_cells: int | None = None) -> None

Filter observations (cells) based on count and gene-detection thresholds.

All filtering criteria are applied simultaneously; cells that fail any active criterion are removed. Only criteria that are not None are applied.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Modified in-place.

required
layer str or None

Layer to use for count computations. Uses adata.X if None. Default is None.

None
min_counts int or None

Minimum total counts per cell. Cells with fewer counts are removed.

None
min_genes int or None

Minimum number of genes detected (count > 0) per cell.

None
max_counts int or None

Maximum total counts per cell. Cells with more counts are removed.

None
max_cells int or None

Maximum number of cells to retain, keeping those with the highest total counts (i.e. keep the top max_cells cells by total counts).

None

Returns:

Type Description
None

Modifies adata in-place by subsetting observations.
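
The way the active criteria combine into a single mask can be sketched with NumPy. This is a hedged illustration on an invented matrix: inclusive thresholds follow the wording above ("fewer" and "more" are removed), while applying max_cells as one more mask on the unfiltered totals is an assumption.

```python
import numpy as np

# Toy count matrix: 5 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 1],   # 16 counts, 3 genes
    [ 0, 0, 1, 0],   # 1 count, 1 gene
    [30, 2, 8, 4],   # 44 counts, 4 genes
    [ 3, 3, 3, 3],   # 12 counts, 4 genes
    [ 7, 0, 0, 0],   # 7 counts, 1 gene
])
min_counts, min_genes, max_counts, max_cells = 5, 2, 40, None

total_counts = counts.sum(axis=1)
n_genes = (counts > 0).sum(axis=1)

# Start from "keep everything" and AND in each criterion that is not None.
keep = np.ones(counts.shape[0], dtype=bool)
if min_counts is not None:
    keep &= total_counts >= min_counts
if min_genes is not None:
    keep &= n_genes >= min_genes
if max_counts is not None:
    keep &= total_counts <= max_counts
if max_cells is not None:
    # Assumption: the top cells are chosen from the unfiltered totals.
    top = np.argsort(-total_counts, kind="stable")[:max_cells]
    keep &= np.isin(np.arange(counts.shape[0]), top)

filtered = counts[keep]
```

Only the first and fourth cells pass all three active thresholds here; the third is dropped by max_counts alone.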


subset_obs

subset_obs(adata: AnnData, subset: Index | Sequence[str | int | bool] | str) -> None

Subset observations (rows) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of observations based on the provided subset parameter. The subsetting can be done using observation names, integer indices, a boolean mask, a query string, or a pandas Index.

Parameters:

Name Type Description Default
adata AnnData

The annotated data matrix to subset. Will be modified in-place.

required
subset Index | Sequence[str | int | bool] | str

The subset specification. Can be one of:

  • A pandas Index containing observation names
  • A sequence of observation names (strings)
  • A sequence of integer indices
  • A boolean mask of length adata.n_obs
  • A query string to match observations by their metadata columns

required

Examples:

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from sclab.preprocess import subset_obs
>>>
>>> obs = pd.DataFrame(
...     index=['A', 'B', 'C'],
...     data={'cell_type': ['type1', 'type2', 'type2']})
>>> adata_ = anndata.AnnData(obs=obs)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_obs(adata, pd.Index(['B', 'C']))
>>> adata.obs_names.tolist()
['B', 'C']
>>>
>>> # Subset using observation names
>>> adata = adata_.copy()
>>> subset_obs(adata, ['A', 'B'])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_obs(adata, [0, 1])
>>> adata.obs_names.tolist()
['A', 'B']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_obs(adata, [True, False, True])
>>> adata.obs_names.tolist()
['A', 'C']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_obs(adata, 'cell_type == "type2"')
>>> adata.obs_names.tolist()
['B', 'C']

Notes

  • The function modifies the AnnData object in-place
  • When using a boolean mask, its length must match the number of observations
  • When using integer indices, they must be valid indices for the observations
  • Invalid observation names or indices will raise KeyError or IndexError respectively
  • The order of observations in the output will match the order in the subset parameter

subset_var

subset_var(adata: AnnData, subset: Index | Sequence[str | int | bool] | str) -> None

Subset variables (columns) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of variables based on the provided subset parameter. The subsetting can be done using variable names, integer indices, a boolean mask, a query string, or a pandas Index.

Parameters:

Name Type Description Default
adata AnnData

The annotated data matrix to subset. Will be modified in-place.

required
subset Index | Sequence[str | int | bool] | str

The subset specification. Can be one of:

  • A pandas Index containing variable names
  • A sequence of variable names (strings)
  • A sequence of integer indices
  • A boolean mask of length adata.n_vars
  • A query string to match variables by their metadata columns

required

Examples:

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from sclab.preprocess import subset_var
>>>
>>> var = pd.DataFrame(
...     index=['gene1', 'gene2', 'gene3'],
...     data={'gene_type': ['type1', 'type2', 'type1']})
>>> adata_ = anndata.AnnData(var=var)
>>>
>>> # Subset using pandas Index
>>> adata = adata_.copy()
>>> subset_var(adata, pd.Index(['gene2', 'gene3']))
>>> adata.var_names.tolist()
['gene2', 'gene3']
>>>
>>> # Subset using variable names
>>> adata = adata_.copy()
>>> subset_var(adata, ['gene1', 'gene2'])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using integer indices
>>> adata = adata_.copy()
>>> subset_var(adata, [0, 1])
>>> adata.var_names.tolist()
['gene1', 'gene2']
>>>
>>> # Subset using boolean mask
>>> adata = adata_.copy()
>>> subset_var(adata, [True, False, True])
>>> adata.var_names.tolist()
['gene1', 'gene3']
>>>
>>> # Subset using query string
>>> adata = adata_.copy()
>>> subset_var(adata, 'gene_type == "type1"')
>>> adata.var_names.tolist()
['gene1', 'gene3']

Notes

  • The function modifies the AnnData object in-place
  • When using a boolean mask, its length must match the number of variables
  • When using integer indices, they must be valid indices for the variables
  • Invalid variable names or indices will raise KeyError or IndexError respectively
  • The order of variables in the output will match the order in the subset parameter

Metadata

transfer_metadata

transfer_metadata(adata: AnnData, group_key: str, source_group: str, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')

Transfer a metadata column from a source group to the rest of the cells.

Uses the k-nearest-neighbor graph (adata.obsp["connectivities"] and adata.obsp["distances"]) to propagate values from labeled cells (source_group) to unlabeled cells. Results are stored as new columns with the transferred_ prefix.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix with a computed neighbor graph. Modified in-place.

required
group_key str

Column in adata.obs identifying the groups (e.g. "batch").

required
source_group str

Value in adata.obs[group_key] whose cells serve as the labeled source. Cells in all other groups receive transferred values.

required
column str

Column in adata.obs containing the values to transfer (numeric, categorical, or boolean).

required
periodic bool

If True, treat column as a periodic variable (e.g. cell-cycle phase in [vmin, vmax]). Default is False.

False
vmin float

Minimum value for periodic wrapping. Default is 0.

0
vmax float

Maximum value for periodic wrapping. Default is 1.

1
min_neighs int

Minimum number of labeled neighbors required to assign a value. Cells with fewer labeled neighbors are left as NaN. Default is 5.

5
weight_by Literal['connectivity', 'distance', 'constant']

How to weight neighbors when aggregating values. "connectivity" uses the connectivity matrix; "distance" uses inverse distances; "constant" gives equal weight to all neighbors. Default is "connectivity".

"connectivity"

Returns:

Type Description
None

Adds transferred_{column} and transferred_{column}_error (or transferred_{column}_proportion for categorical columns) to adata.obs.
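
For a numeric column, the propagation step can be pictured as a weighted mean over labeled neighbors, with min_neighs acting as a gate. The NumPy sketch below is illustrative only: the 4-cell weight matrix is invented, a dense array stands in for adata.obsp["connectivities"], and keeping the source cells' own values in the output is an assumption.

```python
import numpy as np

# Toy weight matrix: entry [i, j] > 0 means cell j is a neighbor of cell i.
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 1.0],
    [0.0, 0.5, 1.0, 0.0],
])
values = np.array([2.0, 4.0, np.nan, np.nan])  # cells 0 and 1 form the source group
min_neighs = 2

labeled = ~np.isnan(values)
W_lab = W[:, labeled]                          # weights toward labeled cells only
n_labeled_neighs = (W_lab > 0).sum(axis=1)
weight_sum = W_lab.sum(axis=1)

# Weighted mean over labeled neighbors; NaN where too few are available.
estimate = np.full(values.shape, np.nan)
ok = n_labeled_neighs >= min_neighs
estimate[ok] = (W_lab[ok] @ values[labeled]) / weight_sum[ok]

# Assumption: source cells keep their own values, other cells take the estimate.
transferred = np.where(labeled, values, estimate)
```

Cell 2 has two labeled neighbors and receives their weighted mean, while cell 3 has only one and stays NaN.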


propagate_metadata

propagate_metadata(adata: AnnData, column: str, periodic: bool = False, vmin: float = 0, vmax: float = 1, min_neighs: int = 5, weight_by: Literal['connectivity', 'distance', 'constant'] = 'connectivity')

Fill missing values in a metadata column by propagation through the neighbor graph.

Cells that already have a value in column are used as anchors; NaN cells receive an estimated value from their labeled neighbors. Useful for imputing partially annotated metadata (e.g. pseudotime or cell-type labels) based on the k-NN graph structure.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix with a computed neighbor graph. Modified in-place.

required
column str

Column in adata.obs with partial values (NaNs to be filled).

required
periodic bool

If True, treat the variable as periodic (circular). Default is False.

False
vmin float

Minimum value for periodic wrapping. Default is 0.

0
vmax float

Maximum value for periodic wrapping. Default is 1.

1
min_neighs int

Minimum number of labeled neighbors required to assign a value. Default is 5.

5
weight_by Literal['connectivity', 'distance', 'constant']

Neighbor weighting scheme. Default is "connectivity".

"connectivity"

Returns:

Type Description
None

Fills NaN entries in adata.obs[column] in-place and adds an error/proportion column ({column}_error or {column}_proportion).
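
When periodic is True, a plain average gives misleading results near the wrap-around point. A common way to average periodic values is the circular mean, sketched below with NumPy. Whether propagate_metadata uses exactly this formula is not stated here, and circular_weighted_mean is a hypothetical helper written for this example.

```python
import numpy as np

def circular_weighted_mean(values, weights, vmin=0.0, vmax=1.0):
    """Weighted mean of periodic values on the interval [vmin, vmax)."""
    period = vmax - vmin
    angles = (np.asarray(values, dtype=float) - vmin) / period * 2.0 * np.pi
    w = np.asarray(weights, dtype=float)
    # Average unit vectors rather than raw values, then map back to the interval.
    s = np.sum(w * np.sin(angles))
    c = np.sum(w * np.cos(angles))
    mean_angle = np.arctan2(s, c) % (2.0 * np.pi)
    return vmin + mean_angle / (2.0 * np.pi) * period

# Two neighbors on either side of the wrap-around point of a [0, 1) variable.
naive = np.mean([0.9, 0.2])
wrapped = circular_weighted_mean([0.9, 0.2], [1.0, 1.0])
```

The naive mean lands at 0.55, on the far side of the circle, whereas the circular mean lands at 0.05, between the two neighbors.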


Utilities

pool_neighbors

pool_neighbors(adata: AnnData, *, key: str | None = None, key_periodic: bool = False, key_min: float | None = None, key_max: float | None = None, n_neighbors: Optional[int] = None, neighbors_key: str = 'neighbors', weighted: bool = False, directed: bool = True, key_added: Optional[str] = None, copy: bool = False) -> csr_matrix | ndarray | None

Given an adjacency matrix, pool cell features using a weighted sum of feature counts from neighboring cells. The weights are the normalized connectivities from the adjacency matrix.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix.

required
key str

Key in AnnData object to use for pooling. It can be a key in adata.obs, adata.layers, or adata.obsm. Defaults to None.

None
key_periodic bool

Whether to use periodic boundary conditions for pooling. It is only used if key is a key in adata.obs. Defaults to False.

False
key_min float

Minimum value for column in adata.obs to use for pooling. It is only used if key is a key in adata.obs. Defaults to None. Must be provided if key_periodic is True.

None
key_max float

Maximum value for column in adata.obs to use for pooling. It is only used if key is a key in adata.obs. Defaults to None. Must be provided if key_periodic is True.

None
n_neighbors int

Number of neighbors to consider. Defaults to None.

None
neighbors_key str

Key in AnnData object to use for neighbors. Defaults to "neighbors".

'neighbors'
weighted bool

Whether to weight neighbors by their connectivities in the adjacency matrix. Defaults to False.

False
directed bool

Whether to use directed or undirected neighbors. Defaults to True.

True
key_added str

Key to use in AnnData object for the pooled features. Defaults to None.

None
copy bool

Whether to return a copy of the pooled features instead of modifying the original AnnData object. Defaults to False.

False

Returns:

Type Description
csr_matrix | ndarray | None

The pooled features if copy is True, otherwise None.
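
Pooling with normalized connectivities amounts to a row-normalized matrix product. The NumPy sketch below shows the idea on dense toy arrays. It is not the library's implementation, and adding each cell to its own neighborhood before normalizing is an assumption made for the example.

```python
import numpy as np

# Toy connectivities: entry [i, j] is the weight of neighbor j for cell i.
A = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
# Toy feature matrix: 3 cells (rows) x 2 features (columns).
X = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [4.0, 4.0],
])

# Assumption: include each cell in its own neighborhood,
# then normalize each row so the weights sum to 1.
A_self = A + np.eye(A.shape[0])
weights = A_self / A_self.sum(axis=1, keepdims=True)

# Pooled features: weighted sum over each cell's neighborhood.
pooled = weights @ X
```

Since each row of weights sums to 1, the pooled values stay on the same scale as the input features.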