Tools¶

Advanced analysis functions available via sclab.tools.

Pseudotime & Trajectory (cellflow)¶

pseudotime ¶

pseudotime(adata: AnnData, use_rep: str, t_key: str, t_range: tuple[float, float], n_dims: int = 10, min_snr: float = 0.25, periodic: bool = False, method: Literal['fourier', 'splines'] = 'splines', largest_harmonic: int = 5, roughness: float | None = None, key_added='pseudotime') -> PseudotimeResult

Compute pseudotime ordering for cells by fitting a curve through a low-dimensional embedding.

Fits either a Fourier series or smoothing spline to a reduced-dimensional representation of the data, then projects each cell onto the nearest point along the fitted curve. The arc-length along that curve is used as the pseudotime coordinate, normalised to the range [0, 1].

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain `adata.obsm[use_rep]` and `adata.obs[t_key]`.	required
`use_rep`	`str`	Key in `adata.obsm` containing the low-dimensional embedding (e.g. `"X_pca"`) used to fit the pseudotime curve.	required
`t_key`	`str`	Key in `adata.obs` that holds an initial continuous ordering of cells (e.g. a coarse time label or an existing pseudotime estimate) used to initialise the curve fit.	required
`t_range`	`tuple[float, float]`	`(t_min, t_max)` interval of `t_key` values to consider. Cells outside this range are excluded from fitting and their pseudotime is set to `NaN`.	required
`n_dims`	`int`	Maximum number of embedding dimensions to use for the curve fit. Default is 10.	`10`
`min_snr`	`float`	Minimum signal-to-noise ratio (relative to the dimension with the highest SNR) required to include a dimension in the fit. Dimensions below this threshold are discarded. Default is 0.25.	`0.25`
`periodic`	`bool`	If `True`, treat the trajectory as periodic (cyclic). Requires `t_range[0] == 0.0` and either `method="fourier"` or `method="splines"` with periodic boundary conditions. Default is `False`.	`False`
`method`	`Literal['fourier', 'splines']`	Curve-fitting method. `"splines"` fits an N-D smoothing spline; `"fourier"` fits an N-D Fourier series (only valid when `periodic=True`). Default is `"splines"`.	`"splines"`
`largest_harmonic`	`int`	Highest harmonic to include when `method="fourier"`. Ignored for `method="splines"`. Default is 5.	`5`
`roughness`	`float or None`	Roughness penalty for the smoothing spline when `method="splines"`. If `None`, an automatic penalty is chosen. Default is `None`.	`None`
`key_added`	`str`	Base key under which results are stored. Default is `"pseudotime"`. The following entries are written to `adata`: `adata.obs[key_added]` -- arc-length pseudotime in [0, 1]. `adata.obs[key_added + "_path_residue"]` -- Euclidean distance from each cell to its nearest point on the fitted curve. `adata.obsm[key_added + "_path"]` -- fitted curve coordinates evaluated at each cell's projected pseudotime. `adata.obsm[key_added + "_path_derivative"]` -- first derivative of the fitted curve at each cell's projected pseudotime. `adata.uns[key_added]` -- dictionary of run parameters and SNR values.	`'pseudotime'`

Returns:

Type Description

PseudotimeResult

A named tuple with the following fields:

pseudotime -- arc-length pseudotime values for cells within t_range, normalised to [0, 1].
residues -- Euclidean residuals between each cell and its nearest curve point.
phi -- raw parameter values (in the original t_key units) of the nearest curve point for each cell.
F -- fitted curve object (NDBSpline or NDFourier) defined over the full embedding dimensionality.
SNR -- per-dimension signal-to-noise ratios, normalised so the maximum is 1.
snr_mask -- boolean mask indicating which dimensions passed the min_snr threshold.
t_mask -- boolean mask indicating which cells fall within t_range.
fp_resolution -- floating-point resolution used during the final pseudotime refinement stage.

Notes

Results for cells outside t_range are stored as NaN in adata.obs. The curve is fitted only on cells whose t_key value lies within [t_min, t_max].
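
The projection step described above can be illustrated with a minimal NumPy sketch. It does not call sclab: the quarter-circle `curve` function is a hypothetical stand-in for the fitted curve `F`, and the nearest point is found on a dense grid rather than by the library's refinement stage.

```python
import numpy as np

rng = np.random.default_rng(0)


def curve(t):
    """Hypothetical stand-in for a fitted curve F(t): a quarter circle in 2-D."""
    return np.column_stack([np.cos(t), np.sin(t)])


t_min, t_max = 0.0, np.pi / 2
grid_t = np.linspace(t_min, t_max, 2001)
grid_xy = curve(grid_t)

# Cumulative arc length along the densely sampled curve.
segments = np.linalg.norm(np.diff(grid_xy, axis=0), axis=1)
arc = np.concatenate([[0.0], np.cumsum(segments)])

# Noisy "cells" scattered around the curve.
true_t = rng.uniform(t_min, t_max, size=200)
cells = curve(true_t) + rng.normal(scale=0.02, size=(200, 2))

# Project each cell onto its nearest sampled point on the curve.
dist = np.linalg.norm(cells[:, None, :] - grid_xy[None, :, :], axis=2)
nearest = dist.argmin(axis=1)

phi = grid_t[nearest]                             # raw curve parameter
residues = dist[np.arange(len(cells)), nearest]   # distance to the curve
pseudotime = arc[nearest] / arc[-1]               # arc length scaled to [0, 1]
```

Here `phi`, `residues`, and `pseudotime` play the same roles as the identically named fields of `PseudotimeResult`, but only under the simplifying assumptions stated above.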

density_dynamics ¶

density_dynamics(adata: AnnData, time_key: str = 'pseudotime', t_range: tuple[float, float] | None = None, periodic: bool | None = None, bandwidth: float = 1 / 64, algorithm: str = 'auto', kernel: str = 'gaussian', metric: str = 'euclidean', max_grid_size: int = 2 ** 8 + 1, derivative: int = 0, mode: Literal['peaks', 'valleys'] = 'peaks', find_peaks_kwargs: dict = {}, plot_density: bool = False, plot_density_fit: bool = False, plot_density_fit_derivative: bool = False, plot_histogram: bool = False, histogram_nbins: int = 50)

Detect density peaks or valleys along pseudotime via B-spline fitting.

Fits a KDE to the pseudotime distribution, smooths it with a B-spline, optionally takes a derivative of the spline, and identifies peaks (or valleys) using scipy.signal.find_peaks. Detected peak times, heights, and inter-peak durations are stored in adata.uns.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain pseudotime values in `adata.obs[time_key]` and `adata.uns[time_key]['t_range']`.	required
`time_key`	`str`	Column in `adata.obs` that holds pseudotime values and key under which results are stored in `adata.uns`. Default is `"pseudotime"`.	`'pseudotime'`
`t_range`	`tuple of float`	`(t_min, t_max)` domain for the density estimate. When None, the value stored in `adata.uns[time_key]['t_range']` is used (an `AssertionError` is raised if that key is absent). Default is None.	`None`
`periodic`	`bool`	Whether pseudotime is periodic. When None, inferred from `adata.uns[time_key]['periodic']` if available, otherwise False. Default is None.	`None`
`bandwidth`	`float`	Bandwidth for the KDE. Default is `1/64`.	`1 / 64`
`algorithm`	`str`	Algorithm passed to the KDE back-end. Default is `"auto"`.	`'auto'`
`kernel`	`str`	Kernel function for the KDE. Default is `"gaussian"`.	`'gaussian'`
`metric`	`str`	Distance metric for the KDE. Default is `"euclidean"`.	`'euclidean'`
`max_grid_size`	`int`	Number of grid points for KDE evaluation. Default is `2**8 + 1`.	`2 ** 8 + 1`
`derivative`	`int`	Order of the B-spline derivative to analyse. `0` analyses the density itself; `1` analyses its rate of change, etc. Default is `0`.	`0`
`mode`	`Literal['peaks', 'valleys']`	Whether to detect peaks or valleys in the (derivative of the) density. Default is `"peaks"`.	`"peaks"`
`find_peaks_kwargs`	`dict`	Extra keyword arguments forwarded to `scipy.signal.find_peaks`. The `'height'` key, if present, is treated as a fraction of the global maximum and rescaled accordingly. Default is `{}`.	`{}`
`plot_density`	`bool`	If True, plot the raw KDE. Default is False.	`False`
`plot_density_fit`	`bool`	If True, plot the smoothed B-spline fit. Default is False.	`False`
`plot_density_fit_derivative`	`bool`	If True, plot the derivative of the B-spline. Default is False.	`False`
`plot_histogram`	`bool`	If True, overlay a histogram on the plot. Default is False.	`False`
`histogram_nbins`	`int`	Number of histogram bins. Default is `50`.	`50`

Returns:

Type Description

None

Modifies adata in-place. Results are stored under adata.uns[time_key][f'density_dynamics_d{derivative}_{mode}'] as a dict with keys:

'times' — pseudotime positions of detected peaks.
'deltas' — inter-peak durations (or phase durations for periodic data).
'heights' — density (or derivative) values at each peak.
'params' — KDE and peak-finding hyper-parameters.
'density_bspline_tck' — B-spline representation of the fitted density.
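
The KDE-then-peaks idea can be sketched with NumPy and SciPy alone. This is a simplified illustration, not the sclab implementation: it skips the B-spline smoothing, evaluates a Gaussian KDE by hand, and uses synthetic pseudotime values. The `find_peaks_kwargs` dictionary below is an illustrative choice showing how a fractional `'height'` could be rescaled to absolute units.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)

# Synthetic pseudotime values with two dense regions near 0.25 and 0.75.
t = np.concatenate([
    rng.normal(0.25, 0.05, 500),
    rng.normal(0.75, 0.05, 500),
])
t = t[(t >= 0.0) & (t <= 1.0)]

# Gaussian KDE on a regular grid (bandwidth and grid size mirror the
# documented defaults).
bandwidth = 1 / 64
grid = np.linspace(0.0, 1.0, 2**8 + 1)
z = (grid[:, None] - t[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(t) * bandwidth * np.sqrt(2 * np.pi))

# 'height' is given as a fraction of the global maximum and rescaled here;
# 'distance' (in grid points) suppresses minor wiggles close to a larger peak.
find_peaks_kwargs = {"height": 0.5, "distance": 40}
kwargs = dict(find_peaks_kwargs)
kwargs["height"] = kwargs["height"] * density.max()

peaks, props = find_peaks(density, **kwargs)

times = grid[peaks]               # pseudotime positions of the peaks
heights = props["peak_heights"]   # density values at the peaks
deltas = np.diff(times)           # inter-peak durations (non-periodic case)
```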

expression_dynamics ¶

expression_dynamics(adata: AnnData, time_key: str, t_range: tuple[float, float] | None = None, periodic: bool | None = None, layer: str | None = None, gene_mask: str | None = None, n_grid: int = 1001, progress: bool = False)

Compute per-cell gene turnover from expression dynamics over pseudotime.

Fits a smooth B-spline to the expression matrix over pseudotime, takes the analytical derivative (dX/dt), then counts the number of genes with high activation (rate > median of positives) and high repression (rate < median of negatives) for each cell.

Additionally computes per-gene timing summaries (pseudotime of peak activation, peak repression, acceleration onset, and deceleration onset) and a per-cell transcriptional flux (total absolute velocity across genes).

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain pseudotime values in `adata.obs[time_key]`.	required
`time_key`	`str`	Column in `adata.obs` with pseudotime values. If `adata.uns[time_key]` exists, `t_range` and `periodic` are read from it when not explicitly provided.	required
`t_range`	`tuple[float, float] \| None`	Min and max pseudotime for the spline domain. Inferred from `adata.uns[time_key]['t_range']` or the data range if None.	`None`
`periodic`	`bool \| None`	Whether pseudotime is periodic (e.g. cell cycle). Inferred from `adata.uns[time_key]['periodic']` or defaults to False.	`None`
`layer`	`str \| None`	Layer in `adata.layers` to use as expression matrix. Uses `adata.X` when None.	`None`
`gene_mask`	`str \| None`	Boolean column in `adata.var` to subset genes before fitting. When provided, output columns are prefixed with `{gene_mask}_` instead of the defaults.	`None`
`n_grid`	`int`	Number of evenly spaced points over `t_range` used to locate per-gene derivative extrema. Higher values give more precise timing estimates at modest computational cost.	`1001`
`progress`	`bool`	Show a progress bar during spline fitting.	`False`

Returns:

Type Description

None

Modifies adata in-place.

obs columns (per-cell):

n_activation / {gene_mask}_up — number of genes with velocity above the median of all positive velocities.
n_repression / {gene_mask}_dw — number of genes with velocity below the median of all negative velocities.
transcriptional_flux / {gene_mask}_flux — sum of absolute velocities across genes.

var columns (per-gene, restricted to gene_mask rows when provided):

peak_activation_t / {gene_mask}_peak_activation_t — pseudotime of maximum first derivative.
peak_repression_t / {gene_mask}_peak_repression_t — pseudotime of minimum first derivative.
acceleration_onset_t / {gene_mask}_acceleration_onset_t — pseudotime of maximum second derivative.
deceleration_onset_t / {gene_mask}_deceleration_onset_t — pseudotime of minimum second derivative.
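
The per-cell counts and the per-gene timing can be illustrated on toy data. The sketch below assumes the medians are taken over all positive (or negative) entries of the velocity matrix, as the column descriptions suggest, and it uses a finite-difference derivative on a grid where the library uses the analytical derivative of a B-spline.

```python
import numpy as np

# Hypothetical dX/dt matrix: rows are cells, columns are genes.
velocity = np.array([
    [0.1, 0.4, -0.2],
    [0.3, -0.1, -0.6],
    [0.0, 0.8, -0.4],
])

pos_median = np.median(velocity[velocity > 0])  # median of all positive rates
neg_median = np.median(velocity[velocity < 0])  # median of all negative rates

n_activation = (velocity > pos_median).sum(axis=1)
n_repression = (velocity < neg_median).sum(axis=1)
transcriptional_flux = np.abs(velocity).sum(axis=1)

# Per-gene timing: locate the extremum of the first derivative on a grid.
n_grid = 1001
grid = np.linspace(0.0, 1.0, n_grid)
expression = 1.0 / (1.0 + np.exp(-(grid - 0.3) / 0.05))  # toy sigmoidal gene
rate = np.gradient(expression, grid)
peak_activation_t = grid[np.argmax(rate)]
```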

real_time ¶

real_time(adata: AnnData, pseudotime_key: str = 'pseudotime', pseudotime_t_range: tuple[float, float] | None = None, periodic: bool | None = None, key_added: str = 'real_time', tmax: float = 100, units: Literal['minutes', 'hours', 'days', 'percent'] = 'percent', bandwidth: float = 1 / 64, algorithm: str = 'auto', kernel: str = 'gaussian', metric: str = 'euclidean', max_grid_size: int = 2 ** 8 + 1, plot_density: bool = False, plot_density_fit: bool = False, plot_density_fit_derivative: bool = False, plot_histogram: bool = False, histogram_nbins: int = 50)

Convert pseudotime to real time by normalising for cell-cycle density.

Fits a density profile along pseudotime (via density) and then maps each cell's pseudotime to a real-time value by integrating the inverse of the density curve (area-under-curve normalisation). This corrects for non-uniform sampling across the trajectory so that equal real-time intervals contain proportionally equal numbers of cells.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain pseudotime values in `adata.obs[pseudotime_key]`.	required
`pseudotime_key`	`str`	Column in `adata.obs` with pseudotime values. Default is `"pseudotime"`.	`'pseudotime'`
`pseudotime_t_range`	`tuple of float`	`(t_min, t_max)` domain of the pseudotime axis. When None, inferred from the data via `density`. Default is None.	`None`
`periodic`	`bool`	Whether pseudotime is periodic. When None, inferred from `adata.uns[pseudotime_key]['periodic']` if available, otherwise False. Default is None.	`None`
`key_added`	`str`	Column in `adata.obs` and key in `adata.uns` under which the real-time values and metadata are stored. Default is `"real_time"`.	`'real_time'`
`tmax`	`float`	Maximum real-time value (upper bound of the output axis). Cells at the very end of the trajectory are mapped to this value. Default is `100`.	`100`
`units`	`Literal['minutes', 'hours', 'days', 'percent']`	Interpretive label for the real-time axis; stored in `adata.uns[key_added]['t_units']` but does not affect the computation. Default is `"percent"`.	`"percent"`
`bandwidth`	`float`	Bandwidth for the KDE. Default is `1/64`.	`1 / 64`
`algorithm`	`str`	Algorithm passed to the KDE back-end. Default is `"auto"`.	`'auto'`
`kernel`	`str`	Kernel function for the KDE. Default is `"gaussian"`.	`'gaussian'`
`metric`	`str`	Distance metric for the KDE. Default is `"euclidean"`.	`'euclidean'`
`max_grid_size`	`int`	Number of grid points for KDE evaluation. Default is `2**8 + 1`.	`2 ** 8 + 1`
`plot_density`	`bool`	If True, plot the raw KDE. Default is False.	`False`
`plot_density_fit`	`bool`	If True, plot the smoothed B-spline fit. Default is False.	`False`
`plot_density_fit_derivative`	`bool`	If True, plot the derivative of the B-spline. Default is False.	`False`
`plot_histogram`	`bool`	If True, overlay a histogram on the plot. Default is False.	`False`
`histogram_nbins`	`int`	Number of histogram bins. Default is `50`.	`50`

Returns:

Type	Description
`None`	Modifies adata in-place: `adata.obs[key_added]` — real-time values for each cell. Cells outside `pseudotime_t_range` are assigned NaN. `adata.uns[key_added]` — dict containing fitting parameters, the B-spline TCK representation, `'tmax'`, `'t_range'`, `'t_units'`, and `'periodic'`.
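
As a rough illustration of the goal that equal real-time intervals contain equal numbers of cells, the sketch below maps pseudotime through the cumulative fraction of cells and scales the result to `tmax`. This is an assumption about the mapping, not the sclab implementation: it replaces the KDE and B-spline fit with a plain histogram, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pseudotime: cells pile up early in the trajectory.
pseudotime = rng.beta(2.0, 5.0, size=2000)

t_min, t_max = 0.0, 1.0
tmax = 100.0  # output axis runs from 0 to tmax (units: "percent")

# Histogram-based density estimate on a regular grid.
edges = np.linspace(t_min, t_max, 65)
counts, _ = np.histogram(pseudotime, bins=edges)

# Cumulative fraction of cells at each bin edge (area under the density).
cdf = np.concatenate([[0.0], np.cumsum(counts)]) / counts.sum()

# Map each cell's pseudotime to real time by interpolating the cumulative area.
real_time = tmax * np.interp(pseudotime, edges, cdf)

# After the mapping, cells are spread roughly evenly over the real-time axis.
decile_counts, _ = np.histogram(real_time, bins=np.linspace(0.0, tmax, 11))
```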

piecewise_rescale ¶

piecewise_rescale(adata: AnnData, time_key: str, groupby: str, groups: Sequence[str], durations: list[float] | dict[str, float], new_key: str = 'real_time', periodic: bool = False, t_range: tuple[float, float] | None = None) -> None

Rescale pseudotime to real-time using piecewise linear mapping.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix.	required
`time_key`	`str`	Key in `adata.obs` for pseudotime.	required
`groupby`	`str`	Key in `adata.obs` for categorical labels used to define intervals.	required
`groups`	`Sequence[str]`	Ordered list of category labels to include in the scaling. Cells belonging to other categories will be assigned NaN.	required
`durations`	`list[float] \| dict[str, float]`	Durations for each interval defined by `groups`. If a list, its length must equal the number of intervals (`len(groups)`). If a dictionary, it must map category labels to durations.	required
`new_key`	`str`	Key in `adata.obs` to store the rescaled real-time values.	`'real_time'`
`periodic`	`bool`	Whether the trajectory is periodic.	`False`
`t_range`	`tuple[float, float] \| None`	Range of pseudotime. If None, inferred from `adata.obs[time_key]`.	`None`
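
A piecewise linear mapping of this kind can be sketched with `numpy.interp`. The interval boundaries below are hypothetical values supplied by hand; how sclab derives them from `groupby` and `groups` is not shown here.

```python
import numpy as np

# Ordered groups with hypothetical pseudotime boundaries between them.
groups = ["G1", "S", "G2M"]
boundaries = np.array([0.0, 0.5, 0.8, 1.0])     # len(groups) + 1 edges
durations = {"G1": 10.0, "S": 8.0, "G2M": 4.0}  # e.g. hours

# Cumulative real time reached at each boundary.
real_edges = np.concatenate([[0.0], np.cumsum([durations[g] for g in groups])])

# Within each interval, pseudotime is stretched linearly to fit its duration.
pseudotime = np.array([0.0, 0.25, 0.5, 0.65, 0.9, 1.0])
real_time = np.interp(pseudotime, boundaries, real_edges)
```

With these numbers, a cell halfway through the hypothetical S interval (pseudotime 0.65) lands at 14 hours: 10 hours of G1 plus half of the 8-hour S phase.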

Doublet Detection¶

scrublet ¶

scrublet(adata: AnnData, layer: str = 'X', key_added: str = 'scrublet', total_counts: ndarray | None = None, sim_doublet_ratio: float = 2.0, n_neighbors: int = None, expected_doublet_rate: float = 0.1, stdev_doublet_rate: float = 0.02, random_state: int = 0, scrub_doublets_kwargs: dict[str, Any] = dict(synthetic_doublet_umi_subsampling=1.0, use_approx_neighbors=True, distance_metric='euclidean', get_doublet_neighbor_parents=False, min_counts=3, min_cells=3, min_gene_variability_pctl=85, log_transform=False, mean_center=True, normalize_variance=True, n_prin_comps=30, svd_solver='arpack', verbose=True))

Detect doublet cells using Scrublet.

Simulates synthetic doublets from the observed count matrix and uses a k-NN classifier to assign each cell a doublet score. Cells are then labelled as "singlet" or "doublet".

Requires scrublet to be installed (pip install scrublet).

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Modified in-place.	required
`layer`	`str`	Layer to use as the count matrix. Use `"X"` for `adata.X`. Default is `"X"`.	`'X'`
`key_added`	`str`	Prefix for the columns added to `adata.obs`. Results are stored as `{key_added}_score` and `{key_added}_label`. Default is `"scrublet"`.	`'scrublet'`
`total_counts`	`ndarray or None`	Pre-computed per-cell total counts. If None, Scrublet computes them internally. Default is None.	`None`
`sim_doublet_ratio`	`float`	Number of synthetic doublets to simulate relative to the number of observed cells. Default is 2.0.	`2.0`
`n_neighbors`	`int or None`	Number of neighbors used to classify doublets. If None, Scrublet uses a heuristic based on the number of cells. Default is None.	`None`
`expected_doublet_rate`	`float`	Expected fraction of doublets in the dataset. Default is 0.1.	`0.1`
`stdev_doublet_rate`	`float`	Uncertainty in the expected doublet rate. Default is 0.02.	`0.02`
`random_state`	`int`	Random seed for reproducibility. Default is 0.	`0`
`scrub_doublets_kwargs`	`dict`	Additional keyword arguments forwarded to `scrublet.Scrublet.scrub_doublets`.	`dict(synthetic_doublet_umi_subsampling=1.0, use_approx_neighbors=True, distance_metric='euclidean', get_doublet_neighbor_parents=False, min_counts=3, min_cells=3, min_gene_variability_pctl=85, log_transform=False, mean_center=True, normalize_variance=True, n_prin_comps=30, svd_solver='arpack', verbose=True)`

Returns:

Type	Description
`None`	Adds the following columns to `adata.obs`: `{key_added}_score` (float): Doublet score for each cell. `{key_added}_label` (Categorical): `"singlet"` or `"doublet"`.
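
The simulate-then-classify idea can be sketched in a few lines of NumPy. This is a deliberately simplified illustration that does not call Scrublet: it skips normalisation, gene filtering, and PCA, and the score is just the fraction of synthetic doublets among each cell's nearest neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix: 100 cells x 20 genes.
counts = rng.poisson(2.0, size=(100, 20))
n_cells = counts.shape[0]

sim_doublet_ratio = 2.0
n_sim = int(sim_doublet_ratio * n_cells)

# Each synthetic doublet is the sum of two randomly chosen observed cells.
parents = rng.integers(0, n_cells, size=(n_sim, 2))
synthetic = counts[parents[:, 0]] + counts[parents[:, 1]]

# Brute-force k-NN over observed plus synthetic profiles.
combined = np.vstack([counts, synthetic]).astype(float)
is_synthetic = np.r_[np.zeros(n_cells, dtype=bool), np.ones(n_sim, dtype=bool)]

k = 10
dist = np.linalg.norm(combined[:n_cells, None, :] - combined[None, :, :], axis=2)
dist[np.arange(n_cells), np.arange(n_cells)] = np.inf  # ignore self-matches
neighbours = np.argsort(dist, axis=1)[:, :k]

# Simplified doublet score: share of synthetic doublets among the neighbours.
score = is_synthetic[neighbours].mean(axis=1)
```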

doubletdetection ¶

doubletdetection(adata: AnnData, layer: str = 'X', key_added: str = 'doubletdetection', boost_rate=0.25, n_components=30, n_top_var_genes=10000, replace=False, clustering_algorithm='phenograph', clustering_kwargs=None, n_iters=10, normalizer=None, pseudocount=0.1, random_state=0, verbose=False, standard_scaling=False, n_jobs=1) -> None

scdblfinder ¶

scdblfinder(adata: AnnData, layer: str = 'X', key_added: str = 'scDblFinder', clusters_col: str | bool | None = None, samples_col: str | None = None, clust_cor: ndarray | int | None = None, artificial_doublets: int | None = None, known_doublets_col: int | None = None, known_use: Literal['discard', 'positive'] = 'discard', dbr: float | None = None, dbr_sd: float | None = None, nfeatures: int = 1352, dims: int = 20, k: int | None = None, remove_unidentifiable: bool = True, include_pcs: int = 19, prop_random=0, prop_markers=0, aggregate_features: bool = False, score: Literal['xgb', 'weighted', 'ratio'] = 'xgb', processing: str = 'default', metric: str = 'logloss', nrounds: float = 0.25, max_depth: int = 4, iter: int = 3, training_features: list[str] | None = None, unident_th: float | None = None, multi_sample_mode: Literal['split', 'singleModel', 'singleModelSplitThres', 'asOne'] = 'split', threshold: bool = True, verbose: bool = True, random_state: int = 31415)

Cell Type Labeling¶

classify_cells ¶

classify_cells(adata: AnnData, markers: DataFrame, marker_class_key: Optional[str] = None, cluster_key: Optional[str] = None, layer: Optional[str] = None, key_added: Optional[str] = None, threshold: float = 0.25, penalize_non_specific: bool = True, neighbors_key: Optional[str] = None, save_scores: bool = False)

Classify cells based on a set of marker genes.

Ianevski, A., Giri, A.K. & Aittokallio, T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 13, 1246 (2022). https://doi.org/10.1038/s41467-022-28803-w

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object.	required
`markers`	`DataFrame`	Marker genes.	required
`marker_class_key`	`Optional[str]`	Column in `markers` that contains the cell type information.	`None`
`cluster_key`	`Optional[str]`	Column in `adata.obs` that contains the cluster information. If not provided, the classification will be performed on a cell by cell basis, pooling across neighbor cells. This pooling can be avoided by setting `force_pooling` to `False`.	`None`
`layer`	`Optional[str]`	Layer to use for classification. Defaults to `X`.	`None`
`key_added`	`Optional[str]`	Key under which to add the classification information.	`None`
`threshold`	`float`	Confidence threshold for classification. Defaults to `0.25`.	`0.25`
`penalize_non_specific`	`bool`	Whether to penalize non-specific markers. Defaults to `True`.	`True`
`neighbors_key`	`Optional[str]`	If provided, counts will be pooled across neighbor cells using the distances in `adata.uns[neighbors_key]["distances"]`. Defaults to `None`.	`None`
`save_scores`	`bool`	Whether to save the classification scores. Defaults to `False`.	`False`

Returns:

Type Description

None

Results are written in place to adata.obs[key_added] (category dtype, pd.NA for calls below threshold) and adata.obs[key_added + "_noNA"] (best-guess label regardless of confidence). key_added defaults to marker_class_key when not given. If save_scores=True, also writes adata.obs[key_added + "_score"] (max per-cell confidence score) and adata.obsm[key_added + "_scores"] (full class-by-cell score matrix).
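
The relationship between the thresholded and best-guess label columns can be illustrated with pandas. The score table below is hypothetical, and the comparison `confidence >= threshold` is an assumption about how ties at the threshold are handled.

```python
import pandas as pd

# Hypothetical per-cell confidence scores (rows: cells, columns: classes).
scores = pd.DataFrame(
    [[0.80, 0.10], [0.20, 0.15], [0.05, 0.60]],
    index=["cell_0", "cell_1", "cell_2"],
    columns=["T cell", "B cell"],
)
threshold = 0.25

best = scores.idxmax(axis=1)     # best-guess label, always defined
confidence = scores.max(axis=1)  # maximum score per cell

# Best guess regardless of confidence (analogous to key_added + "_noNA").
labels_noNA = best.astype("category")

# Calls below the threshold become missing (analogous to key_added).
labels = best.where(confidence >= threshold, other=pd.NA).astype("category")
```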

Differential Expression¶

pseudobulk_edger ¶

pseudobulk_edger(adata_: AnnData, group_key: str, condition_group: str | list[str] | None = None, reference_group: str | None = None, cell_identity_key: str | None = None, batch_key: str | None = None, layer: str | None = None, replicas_per_group: int = 5, min_cells_per_group: int = 30, bootstrap_sampling: bool = False, use_cells: dict[str, list[str]] | None = None, aggregate: bool = True, verbosity: int = 0) -> dict[str, DataFrame]

Fits a model using edgeR and computes top tags for a given condition vs reference group.

Parameters:

Name	Type	Description	Default
`adata_`	`AnnData`	Annotated data matrix.	required
`group_key`	`str`	Key in AnnData object to use to group cells.	required
`condition_group`	`str \| list[str] \| None`	Condition group to compare to reference group. If None, each group will be contrasted to the corresponding reference group.	`None`
`reference_group`	`str \| None`	Reference group to compare condition group(s) to. If None, the condition group is compared to the rest of the cells.	`None`
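`cell_identity_key`	`str \| None`	If provided, separate contrasts will be computed for each identity. Defaults to None.	`None`
`batch_key`	`str \| None`	Column in `adata_.obs` to include as a covariate in the design matrix for batch correction. Defaults to None.	`None`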
`layer`	`str \| None`	Layer in AnnData object to use. EdgeR requires raw counts. Defaults to None.	`None`
`replicas_per_group`	`int`	Number of replicas to create for each group. Defaults to 5.	`5`
`min_cells_per_group`	`int`	Minimum number of cells required for a group to be included. Defaults to 30.	`30`
`bootstrap_sampling`	`bool`	Whether to use bootstrap sampling to create replicas. Defaults to False.	`False`
`use_cells`	`dict[str, list[str]] \| None`	If not None, only use the specified cells. Defaults to None. Dictionary key is a categorical variable in the obs dataframe and the dictionary value is a list of categories to include.	`None`
`aggregate`	`bool`	Whether to aggregate cells before fitting the model. EdgeR requires a small number of samples, so if adata_ is a single-cell experiment, the cells should be aggregated. Defaults to True.	`True`
`verbosity`	`int`	Verbosity level. Defaults to 0.	`0`

Returns:

Type Description

dict[str, DataFrame]

Dictionary of dataframes, one for each contrast, with the following columns:

gene_ids (str) -- gene IDs.
logFC (float) -- log2 fold change.
logCPM (float) -- log2 CPM.
F (float) -- F-statistic.
PValue (float) -- p-value.
FDR (float) -- false discovery rate.
pct_expr_cnd (float) -- percentage of cells in the condition group expressing the gene.
pct_expr_ref (float) -- percentage of cells in the reference group expressing the gene.

pseudobulk_limma ¶

pseudobulk_limma(adata: AnnData, group_key: str, condition_group: str | list[str] | None = None, reference_group: str | None = None, cell_identity_key: str | None = None, batch_key: str | None = None, layer: str | None = None, replicas_per_group: int = 5, min_cells_per_group: int = 30, bootstrap_sampling: bool = False, use_cells: dict[str, list[str]] | None = None, aggregate: bool = True, verbosity: int = 0) -> dict[str, DataFrame]

Pseudobulk differential expression analysis using limma-voom.

Aggregates single cells into pseudobulk samples, then fits a linear model with limma-voom (via R) and computes top-table statistics for each requested contrast.

Requires R with the packages limma, edgeR, MAST, and SingleCellExperiment, as well as the Python packages rpy2 and anndata2ri.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix.	required
`group_key`	`str`	Column in `adata.obs` defining the experimental groups.	required
`condition_group`	`str or list of str or None`	Group(s) to test against `reference_group`. If None, each group is contrasted with the corresponding reference. Default is None.	`None`
`reference_group`	`str or None`	Reference group for contrasts. If None, each condition group is contrasted with all remaining cells. Default is None.	`None`
`cell_identity_key`	`str or None`	Column in `adata.obs` for stratifying contrasts by cell type or identity. Separate DE results are returned per identity. Default is None.	`None`
`batch_key`	`str or None`	Column in `adata.obs` to include as a covariate in the design matrix for batch correction. Default is None.	`None`
`layer`	`str or None`	Layer containing raw counts required by limma/edgeR. Uses `adata.X` if None. Default is None.	`None`
`replicas_per_group`	`int`	Number of pseudobulk replicas to create per group. Default is 5.	`5`
`min_cells_per_group`	`int`	Minimum number of cells required for a group to be included. Default is 30.	`30`
`bootstrap_sampling`	`bool`	If True, use bootstrap sampling when creating pseudobulk replicas. Default is False.	`False`
`use_cells`	`dict or None`	Restrict analysis to specific cell subsets. Keys are `adata.obs` columns and values are lists of categories to include. Default is None.	`None`
`aggregate`	`bool`	If True, aggregate cells into pseudobulk samples before fitting. Default is True.	`True`
`verbosity`	`int`	Verbosity level (0 = silent). Default is 0.	`0`

Returns:

Type	Description
`dict of str to pd.DataFrame`	One DataFrame per contrast (keyed by contrast label), with columns: `logFC` — log2 fold change. `AveExpr` — average log2 expression. `t` — moderated t-statistic. `P.Value` — raw p-value. `adj.P.Val` — Benjamini-Hochberg adjusted p-value. `B` — log-odds of differential expression. `pct_expr_cnd` / `pct_expr_ref` — fraction of expressing cells in condition/reference group.

Utilities¶

aggregate_and_filter ¶

aggregate_and_filter(adata: AnnData, group_key: str = 'batch', cell_identity_key: str | None = None, layer: str | None = None, replicas_per_group: int = 3, min_cells_per_group: int = 30, bootstrap_sampling: bool = False, use_cells: dict[str, list[str]] | None = None, make_stats: bool = True, make_dummies: bool = True) -> AnnData

Aggregate and filter cells in an AnnData object into cell populations.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object to aggregate and filter.	required
`group_key`	`str`	Key to group cells by. Defaults to 'batch'.	`'batch'`
`cell_identity_key`	`str \| None`	Key to use to identify cell identities. Defaults to None.	`None`
`layer`	`str \| None`	Layer in AnnData object to use for aggregation. Defaults to None.	`None`
`replicas_per_group`	`int`	Number of replicas to create for each group. Defaults to 3.	`3`
`min_cells_per_group`	`int`	Minimum number of cells required for a group to be included. Defaults to 30.	`30`
`bootstrap_sampling`	`bool`	Whether to use bootstrap sampling to create replicas. Defaults to False.	`False`
`use_cells`	`dict[str, list[str]] \| None`	If not None, only use the specified cells. Defaults to None.	`None`
`make_stats`	`bool`	Whether to create expression statistics for each group. Defaults to True.	`True`
`make_dummies`	`bool`	Whether to make categorical columns into dummies. Defaults to True.	`True`

Returns:

Type	Description
`AnnData`	AnnData object with aggregated and filtered cells.
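
The aggregation idea can be sketched with NumPy on a toy count matrix. This is an illustration under simplifying assumptions, not the sclab implementation: replicas are formed by shuffling each group's cells and splitting them into disjoint parts (the non-bootstrap case), and the group labels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

counts = rng.poisson(1.0, size=(120, 5))  # cells x genes
group = np.array(["ctrl"] * 70 + ["treated"] * 40 + ["rare"] * 10)

replicas_per_group = 3
min_cells_per_group = 30

rows, labels = [], []
for g in np.unique(group):
    idx = np.flatnonzero(group == g)
    if len(idx) < min_cells_per_group:
        continue  # group too small: filtered out
    # Shuffle the group's cells and split them into disjoint replicas.
    parts = np.array_split(rng.permutation(idx), replicas_per_group)
    for r, part in enumerate(parts):
        rows.append(counts[part].sum(axis=0))  # sum counts within the replica
        labels.append(f"{g}_rep{r}")

pseudobulk = np.vstack(rows)  # replicas x genes
```

The `rare` group has only 10 cells, so it is dropped, leaving three replicas each for `ctrl` and `treated`.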

call_differential_expression ¶

call_differential_expression(table: DataFrame, pvalue_col: str, logfc_col: str, pct_col_prefix: str = 'pct_', max_pval: float = 0.05, min_robust_z_level: float = 2.5, min_pct: float = 0.05, contrast_key: str | None = None, copy: bool = False)

Call differentially expressed genes from robust Z-scores of log-fold-change.

Flags each row in table as up- or down-regulated by combining a p-value cutoff, a robust Z-score threshold on logfc_col (median-absolute-deviation based; DOI: 10.1080/01621459.1993.10476408), and a minimum expression percentage across any pct_*-prefixed column.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Differential expression results, one row per gene/feature.	required
`pvalue_col`	`str`	Column in `table` with (adjusted) p-values.	required
`logfc_col`	`str`	Column in `table` with log-fold-changes.	required
`pct_col_prefix`	`str`	Prefix of columns giving percent-expressed values; the max across all matching columns is used as the expression filter. Defaults to "pct_".	`'pct_'`
`max_pval`	`float`	Maximum p-value for a gene to be called DE. Defaults to 0.05.	`0.05`
`min_robust_z_level`	`float`	Minimum absolute robust Z-score of `logfc_col` for a gene to be called DE. Defaults to 2.5.	`2.5`
`min_pct`	`float`	Minimum percent-expressed value for a gene to be called DE. Defaults to 0.05.	`0.05`
`contrast_key`	`str or None`	If provided, robust Z-scores are computed independently within each `table.groupby(contrast_key)` group rather than across the whole table. Defaults to None.	`None`
`copy`	`bool`	If True, operate on and return a copy of `table` instead of modifying it in place. Defaults to False.	`False`

Returns:

Type	Description
`DataFrame or None`	`table` with two new columns: `robust_Z` (the computed robust Z-score) and `DE` (1 up-regulated, -1 down-regulated, 0 not significant). Returned only if `copy=True`; otherwise `table` is modified in place and `None` is returned.
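
The calling rule can be sketched with NumPy. The robust Z-score below uses the median and the scaled median absolute deviation (factor 1.4826); this is an assumption for illustration, and the exact scale estimator used by sclab may differ. The input values are made up.

```python
import numpy as np

logfc = np.array([-3.0, -0.2, -0.1, 0.0, 0.1, 0.2, 4.0])
pval = np.array([0.001, 0.5, 0.6, 0.9, 0.7, 0.4, 0.20])
pct = np.array([0.30, 0.40, 0.20, 0.10, 0.30, 0.50, 0.60])

max_pval, min_robust_z_level, min_pct = 0.05, 2.5, 0.05

# Robust Z-score: centre on the median, scale by 1.4826 * MAD.
median = np.median(logfc)
mad = np.median(np.abs(logfc - median))
robust_z = (logfc - median) / (1.4826 * mad)

# A gene is called DE only if all three filters pass.
passes = (
    (pval <= max_pval)
    & (np.abs(robust_z) >= min_robust_z_level)
    & (pct >= min_pct)
)

# 1 = up-regulated, -1 = down-regulated, 0 = not significant.
de = np.where(passes, np.sign(robust_z), 0).astype(int)
```

In this toy table the last gene has the largest robust Z-score but fails the p-value cutoff, so only the first gene is called (as down-regulated).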