Transcriptional Data Processing

pipeline.code.dpreprocess.ahba_.extract_AHBA_data(atlas_path, atlas_info_path, lr_mirror='bidirectional', gene_norm='srs', sample_norm='srs', return_report=True, return_counts=True, return_donors=True, norm_matched=False, ibf_threshold=0, region_agg=None)[source]

Extracts Allen Human Brain Atlas (AHBA) gene expression data based on a given brain atlas. For more information, see the abagen documentation: https://abagen.readthedocs.io/en/stable/generated/abagen.get_expression_data.html

Parameters:

atlas_path (str) – Path to the human brain atlas NIfTI file.
atlas_info_path (str) – Path to a CSV file containing atlas region information with columns: ‘Anatomical Name’ and ‘Atlas Index’.
lr_mirror ({'left', 'right', 'bidirectional'}, optional) – How to mirror samples across hemispheres. Default is ‘bidirectional’.
gene_norm (str, optional) – Method by which to normalize microarray expression values for each donor. Default is ‘srs’.
sample_norm (str, optional) – Method by which to normalize microarray expression values for each sample. Default is ‘srs’.
return_report (bool, optional) – Whether to return a string containing longform text describing the processing procedures used to generate the expression DataFrames returned by this function. Default is True.
return_counts (bool, optional) – Whether to return a DataFrame containing information on how many samples were assigned to each parcel in the atlas for each donor. Default is True.
return_donors (bool, optional) – Whether to return donor-level expression arrays instead of expression aggregated across donors. Default is True.
norm_matched (bool, optional) – Whether to perform gene normalization (gene_norm) across only those samples matched to regions in atlas instead of all available samples. Default is False.
ibf_threshold (float, optional) – Intensity-based filtering threshold. Default is 0.
region_agg (str or None, optional) – Mechanism by which to reduce sample-level expression data into region-level expression. Default is None.

Returns:

If return_donors=True, returns a dictionary where keys are donor IDs and values are gene expression DataFrames (regions × genes). If return_donors=False, returns a single aggregated DataFrame.

Return type:

dict[str, pandas.DataFrame] or pandas.DataFrame

Examples

>>> expr_data = extract_AHBA_data(
...     '/path/to/atlas.nii.gz',
...     '/path/to/atlas_info.csv'
... )
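
With return_donors=True the result is keyed by donor. The following is a minimal sketch of collapsing such a dictionary into a single regions × genes table. The donor IDs, region labels, gene names, and values are invented for illustration, and averaging across donors is only one possible choice, not necessarily what the pipeline does downstream.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned with return_donors=True:
# donor ID -> DataFrame (regions x genes). All values are invented.
regions = ["region_1", "region_2"]
donor_expr = {
    "donor_A": pd.DataFrame({"GENE1": [1.0, 3.0], "GENE2": [0.0, 2.0]}, index=regions),
    "donor_B": pd.DataFrame({"GENE1": [3.0, 5.0], "GENE2": [2.0, 4.0]}, index=regions),
}

# Stack donors into a (donor, region) MultiIndex, then average over donors per region.
stacked = pd.concat(donor_expr, names=["donor", "region"])
mean_expr = stacked.groupby(level="region").mean()
```

Here mean_expr has one row per region and one column per gene.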

class pipeline.code.dpreprocess.sn_.Preprocess_Sn(adata=None, min_genes=200, min_cells=3, total_UMIs=800, log_base=2, target_sum=10000.0, exclude_highly_expressed=True, max_fraction=0.05, n_jobs=60, regress_out=True, combat=False, num_samples=100, max_value=10)[source]

Bases: object

Preprocessing pipeline for single-nucleus RNA-seq data based on the Scanpy framework.

Implements quality control, normalization, batch correction, gene selection, intra-regional smoothing, and scaling, with optional saving of the processed data.

Parameters:

adata (AnnData, optional) – Initial AnnData object containing expression matrix.
min_genes (int, optional) – Minimum number of genes expressed per cell (default is 200).
min_cells (int, optional) – Minimum number of cells a gene must be expressed in (default is 3).
total_UMIs (int, optional) – Minimum total UMI count required per cell (default is 800).
log_base (float, optional) – Base of logarithm used in log transformation (default is 2).
target_sum (float, optional) – Target total count after normalization (default is 1e4).
exclude_highly_expressed (bool, optional) – Whether to exclude highly expressed genes during normalization (default is True).
max_fraction (float, optional) – Maximum expression fraction to define ‘highly expressed’ genes (default is 0.05).
n_jobs (int, optional) – Number of threads for parallel computation (default is 60).
regress_out (bool, optional) – Whether to regress out technical covariates (e.g., total UMIs) (default is True).
combat (bool, optional) – Whether to perform batch effect correction using ComBat (default is False).
num_samples (int, optional) – Number of resampling iterations for intra-region smoothing (default is 100).
max_value (float, optional) – Maximum value for scaling (clipping) of gene expression (default is 10).

dataset_qc()[source]

Perform quality control filtering on the dataset.

This function filters cells and genes based on quality metrics:

- Calculates total UMIs per cell and adds it to adata.obs.
- Filters out cells with fewer than min_genes expressed genes.
- Filters out genes expressed in fewer than min_cells cells.
- Removes cells with total UMIs below the threshold total_UMIs.
- Removes genes with zero expression across all remaining cells.

Returns:

The filtered AnnData object.

Return type:

AnnData
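
The order of the filters can be illustrated on a plain count matrix. This is a sketch in NumPy rather than the method's actual Scanpy-based code, and the toy matrix and the small thresholds are invented so that each filter has a visible effect.

```python
import numpy as np

# Toy count matrix (cells x genes); thresholds are shrunk to suit the toy data.
counts = np.array([
    [5, 0, 4, 0],
    [0, 0, 1, 0],
    [4, 2, 6, 0],
    [7, 1, 0, 0],
])
min_genes, min_cells, min_total_umis = 2, 2, 9

# Total UMIs per cell, computed up front as in the first step.
total_per_cell = counts.sum(axis=1)

# Drop cells expressing fewer than min_genes genes.
keep = (counts > 0).sum(axis=1) >= min_genes
counts, total_per_cell = counts[keep], total_per_cell[keep]

# Drop genes expressed in fewer than min_cells of the remaining cells.
counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]

# Drop cells whose total UMI count is below the threshold.
counts = counts[total_per_cell >= min_total_umis]

# Drop genes with no counts left in the remaining cells.
counts = counts[:, counts.sum(axis=0) > 0]
```

Two cells and three genes survive in this toy case.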

dataset_normalized()[source]

Normalize the single-nucleus RNA-seq dataset.

This method performs total-count normalization to scale counts per cell to a common target sum, optionally excluding highly expressed genes, followed by log-transformation with a specified logarithm base.

Returns:

The normalized AnnData object.

Return type:

AnnData
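
The arithmetic behind this step can be sketched in NumPy. The sketch assumes the method follows the behaviour of Scanpy's normalize_total and log1p: a gene counts as highly expressed if it exceeds max_fraction of the total counts in at least one cell, and such genes are left out when computing each cell's size factor. The function name normalize_log is hypothetical.

```python
import numpy as np

def normalize_log(counts, target_sum=1e4, exclude_highly_expressed=True,
                  max_fraction=0.05, log_base=2):
    """Total-count normalize each cell, then apply log(1 + x) in the given base."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if exclude_highly_expressed:
        # Genes above max_fraction of the total in any cell are ignored
        # when computing the per-cell size factor (assumed Scanpy behaviour).
        high = (counts > max_fraction * totals).any(axis=0)
        size = counts[:, ~high].sum(axis=1, keepdims=True)
    else:
        size = totals
    size[size == 0] = 1.0  # avoid division by zero for empty cells
    scaled = counts / size * target_sum
    return np.log1p(scaled) / np.log(log_base)

toy = [[1, 3], [2, 2]]
out = normalize_log(toy, target_sum=4, exclude_highly_expressed=False)
```

With target_sum=4 the toy counts are already normalized, so the first row of out is log2(1 + 1) and log2(1 + 3).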

dataset_common_genes(gene_list=None)[source]

Filter the dataset to keep only genes common to the provided gene list.

Parameters:

gene_list (list, optional) – List of gene names to intersect with the dataset’s genes. If None, no filtering is applied (default is None).

Returns:

The filtered AnnData object containing only the common genes.

Return type:

AnnData

dataset_batch_correct()[source]

Perform batch effect correction. This includes optional regression of total UMIs (regress_out) and optional batch correction using ComBat (combat).

Returns:

The AnnData object after batch correction.

Return type:

AnnData
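
Regressing out a covariate such as total UMIs amounts to fitting a linear model per gene and keeping the residuals. The sketch below shows that idea with ordinary least squares in NumPy. It is an illustration under that assumption, not the method's own code; the helper name regress_out_covariate and the simulated data are hypothetical. ComBat is not sketched here.

```python
import numpy as np

def regress_out_covariate(expr, covariate):
    """Return per-gene residuals after an OLS fit on an intercept plus one covariate."""
    expr = np.asarray(expr, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    design = np.column_stack([np.ones(covariate.size), covariate])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

# Simulated data: expression that drifts linearly with sequencing depth.
rng = np.random.default_rng(0)
total_umis = rng.uniform(800, 5000, size=50)
expr = 0.001 * total_umis[:, None] + rng.normal(size=(50, 3))
resid = regress_out_covariate(expr, total_umis)
```

The residuals have zero mean per gene and no remaining linear dependence on total_umis.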

dataset_smooth()[source]

Perform intra-regional gene expression smoothing.

For each region, perform smoothing by averaging gene expression over random samples of cells within that region.

Returns:

The smoothed AnnData object.

Return type:

AnnData
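
The exact resampling scheme is not spelled out here. One possible reading, sketched below, replaces each cell's profile with the mean of num_samples cells drawn at random, with replacement, from the same region. The function name, the region labels, and the sampling-with-replacement choice are all assumptions.

```python
import numpy as np

def smooth_within_regions(expr, regions, num_samples=100, seed=0):
    """Replace each cell by the mean of random cells drawn from its own region."""
    expr = np.asarray(expr, dtype=float)
    regions = np.asarray(regions)
    rng = np.random.default_rng(seed)
    smoothed = np.empty_like(expr)
    for region in np.unique(regions):
        idx = np.flatnonzero(regions == region)
        # One row of draws per cell in the region: num_samples cell indices each.
        draws = rng.choice(idx, size=(idx.size, num_samples), replace=True)
        # expr[draws] has shape (cells, num_samples, genes); average over the draws.
        smoothed[idx] = expr[draws].mean(axis=1)
    return smoothed

expr = np.array([[0.0, 10.0], [2.0, 10.0], [100.0, 0.0], [100.0, 4.0]])
regions = ["A", "A", "B", "B"]
smoothed = smooth_within_regions(expr, regions, num_samples=200)
```

Values never leave the range observed within their own region, so region A and region B stay clearly separated after smoothing.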

dataset_scaled(cortical_regions=None, subcortical_regions=None, scale_type='split')[source]

Scale gene expression data, optionally splitting into cortical and subcortical regions.

Parameters:

cortical_regions (list, optional) – List of cortical region names to subset and scale separately.
subcortical_regions (list, optional) – List of subcortical region names to subset and scale separately.
scale_type ({'split', 'global'}, optional) – Scaling method to apply. ‘split’ scales cortical and subcortical regions separately, ‘global’ scales the whole dataset at once. Default is ‘split’.

Returns:

The scaled AnnData object.

Return type:

AnnData
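
The difference between the two scaling modes can be sketched as per-gene z-scoring applied either within each region group or across all cells. The helper names, the use of the sample standard deviation (ddof=1), and clipping only from above at max_value are assumptions made for this illustration.

```python
import numpy as np

def zscore_clip(x, max_value=10):
    """Z-score each gene (column) and clip values above max_value."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes become 0 instead of dividing by zero
    return np.minimum((x - x.mean(axis=0)) / std, max_value)

def scale_expression(expr, regions, cortical_regions, subcortical_regions,
                     scale_type="split", max_value=10):
    expr = np.asarray(expr, dtype=float)
    if scale_type != "split":
        return zscore_clip(expr, max_value)  # one pass over the whole dataset
    regions = np.asarray(regions)
    # Rows belonging to neither list stay NaN in this sketch.
    scaled = np.full_like(expr, np.nan)
    for group in (cortical_regions, subcortical_regions):
        mask = np.isin(regions, group)
        if mask.any():
            scaled[mask] = zscore_clip(expr[mask], max_value)
    return scaled

expr = np.array([[1.0, 10.0], [3.0, 10.0], [100.0, 0.0], [300.0, 4.0]])
regions = ["ctx_1", "ctx_2", "sub_1", "sub_2"]
split = scale_expression(expr, regions, ["ctx_1", "ctx_2"], ["sub_1", "sub_2"])
whole = scale_expression(expr, regions, [], [], scale_type="global")
```

In split mode the large offset between the two groups is removed because each group is centered on its own mean; scaling everything at once keeps that offset.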

preprocess_pipeline(adata=None, min_genes=200, min_cells=3, total_UMIs=800, log_base=2, target_sum=10000.0, exclude_highly_expressed=True, max_fraction=0.05, n_jobs=60, regress_out=True, combat=False, num_samples=100, max_value=10, dataset_path=None, gene_list=None, cortical_regions=None, subcortical_regions=None, scale_type='split', save_path=None, steps=None, **kwargs)[source]

Preprocess single-nucleus RNA-seq data with flexible pipeline steps.

Parameters:

adata (AnnData, optional) – Input AnnData object (.h5ad) containing single-nucleus data. If None, data will be loaded from dataset_path.
min_genes (int, default=200) – Minimum number of genes required per cell. Cells with fewer genes will be filtered out.
min_cells (int, default=3) – Minimum number of cells in which a gene must be detected. Genes detected in fewer cells will be removed.
total_UMIs (int, default=800) – Minimum total UMI counts per cell. Cells below this threshold will be filtered out.
log_base (int or float, default=2) – Base of logarithm used for log transformation.
target_sum (float, default=1e4) – Target sum for count normalization. After normalization, each cell’s counts sum to this value.
exclude_highly_expressed (bool, default=True) – Whether to exclude highly expressed genes during normalization to reduce technical artifacts.
max_fraction (float, default=0.05) – Maximum fraction of counts that can come from a single gene to consider it for exclusion.
n_jobs (int, default=60) – Number of parallel jobs for computation.
regress_out (bool, default=True) – Whether to regress out technical covariates (e.g., total counts).
combat (bool, default=False) – Whether to apply ComBat batch correction.
num_samples (int, default=100) – Number of samples for downsampling/bootstrap during smoothing.
max_value (float, default=10) – Maximum clip threshold for transformed values to avoid extreme outliers.
dataset_path (str, optional) – Path to the single-nucleus dataset in .h5ad format, used if adata is None.
gene_list (list of str, optional) – List of common genes to filter on.
cortical_regions (list of str, optional) – List of cortical region names in the dataset.
subcortical_regions (list of str, optional) – List of subcortical region names in the dataset.
scale_type ({'split', 'all'}, default='split') – Normalization method for scaling the dataset. ‘split’ scales regions separately, ‘all’ scales all regions together.
save_path (str, optional) – File path to save the processed data.
steps (dict of str to bool, optional) – Dictionary controlling execution of processing steps. Keys and defaults:
- ‘qc’ (bool): Quality control (default: True)
- ‘normalize’ (bool): Data normalization (default: True)
- ‘filter_genes’ (bool): Gene filtering (default: True)
- ‘batch_correct’ (bool): Batch effect correction (default: True)
- ‘smooth’ (bool): Data smoothing (default: True)
- ‘scale’ (bool): Region-specific scaling (default: True)
- ‘save’ (bool): Save processed data (default: True)
**kwargs – Additional optional parameters.

Return type:

None
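
A steps dictionary for skipping selected stages might be built as follows. Whether preprocess_pipeline fills in omitted keys with their defaults is not stated here, so this sketch uses a hypothetical helper, resolve_steps, that always produces a complete dictionary and rejects misspelled keys.

```python
# Defaults as listed for the steps parameter.
DEFAULT_STEPS = {
    "qc": True,
    "normalize": True,
    "filter_genes": True,
    "batch_correct": True,
    "smooth": True,
    "scale": True,
    "save": True,
}

def resolve_steps(steps=None):
    """Merge user toggles over the documented defaults (hypothetical helper)."""
    steps = steps or {}
    unknown = set(steps) - set(DEFAULT_STEPS)
    if unknown:
        raise KeyError(f"Unknown step(s): {sorted(unknown)}")
    return {**DEFAULT_STEPS, **steps}

# Skip smoothing and saving, keep everything else.
resolved = resolve_steps({"smooth": False, "save": False})
enabled = [name for name, on in resolved.items() if on]
```

The complete dictionary in resolved could then be passed as the steps argument.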