Transcriptional Data Processing

pipeline.code.dpreprocess.ahba_.extract_AHBA_data(atlas_path, atlas_info_path, lr_mirror='bidirectional', gene_norm='srs', sample_norm='srs', return_report=True, return_counts=True, return_donors=True, norm_matched=False, ibf_threshold=0, region_agg=None)[source]

Extracts Allen Human Brain Atlas (AHBA) gene expression data for the regions of a given brain atlas. For more information, see the abagen documentation: https://abagen.readthedocs.io/en/stable/generated/abagen.get_expression_data.html

Parameters:
  • atlas_path (str) – Path to the human brain atlas NIfTI file.

  • atlas_info_path (str) – Path to a CSV file containing atlas region information with columns: ‘Anatomical Name’ and ‘Atlas Index’.

  • lr_mirror ({'left', 'right', 'bidirectional'}, optional) – How to mirror samples across hemispheres. Default is ‘bidirectional’.

  • gene_norm (str, optional) – Method by which to normalize microarray expression values for each donor. Default is ‘srs’.

  • sample_norm (str, optional) – Method by which to normalize microarray expression values for each sample. Default is ‘srs’.

  • return_report (bool, optional) – Whether to return a string containing longform text describing the processing procedures used to generate the expression DataFrames returned by this function. Default is True.

  • return_counts (bool, optional) – Whether to return a DataFrame containing the number of samples assigned to each parcel in the atlas for each donor. Default is True.

  • return_donors (bool, optional) – Whether to return donor-level expression DataFrames instead of a single DataFrame aggregated across donors. Default is True.

  • norm_matched (bool, optional) – Whether to perform gene normalization (gene_norm) across only those samples matched to regions in atlas instead of all available samples. Default is False.

  • ibf_threshold (float, optional) – Intensity-based filtering threshold. Default is 0.

  • region_agg (str or None, optional) – Mechanism by which to reduce sample-level expression data into region-level expression. Default is None.

Returns:

If return_donors=True, returns a dictionary where keys are donor IDs and values are gene expression DataFrames (regions × genes). If return_donors=False, returns a single aggregated DataFrame.

Return type:

dict[str, pandas.DataFrame] or pandas.DataFrame

Examples

>>> expr_data = extract_AHBA_data(
...     '/path/to/atlas.nii.gz',
...     '/path/to/atlas_info.csv'
... )
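The atlas_info_path argument expects a CSV with the two columns named above. A minimal sketch of producing such a file with the standard library is shown below; the region names and indices are placeholders, and it is an assumption that each ‘Atlas Index’ value should match an integer label in the NIfTI atlas.

```python
import csv
import io

# Hypothetical region table; names and indices are placeholders.
regions = [
    ("Precentral gyrus", 1),
    ("Postcentral gyrus", 2),
    ("Hippocampus", 3),
]


def write_atlas_info(handle, rows):
    """Write (name, index) rows as CSV using the two documented column names."""
    writer = csv.writer(handle)
    writer.writerow(["Anatomical Name", "Atlas Index"])
    writer.writerows(rows)


# Written to an in-memory buffer here; pass an open file handle
# (opened with newline="") to create the file given as atlas_info_path.
buffer = io.StringIO()
write_atlas_info(buffer, regions)
atlas_info_text = buffer.getvalue()
```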

class pipeline.code.dpreprocess.sn_.Preprocess_Sn(adata=None, min_genes=200, min_cells=3, total_UMIs=800, log_base=2, target_sum=10000.0, exclude_highly_expressed=True, max_fraction=0.05, n_jobs=60, regress_out=True, combat=False, num_samples=100, max_value=10)[source]

Bases: object

Preprocessing pipeline for single-nucleus RNA-seq data based on the Scanpy framework.

Implements quality control, normalization, batch correction, gene selection, intra-regional smoothing, and scaling, with optional saving of the processed data.

Parameters:
  • adata (AnnData, optional) – Initial AnnData object containing expression matrix.

  • min_genes (int, optional) – Minimum number of genes expressed per cell (default is 200).

  • min_cells (int, optional) – Minimum number of cells a gene must be expressed in (default is 3).

  • total_UMIs (int, optional) – Minimum total UMI count required per cell (default is 800).

  • log_base (float, optional) – Base of logarithm used in log transformation (default is 2).

  • target_sum (float, optional) – Target total count after normalization (default is 1e4).

  • exclude_highly_expressed (bool, optional) – Whether to exclude highly expressed genes during normalization (default is True).

  • max_fraction (float, optional) – Maximum expression fraction to define ‘highly expressed’ genes (default is 0.05).

  • n_jobs (int, optional) – Number of threads for parallel computation (default is 60).

  • regress_out (bool, optional) – Whether to regress out technical covariates (e.g., total UMIs) (default is True).

  • combat (bool, optional) – Whether to perform batch effect correction using ComBat (default is False).

  • num_samples (int, optional) – Number of resampling iterations for intra-region smoothing (default is 100).

  • max_value (float, optional) – Maximum value for scaling (clipping) of gene expression (default is 10).

dataset_qc()[source]

Perform quality control filtering on the dataset.

This function filters cells and genes based on quality metrics:

  • Calculates total UMIs per cell and adds it to adata.obs.

  • Filters out cells with fewer than min_genes expressed genes.

  • Filters out genes expressed in fewer than min_cells cells.

  • Removes cells with total UMIs below a threshold total_UMIs.

  • Removes genes with zero expression across all remaining cells.

Returns:

The filtered AnnData object.

Return type:

AnnData
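The filtering order described above can be sketched on a plain cells × genes count matrix. This is only an illustration in pure Python, not the class's implementation (which operates on an AnnData object); the function name qc_filter and the list-of-lists input are assumptions made for the example.

```python
def qc_filter(counts, min_genes=200, min_cells=3, total_umis=800):
    """Filter a cells x genes count matrix given as a list of lists.

    Returns (filtered_counts, kept_cell_indices, kept_gene_indices).
    """
    cells = list(range(len(counts)))
    genes = list(range(len(counts[0]))) if counts else []

    # Total UMIs per cell, computed once before any filtering.
    totals = [sum(row) for row in counts]

    # Drop cells expressing fewer than min_genes genes.
    cells = [i for i in cells
             if sum(counts[i][j] > 0 for j in genes) >= min_genes]
    # Drop genes detected in fewer than min_cells of the remaining cells.
    genes = [j for j in genes
             if sum(counts[i][j] > 0 for i in cells) >= min_cells]
    # Drop cells whose total UMI count is below the threshold.
    cells = [i for i in cells if totals[i] >= total_umis]
    # Drop genes with zero expression across all remaining cells.
    genes = [j for j in genes if any(counts[i][j] > 0 for i in cells)]

    filtered = [[counts[i][j] for j in genes] for i in cells]
    return filtered, cells, genes
```

Because each step runs on the output of the previous one, a gene that passes the min_cells filter can still end up all-zero after low-UMI cells are removed, which is why the final zero-expression check is a separate step.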

dataset_normalized()[source]

Normalize the single-nucleus RNA-seq dataset.

This method performs total-count normalization to scale counts per cell to a common target sum, optionally excluding highly expressed genes, followed by log-transformation with a specified logarithm base.

Returns:

The normalized AnnData object.

Return type:

AnnData
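As a rough illustration of the arithmetic, the sketch below applies total-count normalization followed by a log(1 + x) transform in the chosen base. It assumes the behaviour follows Scanpy's normalize_total and log1p conventions, where a gene counts as highly expressed if it exceeds max_fraction of the total counts in at least one cell and is then left out of the size-factor calculation. The function name normalize_log and the list-of-lists input are hypothetical.

```python
import math


def normalize_log(counts, target_sum=1e4, log_base=2,
                  exclude_highly_expressed=True, max_fraction=0.05):
    """Normalize each cell to target_sum, then apply log(1 + x) in log_base."""
    totals = [sum(row) for row in counts]

    # Genes exceeding max_fraction of a cell's total in at least one cell.
    excluded = set()
    if exclude_highly_expressed:
        for row, total in zip(counts, totals):
            for j, value in enumerate(row):
                if total > 0 and value > max_fraction * total:
                    excluded.add(j)

    normalized = []
    for row in counts:
        # Size factor is computed from the non-excluded genes only,
        # but every gene (excluded or not) is rescaled by it.
        size = sum(v for j, v in enumerate(row) if j not in excluded)
        factor = target_sum / size if size > 0 else 0.0
        normalized.append(
            [math.log1p(v * factor) / math.log(log_base) for v in row]
        )
    return normalized
```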

dataset_common_genes(gene_list=None)[source]

Filter the dataset to keep only genes common to the provided gene list.

Parameters:

gene_list (list, optional) – List of gene names to intersect with the dataset’s genes. If None, no filtering is applied (default is None).

Returns:

The filtered AnnData object containing only the common genes.

Return type:

AnnData
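The gene intersection can be sketched as follows. The helper name common_gene_indices is hypothetical, and keeping the dataset's own gene order (rather than the order of gene_list) is an assumption.

```python
def common_gene_indices(dataset_genes, gene_list=None):
    """Return indices of dataset genes that also appear in gene_list.

    With gene_list=None every gene is kept, i.e. no filtering is applied.
    """
    if gene_list is None:
        return list(range(len(dataset_genes)))
    wanted = set(gene_list)  # set lookup keeps the scan linear
    return [i for i, gene in enumerate(dataset_genes) if gene in wanted]


dataset_genes = ["GAD1", "SLC17A7", "PVALB", "SST"]
kept = common_gene_indices(dataset_genes, ["SST", "GAD1", "VIP"])
```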

dataset_batch_correct()[source]

Perform batch effect correction. This includes optional regression of total UMIs (regress_out) and optional batch correction using ComBat (combat).

Returns:

The AnnData object after batch correction.

Return type:

AnnData
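To illustrate what regressing out total UMIs means for a single gene, the sketch below fits an ordinary least-squares line against one covariate and keeps the residuals. This is an assumption about the general technique, not the class's code; the function name regress_out_covariate is hypothetical, and ComBat correction is not shown.

```python
def regress_out_covariate(values, covariate):
    """Return residuals of the least-squares fit values ~ intercept + covariate."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # Constant covariate: only the mean can be removed.
        return [y - mean_y for y in values]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]
```

Applied gene by gene with total UMIs as the covariate, the residuals carry the variation in expression that is not linearly explained by sequencing depth.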

dataset_smooth()[source]

Perform intra-regional gene expression smoothing.

For each region, perform smoothing by averaging gene expression over random samples of cells within that region.

Returns:

The smoothed AnnData object.

Return type:

AnnData
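One possible reading of this smoothing step is sketched below: for each region, draw num_samples random samples of cells and average each sample into one smoothed profile. Sampling with replacement, the sample size, and returning one profile per iteration are all assumptions; smooth_regions, sample_size, and seed are hypothetical names introduced for the example.

```python
import random


def smooth_regions(expression, regions, num_samples=100,
                   sample_size=None, seed=0):
    """Average random samples of cells within each region.

    expression: list of per-cell gene vectors; regions: region label per cell.
    Returns {region: list of num_samples averaged gene vectors}.
    """
    rng = random.Random(seed)

    # Group cell profiles by their region label.
    by_region = {}
    for row, region in zip(expression, regions):
        by_region.setdefault(region, []).append(row)

    smoothed = {}
    for region, rows in by_region.items():
        k = sample_size or len(rows)
        profiles = []
        for _ in range(num_samples):
            sample = rng.choices(rows, k=k)  # drawn with replacement
            profiles.append([sum(col) / k for col in zip(*sample)])
        smoothed[region] = profiles
    return smoothed
```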

dataset_scaled(cortical_regions=None, subcortical_regions=None, scale_type='split')[source]

Scale gene expression data, optionally splitting into cortical and subcortical regions.

Parameters:
  • cortical_regions (list, optional) – List of cortical region names to subset and scale separately.

  • subcortical_regions (list, optional) – List of subcortical region names to subset and scale separately.

  • scale_type ({'split', 'global'}, optional) – Scaling method to apply. ‘split’ scales cortical and subcortical regions separately, ‘global’ scales the whole dataset at once. Default is ‘split’.

Returns:

The scaled AnnData object.

Return type:

AnnData
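The difference between split and whole-dataset scaling can be illustrated with a per-gene z-score that clips values above max_value. This is a sketch under assumptions: the use of the sample standard deviation, upper-only clipping, and the helper names zscore_clip and scale_expression are not taken from the source.

```python
import math


def zscore_clip(matrix, max_value=10):
    """Z-score each gene (column) and clip values above max_value."""
    if not matrix:
        return []
    n = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1) if n > 1 else 0.0
        std = math.sqrt(var) or 1.0  # avoid dividing by zero for flat genes
        scaled_cols.append([min((v - mean) / std, max_value) for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def scale_expression(matrix, regions, cortical_regions, subcortical_regions,
                     scale_type="split", max_value=10):
    """Scale cortical and subcortical rows separately, or all rows at once."""
    if scale_type != "split":
        # Any other value is treated here as scaling the whole dataset.
        return zscore_clip(matrix, max_value)
    out = [None] * len(matrix)  # rows in neither group stay None
    for group in (cortical_regions, subcortical_regions):
        idx = [i for i, region in enumerate(regions) if region in group]
        scaled = zscore_clip([matrix[i] for i in idx], max_value)
        for i, row in zip(idx, scaled):
            out[i] = row
    return out
```

With split scaling, each group is centred on its own mean, so a gene with systematically higher expression in subcortical regions does not push every cortical value below zero.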

preprocess_pipeline(adata=None, min_genes=200, min_cells=3, total_UMIs=800, log_base=2, target_sum=10000.0, exclude_highly_expressed=True, max_fraction=0.05, n_jobs=60, regress_out=True, combat=False, num_samples=100, max_value=10, dataset_path=None, gene_list=None, cortical_regions=None, subcortical_regions=None, scale_type='split', save_path=None, steps=None, **kwargs)[source]

Preprocess single-nucleus RNA-seq data with flexible pipeline steps.

Parameters:
  • adata (AnnData, optional) – Input AnnData object (.h5ad) containing single-nucleus data. If None, data will be loaded from dataset_path.

  • min_genes (int, default=200) – Minimum number of genes required per cell. Cells with fewer genes will be filtered out.

  • min_cells (int, default=3) – Minimum number of cells in which a gene must be detected. Genes detected in fewer cells will be removed.

  • total_UMIs (int, default=800) – Minimum total UMI counts per cell. Cells below this threshold will be filtered out.

  • log_base (int or float, default=2) – Base of logarithm used for log transformation.

  • target_sum (float, default=1e4) – Target sum for count normalization. After normalization, each cell’s counts sum to this value.

  • exclude_highly_expressed (bool, default=True) – Whether to exclude highly expressed genes during normalization to reduce technical artifacts.

  • max_fraction (float, default=0.05) – Maximum fraction of counts that can come from a single gene to consider it for exclusion.

  • n_jobs (int, default=60) – Number of parallel jobs for computation.

  • regress_out (bool, default=True) – Whether to regress out technical covariates (e.g., total counts).

  • combat (bool, default=False) – Whether to apply ComBat batch correction.

  • num_samples (int, default=100) – Number of samples for downsampling/bootstrap during smoothing.

  • max_value (float, default=10) – Maximum clip threshold for transformed values to avoid extreme outliers.

  • dataset_path (str, optional) – Path to the single-nucleus dataset in .h5ad format, used if adata is None.

  • gene_list (list of str, optional) – List of common genes to filter on.

  • cortical_regions (list of str, optional) – List of cortical region names in the dataset.

  • subcortical_regions (list of str, optional) – List of subcortical region names in the dataset.

  • scale_type ({'split', 'all'}, default='split') – Normalization method for scaling the dataset. ‘split’ scales regions separately, ‘all’ scales all regions together.

  • save_path (str, optional) – File path to save the processed data.

  • steps (dict of str to bool, optional) –

    Dictionary controlling execution of processing steps. Keys and defaults:
    • ’qc’ (bool): Quality control (default: True)

    • ’normalize’ (bool): Data normalization (default: True)

    • ’filter_genes’ (bool): Gene filtering (default: True)

    • ’batch_correct’ (bool): Batch effect correction (default: True)

    • ’smooth’ (bool): Data smoothing (default: True)

    • ’scale’ (bool): Region-specific scaling (default: True)

    • ’save’ (bool): Save processed data (default: True)

  • **kwargs – Additional optional parameters.

Return type:

None
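A steps dictionary that disables selected stages might be built as in the sketch below. The helper resolve_steps is hypothetical; whether the real function fills in missing keys with their defaults is an assumption, so passing a complete dictionary is the safer choice. The commented call at the end only shows where the dictionary would go, and its import path and file paths are placeholders.

```python
# Step names and defaults as listed in the parameter description above.
DEFAULT_STEPS = {
    "qc": True,
    "normalize": True,
    "filter_genes": True,
    "batch_correct": True,
    "smooth": True,
    "scale": True,
    "save": True,
}


def resolve_steps(steps=None):
    """Merge user overrides into the documented defaults (hypothetical helper)."""
    resolved = dict(DEFAULT_STEPS)
    if steps:
        unknown = set(steps) - set(DEFAULT_STEPS)
        if unknown:
            raise KeyError(f"Unknown pipeline steps: {sorted(unknown)}")
        resolved.update(steps)
    return resolved


# Skip batch correction and saving; keep every other step enabled.
steps = resolve_steps({"batch_correct": False, "save": False})

# Hypothetical call; the import path and file paths are placeholders.
# preprocess_pipeline(
#     dataset_path="/path/to/dataset.h5ad",
#     gene_list=["GAD1", "SST"],
#     steps=steps,
# )
```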