Data Integration

pipeline.code.dintegrate.utils.stable_gene_filter(AHBA_mean, sn_ex, region_, common_genes)[source]

Identifies genes with consistent spatial patterns between AHBA and snRNA-seq data.

Parameters:
  • AHBA_mean (pd.DataFrame) – Regional mean expression matrix from AHBA data. (regions x genes)

  • sn_ex (pd.DataFrame) – Processed snRNA-seq expression matrix. (smooth_cells x genes)

  • region_ (list of str) – List of brain region names to include in the analysis.

  • common_genes (list of str) – List of gene names present in both AHBA and snRNA-seq datasets.

Returns:

DataFrame with genes and their cross-modality spatial correlation scores

Return type:

pd.DataFrame
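One plausible reading of "consistent spatial patterns" is a per-gene correlation, taken across regions, between the AHBA regional means and regional means of the snRNA-seq data. The sketch below illustrates that idea only. It assumes the snRNA-seq data have already been averaged per region into a hypothetical `sn_mean` frame (regions x genes); the function name, the `spatial_corr` column, and the choice of Pearson correlation are assumptions, and the actual `stable_gene_filter` may score genes differently.

```python
import pandas as pd

def gene_spatial_correlation(ahba_mean, sn_mean, regions, common_genes):
    """Per-gene Pearson correlation across regions (illustrative only)."""
    a = ahba_mean.loc[regions, common_genes]
    s = sn_mean.loc[regions, common_genes]
    # corrwith pairs columns with the same name (genes) and correlates
    # them along the rows (regions).
    scores = a.corrwith(s, axis=0, method="pearson")
    return scores.rename("spatial_corr").rename_axis("gene").reset_index()

# Toy data: G1 follows the same regional gradient in both modalities,
# G2 follows opposite gradients.
regions = ["R1", "R2", "R3", "R4"]
ahba_mean = pd.DataFrame(
    {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [4.0, 3.0, 2.0, 1.0]}, index=regions
)
sn_mean = pd.DataFrame(
    {"G1": [2.0, 4.0, 6.0, 8.0], "G2": [1.0, 2.0, 3.0, 4.0]}, index=regions
)
result = gene_spatial_correlation(ahba_mean, sn_mean, regions, ["G1", "G2"])
```

Genes with a high score in such a table would be candidates for the "stable" set; any cutoff applied to the scores is not specified here.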

pipeline.code.dintegrate.utils.calculate_thresholds(ahba_mean, sn_data, align_csv, percentile=0.1)[source]

Calculate regional correlation thresholds between AHBA and single-nucleus data.

Parameters:
  • ahba_mean (pd.DataFrame) – AHBA mean expression matrix (regions x genes), where each row corresponds to a brain region and each column to a gene.

  • sn_data (pd.DataFrame) – Single-nucleus expression data (smooth_cells x genes), with cell-level expression profiles aligned to brain regions.

  • align_csv (pd.DataFrame) – Alignment table containing two columns: ‘brain_region’ for AHBA region names and ‘sn_region’ for corresponding snRNA-seq region labels.

  • percentile (float, optional) – Top percentile used to define the correlation threshold (default is 0.1, i.e., top 10%).

Returns:

A dictionary mapping each AHBA region (str) to a tuple (correlation, p-value), representing the correlation threshold at the given percentile.

Return type:

dict
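To make the role of percentile concrete, the following sketch computes, for each AHBA region, the similarity value that separates the top fraction of cells. It is a simplified illustration under stated assumptions: every cell is compared with every region (no alignment table is used), similarity is Pearson correlation, and only the correlation cutoff is returned rather than the documented (correlation, p-value) tuple. The name `correlation_cutoffs` is hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_cutoffs(ahba_mean, sn_data, percentile=0.1):
    """Map each AHBA region to the cell-similarity value at the top `percentile`."""
    genes = ahba_mean.columns.intersection(sn_data.columns)
    cutoffs = {}
    for region, profile in ahba_mean[genes].iterrows():
        # Pearson correlation of every cell (row) with the regional mean profile.
        r = sn_data[genes].corrwith(profile, axis=1)
        # The (1 - percentile) quantile separates the top fraction of cells.
        cutoffs[region] = float(r.quantile(1.0 - percentile))
    return cutoffs

# Toy data: 2 regions, 50 cells, 5 genes.
rng = np.random.default_rng(0)
genes = ["G1", "G2", "G3", "G4", "G5"]
ahba_mean = pd.DataFrame(rng.normal(size=(2, 5)), index=["R1", "R2"], columns=genes)
sn_data = pd.DataFrame(rng.normal(size=(50, 5)), columns=genes)
cutoffs = correlation_cutoffs(ahba_mean, sn_data, percentile=0.1)
```

With percentile=0.1 the cutoff is the 90th percentile of the per-cell correlations, so roughly one cell in ten exceeds it.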


class pipeline.code.dintegrate.main.DataIntegrator(config)[source]

Bases: object

Single-nucleus dataset and AHBA dataset integration processor.

Performs weighted integration of Allen Brain Atlas (AHBA) sample expression data with single-nucleus RNA-seq data to construct integrated expression matrices. Supports parallel processing and optimized downsampling.

__init__(config)[source]

Initialize the data integration processor.

Parameters:

config (object) – Configuration object containing parameters for integration.

run()[source]

Execute the integration pipeline.

Return type:

None

Steps:
  • Construct the cell pool if cfg.pool is True.

  • Clean the pool to remove repeatedly assigned cells.

  • Perform weighted integration over multiple iterations.

  • Generate final integrated anndata files.

pool_construction(row)[source]

Build the sampling pool for one region by retaining the smooth cells whose expression similarity to the AHBA regional mean expression ranks in the top 10%.

Parameters:

row (pandas.Series) – A row from the alignment CSV representing a brain region mapping.

Return type:

None

clean_pool()[source]

Screen out repeatedly assigned cells.

Cells that are assigned to more than a configured threshold number of regions are removed from the integration pool.

Return type:

None
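The filtering rule described above can be sketched with a group-and-count over a long-format pool table. The column names `cell_id` and `brain_region`, the function name, and the `max_regions` parameter are assumptions made for illustration; the real pool layout and configuration key are not documented here.

```python
import pandas as pd

def drop_overassigned_cells(pool, max_regions=2):
    """Remove cells that appear in more than `max_regions` distinct regions."""
    # Count the distinct regions each cell was assigned to.
    n_regions = pool.groupby("cell_id")["brain_region"].nunique()
    keep = n_regions[n_regions <= max_regions].index
    return pool[pool["cell_id"].isin(keep)].reset_index(drop=True)

# Toy pool: c1 is assigned to three regions, c2 to one, c3 to two.
pool = pd.DataFrame({
    "cell_id": ["c1", "c1", "c1", "c2", "c3", "c3"],
    "brain_region": ["A", "B", "C", "A", "B", "C"],
})
cleaned = drop_overassigned_cells(pool, max_regions=2)
```

In this toy case only c1 exceeds the limit, so all three of its rows are dropped and the rows for c2 and c3 remain.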

weighted_average(brain_region)[source]

Perform weighted integration of cell expressions from the pool with samples.

Parameters:

brain_region (str) – The brain region for which integration is performed.
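The weighting scheme is not specified in this reference. As a minimal sketch of what a weighted combination of pooled cells can look like, the example below averages cell expression profiles with normalized weights. The helper name and the idea of using similarity scores as weights are assumptions; how the result is combined with AHBA samples in the actual method is not shown.

```python
import numpy as np
import pandas as pd

def weighted_cell_average(cell_expr, weights):
    """Weighted mean expression profile over the rows (cells) of `cell_expr`."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    profile = np.average(cell_expr.to_numpy(), axis=0, weights=w)
    return pd.Series(profile, index=cell_expr.columns)

# Toy data: two cells, two genes; the second cell gets three times the weight.
cell_expr = pd.DataFrame({"G1": [1.0, 3.0], "G2": [10.0, 30.0]}, index=["c1", "c2"])
profile = weighted_cell_average(cell_expr, weights=[1.0, 3.0])
```

Here the normalized weights are 0.25 and 0.75, so G1 averages to 2.5 and G2 to 25.0.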

downsample_data(region_pool)[source]

Downsample cells from the constructed pool.

Parameters:

region_pool (pandas.DataFrame) – DataFrame of pooled cells for a brain region.

Returns:

Downsampled pool DataFrame.

Return type:

pandas.DataFrame
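A simple way to downsample a pooled DataFrame is uniform sampling without replacement. The sketch below assumes that strategy along with a hypothetical `n_cells` target and `seed`; the "optimized downsampling" mentioned for this class may use a different scheme.

```python
import pandas as pd

def downsample_pool(region_pool, n_cells, seed=0):
    """Return at most `n_cells` rows, sampled uniformly without replacement."""
    if len(region_pool) <= n_cells:
        # Nothing to downsample; return a copy so callers can modify it safely.
        return region_pool.copy()
    return region_pool.sample(n=n_cells, replace=False, random_state=seed)

# Toy pool of 10 cells, reduced to 4.
region_pool = pd.DataFrame(
    {"similarity": [0.1 * i for i in range(10)]},
    index=[f"cell{i}" for i in range(10)],
)
small = downsample_pool(region_pool, n_cells=4, seed=0)
```

Fixing the seed keeps the selection reproducible across runs, which matters when the integration is repeated over several iterations.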

generate_integrated_anndata(k_=0)[source]

Generate integrated AnnData object and save as h5ad file.

Reads the integrated data of all brain regions, optionally performs z-score normalization, shuffles samples, and stores the result as an h5ad file.

Parameters:

k_ (int, optional) – Iteration index for output file naming (default is 0).
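The normalization and shuffling steps described above can be sketched on a plain DataFrame. This is an illustration under assumptions: z-scoring is applied per gene (column) with the population standard deviation, the function name and `seed` parameter are hypothetical, and writing the result to disk (for example through `anndata.AnnData(...).write_h5ad(...)`) is left out to keep the example free of extra dependencies.

```python
import pandas as pd

def zscore_and_shuffle(expr, zscore=True, seed=0):
    """Optionally z-score each gene (column), then shuffle the samples (rows)."""
    out = expr.copy()
    if zscore:
        # Replace zero standard deviations with 1 so constant genes become 0
        # instead of producing NaN.
        std = out.std(axis=0, ddof=0).replace(0, 1.0)
        out = (out - out.mean(axis=0)) / std
    # frac=1.0 returns every row exactly once, in random order.
    return out.sample(frac=1.0, random_state=seed)

# Toy integrated matrix: 4 samples x 2 genes.
expr = pd.DataFrame(
    {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [10.0, 10.0, 30.0, 30.0]},
    index=["s1", "s2", "s3", "s4"],
)
shuffled = zscore_and_shuffle(expr, zscore=True, seed=0)
```

Shuffling only reorders rows, so each sample keeps its own label and values.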