Data Integration
- pipeline.code.dintegrate.utils.stable_gene_filter(AHBA_mean, sn_ex, region_, common_genes)[source]
Identifies genes with consistent spatial patterns between AHBA and snRNA-seq data.
- Parameters:
AHBA_mean (pd.DataFrame) – Regional mean expression matrix from AHBA data. (regions x genes)
sn_ex (pd.DataFrame) – Processed snRNA-seq expression (smooth_cells x genes)
region_ (list of str) – List of brain region names to include in the analysis.
common_genes (list of str) – List of gene names present in both AHBA and snRNA-seq datasets.
- Returns:
DataFrame with genes and their cross-modality spatial correlation scores
- Return type:
pd.DataFrame
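As an illustration of what a cross-modality spatial correlation score could look like, the sketch below computes a per-gene Pearson correlation across regions. It assumes the snRNA-seq data has already been averaged per region (the `sn_region_mean` argument is hypothetical), and `gene_spatial_correlation` is an illustrative helper, not the pipeline's actual implementation of `stable_gene_filter`.

```python
import pandas as pd

def gene_spatial_correlation(ahba_mean, sn_region_mean, regions, genes):
    """Hypothetical helper: per-gene Pearson correlation across regions."""
    a = ahba_mean.loc[regions, genes]
    s = sn_region_mean.loc[regions, genes]
    # corrwith pairs up identically named columns (genes) and correlates
    # them along the rows (regions).
    r = a.corrwith(s, axis=0)
    return pd.DataFrame({"gene": r.index, "correlation": r.values})

# Toy data: 4 regions x 2 genes (values are made up for illustration).
regions = ["R1", "R2", "R3", "R4"]
genes = ["G1", "G2"]
ahba = pd.DataFrame([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]],
                    index=regions, columns=genes)
sn = pd.DataFrame([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]],
                  index=regions, columns=genes)

scores = gene_spatial_correlation(ahba, sn, regions, genes)
# G1 varies in the same direction in both modalities, G2 in opposite directions.
```

Genes with a high score would be the candidates for "consistent spatial patterns"; the cutoff actually used by the pipeline is not stated here.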
- pipeline.code.dintegrate.utils.calculate_thresholds(ahba_mean, sn_data, align_csv, percentile=0.1)[source]
Calculate regional correlation thresholds between AHBA and single-nucleus data.
- Parameters:
ahba_mean (pd.DataFrame) – AHBA mean expression matrix (regions x genes), where each row corresponds to a brain region and each column to a gene.
sn_data (pd.DataFrame) – Single-nucleus expression data (smooth_cells x genes), with cell-level expression profiles aligned to brain regions.
align_csv (pd.DataFrame) – Alignment table containing two columns: ‘brain_region’ for AHBA region names and ‘sn_region’ for corresponding snRNA-seq region labels.
percentile (float, optional) – Top percentile used to define the correlation threshold (default is 0.1, i.e., top 10%).
- Returns:
A dictionary mapping each AHBA region (str) to a tuple (correlation, p-value), representing the correlation threshold at the given percentile.
- Return type:
dict
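The "top percentile" idea can be sketched for a single region as follows. This is a minimal example under stated assumptions: each cell is scored by its Pearson correlation with the regional mean vector, and the threshold is the quantile that leaves the top `percentile` fraction of cells above it. The name `correlation_threshold` is hypothetical, and the p-value that the documented return tuple contains is not computed here.

```python
import numpy as np

def correlation_threshold(region_mean, cells, percentile=0.1):
    """Hypothetical helper: correlation cutoff for the top `percentile` of cells."""
    # Pearson correlation of every cell (row) with the regional mean vector.
    cm = cells - cells.mean(axis=1, keepdims=True)
    rm = region_mean - region_mean.mean()
    corr = (cm @ rm) / (np.linalg.norm(cm, axis=1) * np.linalg.norm(rm))
    # The top 10% of cells lie above the 90th percentile of the correlations.
    threshold = float(np.quantile(corr, 1.0 - percentile))
    return threshold, corr

# Synthetic data: 200 cells x 50 genes.
rng = np.random.default_rng(0)
cells = rng.normal(size=(200, 50))
region_mean = rng.normal(size=50)

threshold, corr = correlation_threshold(region_mean, cells, percentile=0.1)
n_selected = int((corr >= threshold).sum())  # roughly 10% of the cells
```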
- class pipeline.code.dintegrate.main.DataIntegrator(config)[source]
Bases: object
Single-nucleus dataset and AHBA dataset integration processor.
Performs weighted integration of Allen Human Brain Atlas (AHBA) sample expression data with single-nucleus RNA-seq data to construct integrated expression matrices. Supports parallel processing and optimized downsampling.
- __init__(config)[source]
Initialize the data integration processor.
- Parameters:
config (object) – Configuration object containing parameters for integration.
- run()[source]
Execute the integration pipeline.
- Return type:
None
- Steps:
Construct the cell pool if cfg.pool is True.
Clean the pool to remove repeatedly assigned cells.
Perform weighted integration over multiple iterations.
Generate final integrated anndata files.
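The order of the steps above can be sketched with a stand-in class that only records which method is called when. Everything beyond the method names is an assumption: `n_iter`, `alignment_rows`, the per-region loop, and writing one output per iteration are hypothetical choices, and the real `DataIntegrator` may organize these calls differently.

```python
class SketchIntegrator:
    """Stand-in that records call order; not the real DataIntegrator."""

    def __init__(self, pool, n_iter, alignment_rows):
        self.pool = pool                      # mirrors cfg.pool in the Steps list
        self.n_iter = n_iter                  # hypothetical iteration count
        self.alignment_rows = alignment_rows  # hypothetical: one dict per region mapping
        self.calls = []

    def pool_construction(self, row):
        self.calls.append(("pool_construction", row["brain_region"]))

    def clean_pool(self):
        self.calls.append(("clean_pool",))

    def weighted_average(self, brain_region):
        self.calls.append(("weighted_average", brain_region))

    def generate_integrated_anndata(self, k_=0):
        self.calls.append(("generate_integrated_anndata", k_))

    def run(self):
        # Step 1: build the cell pool only when requested.
        if self.pool:
            for row in self.alignment_rows:
                self.pool_construction(row)
        # Step 2: drop cells assigned to too many regions.
        self.clean_pool()
        # Steps 3-4: integrate every region, then write one file per iteration
        # (assumed layout; the source only says "multiple iterations").
        for k in range(self.n_iter):
            for row in self.alignment_rows:
                self.weighted_average(row["brain_region"])
            self.generate_integrated_anndata(k_=k)

integ = SketchIntegrator(pool=True, n_iter=2, alignment_rows=[{"brain_region": "A"}])
integ.run()
```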
- pool_construction(row)[source]
Retain, for the sampling pool, the smooth cells whose expression similarity to the AHBA regional mean expression ranks in the top 10%.
- Parameters:
row (pandas.Series) – A row from the alignment CSV representing a brain region mapping.
- Return type:
None
- clean_pool()[source]
Screen out repeatedly assigned cells.
Cells that are assigned to more than a configured threshold number of regions are removed from the integration pool.
- Return type:
None
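A minimal sketch of this screening rule, assuming the pool can be represented as a mapping from region name to a list of cell identifiers. The function name `remove_repeated_cells` and the `max_regions` parameter are hypothetical; the pipeline's own data structures and configuration key are not shown in this reference.

```python
from collections import Counter

def remove_repeated_cells(pool, max_regions):
    """Hypothetical helper: drop cells assigned to more than `max_regions` regions."""
    # Count in how many regions each cell appears (once per region at most).
    counts = Counter(cell for cells in pool.values() for cell in set(cells))
    return {
        region: [c for c in cells if counts[c] <= max_regions]
        for region, cells in pool.items()
    }

pool = {"A": ["c1", "c2"], "B": ["c1", "c3"], "C": ["c1", "c2"]}
cleaned = remove_repeated_cells(pool, max_regions=2)
# c1 appears in three regions and is removed; c2 (two regions) is kept.
```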
- weighted_average(brain_region)[source]
Perform weighted integration of cell expressions from the pool with samples.
- Parameters:
brain_region (str) – The brain region for which integration is performed.
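The weighting scheme is not specified in this reference. One plausible reading, shown purely as an assumption, is to weight each pooled cell by its (non-negative) correlation with an AHBA sample and take the weighted mean of the cell profiles. `weighted_profile` is a hypothetical helper, not the method's actual implementation.

```python
import numpy as np

def weighted_profile(sample, cells):
    """Hypothetical helper: correlation-weighted mean of pooled cell profiles."""
    sample = np.asarray(sample, dtype=float)
    cells = np.asarray(cells, dtype=float)
    # Similarity of each cell to the sample; negative correlations get weight 0.
    corr = np.array([np.corrcoef(sample, c)[0, 1] for c in cells])
    w = np.clip(corr, 0.0, None)
    if w.sum() == 0:
        raise ValueError("no cell is positively correlated with the sample")
    return np.average(cells, axis=0, weights=w), w / w.sum()

sample = [1.0, 2.0, 3.0]
cells = [[1.0, 2.0, 3.0],   # perfectly correlated with the sample
         [3.0, 2.0, 1.0],   # anti-correlated, so it receives zero weight
         [2.0, 4.0, 6.0]]   # perfectly correlated with the sample
profile, weights = weighted_profile(sample, cells)
```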
- downsample_data(region_pool)[source]
Downsample cells from the constructed pool.
- Parameters:
region_pool (pandas.DataFrame) – DataFrame of pooled cells for a brain region.
- Returns:
Downsampled pool DataFrame.
- Return type:
pandas.DataFrame
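For orientation, a plain uniform downsampling of a pool DataFrame can be written with `DataFrame.sample`. This is a simplified sketch: the `max_cells` and `seed` parameters are hypothetical, and the "optimized downsampling" mentioned in the class description may use a different strategy.

```python
import pandas as pd

def downsample_pool(region_pool, max_cells, seed=0):
    """Hypothetical helper: keep at most `max_cells` randomly chosen rows."""
    if len(region_pool) <= max_cells:
        return region_pool
    # Sample rows without replacement; a fixed seed makes the result reproducible.
    return region_pool.sample(n=max_cells, random_state=seed)

region_pool = pd.DataFrame(
    {"G1": [0.1, 0.4, 0.3, 0.9, 0.5], "G2": [1.2, 0.8, 0.7, 0.2, 0.6]},
    index=["c1", "c2", "c3", "c4", "c5"],
)
small = downsample_pool(region_pool, max_cells=3)
```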
- generate_integrated_anndata(k_=0)[source]
Generate integrated AnnData object and save as h5ad file.
Reads the integrated data of all brain regions, optionally performs z-score normalization, shuffles samples, and stores the result as an h5ad file.
- Parameters:
k_ (int, optional) – Iteration index for output file naming (default is 0).
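The normalization and shuffling steps can be sketched on a plain NumPy matrix. The example assumes that z-scoring is applied per gene (column) and that constant genes are left at zero; both are assumptions, as is the helper name `zscore_and_shuffle`. Writing the result to disk is omitted; with the anndata package this is typically done through `AnnData.write_h5ad`.

```python
import numpy as np

def zscore_and_shuffle(X, labels, zscore=True, seed=0):
    """Hypothetical helper: optional per-gene z-score, then shuffle samples."""
    X = np.asarray(X, dtype=float)
    if zscore:
        std = X.std(axis=0)
        std[std == 0] = 1.0          # avoid division by zero for constant genes
        X = (X - X.mean(axis=0)) / std
    # Apply the same random permutation to the rows and their labels.
    order = np.random.default_rng(seed).permutation(X.shape[0])
    return X[order], [labels[i] for i in order]

X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]   # 3 samples x 2 genes
labels = ["A", "A", "B"]                      # region label of each sample
Xz, shuffled_labels = zscore_and_shuffle(X, labels)
```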