Build Figure 1 Cache

Run this notebook once after helpers/survey.ipynb (i.e., after you have gtex_data/gene_expr.pkl and gtex_data/sample_meta.csv).

It slices the full gene_expr.pkl (all GTEx genes) down to the 20 display genes and saves the result as gtex_data/figure1_expr_cache.pkl.gz (~500 KB).

That small file is committed to the repo. figure1.ipynb loads it and then runs dabest.combine() live, so the bootstraps, whorlmap, and all downstream statistics are computed fresh rather than pre-baked.

Runtime here: < 1 s. The 10–15 min bootstrap is deferred to figure1.ipynb.

Code
import pickle, gzip, pathlib
import pandas as pd
Code
# The 20 display genes shown in Figure 1.
RIGHT_GENES = [
    'TPH2', 'CHRNA7', 'ESR1',
    'TH', 'SLC6A3', 'DDC', 'AGRP',
    'SST', 'PENK', 'GAD1', 'GAD2', 'CRH',
    'DRD5', 'CHRM1', 'BDNF', 'CYP19A1',
    'AIF1', 'MAOA', 'FKBP5', 'GFAP',
]
# Short display label -> GTEx tissue name.
BRAIN_REGIONS = {
    'Hypothalamus':      'Brain - Hypothalamus',
    'Amygdala':          'Brain - Amygdala',
    'Hippocampus':       'Brain - Hippocampus',
    'Ant. Cing. Ctx':    'Brain - Anterior cingulate cortex (BA24)',
    'Frontal Cortex':    'Brain - Frontal Cortex (BA9)',
    'Cortex':            'Brain - Cortex',
    'Caudate':           'Brain - Caudate (basal ganglia)',
    'Putamen':           'Brain - Putamen (basal ganglia)',
    'Nucleus Accumbens': 'Brain - Nucleus accumbens (basal ganglia)',
    'Cerebellum':        'Brain - Cerebellum',
    'Cerebellar Hemi.':  'Brain - Cerebellar Hemisphere',
    'Substantia Nigra':  'Brain - Substantia nigra',
    'Spinal Cord':       'Brain - Spinal cord (cervical c-1)',
}
region_order = list(BRAIN_REGIONS.keys())
DATA_DIR     = pathlib.Path('gtex_data')

1. Load raw data

Code
with open(DATA_DIR / 'gene_expr.pkl', 'rb') as f:
    gene_expr = pickle.load(f)
sample_meta = pd.read_csv(DATA_DIR / 'sample_meta.csv', index_col=0)
print(f'Full gene_expr: {len(gene_expr)} genes  |  sample_meta: {len(sample_meta)} samples')

2. Slice to the 20 display genes

gene_expr contains every GTEx gene (~56 000 entries), not only the protein-coding ones. We only need the 20 neuroactive / glial genes shown in Figure 1.

Code
expr_subset = {gene: gene_expr[gene] for gene in RIGHT_GENES}
n_samples   = len(next(iter(expr_subset.values())))
print(f'Sliced to {len(expr_subset)} genes  x  {n_samples} brain-region samples')
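The dict comprehension above raises a KeyError on the first gene symbol that is absent from gene_expr, which hides any other missing symbols. A minimal sketch of a stricter slice that reports all of them at once; `slice_genes` and the toy data are hypothetical names for illustration, not part of this notebook:

```python
def slice_genes(gene_expr, wanted):
    """Return {gene: values} for `wanted`, or raise one KeyError listing every missing gene."""
    missing = [g for g in wanted if g not in gene_expr]
    if missing:
        raise KeyError(f'Genes missing from gene_expr: {missing}')
    # Preserve the order of `wanted` so downstream plots keep the display order.
    return {g: gene_expr[g] for g in wanted}

# Toy stand-in for gene_expr; sample IDs and values are made up.
toy_expr = {
    'TH':   {'S1': 1.0, 'S2': 2.5},
    'GFAP': {'S1': 7.2, 'S2': 6.9},
    'ACTB': {'S1': 9.1, 'S2': 9.3},
}
subset = slice_genes(toy_expr, ['TH', 'GFAP'])
```

In the notebook this would be called as `slice_genes(gene_expr, RIGHT_GENES)`.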

3. Save cache

Code
cache = {
    'expr_subset':  expr_subset,   # {gene_name: {sample_id: log2(tpm+1)}}
    'gene_names':   RIGHT_GENES,
    'region_order': region_order,
}

cache_path = DATA_DIR / 'figure1_expr_cache.pkl.gz'
with gzip.open(cache_path, 'wb') as f:
    pickle.dump(cache, f, protocol=5)

size_kb = cache_path.stat().st_size / 1e3
print(f'Saved {cache_path}  ({size_kb:.0f} KB)  —  safe to commit directly.')
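The comment in the cache dict says the stored values are log2(tpm+1). As a small worked example of that transform and its inverse (the helper names are hypothetical, and this assumes the comment describes the values accurately):

```python
import math

def to_log2_tpm(tpm):
    """Forward transform: log2(TPM + 1). A TPM of 0 maps to 0."""
    return math.log2(tpm + 1)

def from_log2_tpm(value):
    """Inverse transform: recover TPM from a log2(TPM + 1) value."""
    return 2 ** value - 1

zero_expr  = to_log2_tpm(0)     # no expression stays at 0
three_tpm  = to_log2_tpm(3)     # log2(4)
round_trip = from_log2_tpm(to_log2_tpm(7))
```

The +1 offset keeps unexpressed genes finite at 0 instead of negative infinity.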

Cache ready. Open figure1.ipynb — it will load this file and run dabest.combine() live.
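For reference, reading the cache back could look roughly like the sketch below. `load_cache` is a hypothetical helper, not code taken from figure1.ipynb, and the demo writes a toy cache to a temporary directory rather than touching gtex_data/.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

def load_cache(path):
    """Read a gzip-compressed pickle and check it has the three keys written in step 3."""
    with gzip.open(path, 'rb') as f:
        cache = pickle.load(f)
    expected = {'expr_subset', 'gene_names', 'region_order'}
    missing = expected - cache.keys()
    if missing:
        raise ValueError(f'Cache is missing keys: {sorted(missing)}')
    return cache

# Round-trip demo with made-up values.
toy_cache = {
    'expr_subset':  {'TH': {'S1': 1.0, 'S2': 2.5}},
    'gene_names':   ['TH'],
    'region_order': ['Hypothalamus', 'Amygdala'],
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy_cache.pkl.gz'
    with gzip.open(toy_path, 'wb') as f:
        pickle.dump(toy_cache, f, protocol=5)
    loaded = load_cache(toy_path)
```

Because unpickling can execute arbitrary code, only load cache files from a source you trust, such as this repo.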