LOCAT Tutorial: gene localization in PBMC3K#

Overview#

This tutorial walks through the full LOCAT workflow on the PBMC3K dataset, a RNA-Seq dataset of 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor (10x Genomics). LOCAT identifies genes whose expression is spatially structured in the cell landscape, focusing in particular on those that are concentrated in a specific region and depleted elsewhere.

That means we will:

Load the preprocessed PBMC3K object from scanpy.datasets.pbmc3k_processed()
Extract the raw data matrix and filter genes
Run Locat to score every gene for spatial concentration and depletion
Describe the results of the LOCAT analysis
Visualize the top concentrated and top concentrated-and-depleted genes

import numpy as np
import pandas as pd
import scanpy as sc

from locat.locat import LOCAT
from locat.postprocessing import build_locat_df
from locat.preprocessing import filter_genes, get_embedding
from locat.plotting_and_other_methods import plot_grid

LOCAT uses JAX for GPU-accelerated computation.

If a GPU is available it will be used automatically. If a GPU is not detected, JAX runs on the CPU, which is slower but provides the same results.

import jax
import locat

print(f"locat version:  {locat.__version__}")
print(f"JAX version:    {jax.__version__}")
print(f"JAX backend:    {jax.default_backend()}")
print(f"JAX devices:    {jax.devices()}")

locat version:  0.1.5.dev3+ge9c813317.d20260404
JAX version:    0.6.2
JAX backend:    gpu
JAX devices:    [CudaDevice(id=0)]

Preprocessing#

We use the Scanpy PBMC3K dataset: 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor (10x Genomics), covering eight immune cell types — CD4+ and CD8+ T cells, B cells, NK cells, CD14+ and FCGR3A+ monocytes, dendritic cells, and megakaryocytes. After quality control, the object contains 2,638 cells.

We use pbmc3k_processed because it is a well-known benchmark dataset and it stores a PCA we want to reuse as the LOCAT embedding. We will not use the Scanpy-processed gene matrix — instead we recover the raw counts from .raw and apply our own minimal filtering, so that LOCAT can score the full gene universe rather than just the highly variable genes.

The function sc.datasets.pbmc3k_processed() will download the dataset to the file data/pbmc3k_processed.h5ad if not already existing. If you wish to change the location where the file is stored, run sc.settings.datasetdir = Path("/data/location/").

adata_pbmc3k_processed = sc.datasets.pbmc3k_processed()
adata_pbmc3k = adata_pbmc3k_processed.raw.to_adata()
adata_pbmc3k

AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    obsp: 'distances', 'connectivities'

Next, we apply filter_genes to remove two classes of genes that carry little localization signal:

Pseudogene-like symbols matched by name patterns (e.g. Gm*, *Rik, Linc*, ribosomal duplicates) — these are typically unannotated or non-coding loci
Very lowly expressed genes present in fewer than 1% of cells — too noisy and sparse for being biologically relevant since they are only expressed in a hanful of cells

n_before = adata_pbmc3k.n_vars
adata_locat = filter_genes(adata_pbmc3k)
print(f"Genes retained: {adata_locat.n_vars} / {n_before}")

Genes retained: 8859 / 13714

Run LOCAT#

LOCAT fits a Gaussian mixture model (GMM) to the cell embedding and scores each gene by comparing how expressing cells are distributed in space against the background distribution of all cells. It returns a locat_df table with one row per gene.

LOCAT(...) — model setup

cell_embedding: the PCA coordinates used as the cell space — here the first 8 PCs, which capture the main structure of the PBMC populations
knn_mode="connectivity": use the Scanpy neighbor graph edge weights (soft connectivity) rather than binary presence/absence, giving a smoother local density estimate
n_bootstrap_inits=10: number of random GMM initialisations per bootstrap sample — reduced from the default (50) for speed; increase for a more thorough analysis
reg_covar=1e-6: small value added to the GMM covariance diagonal for numerical stability

gmm_scan(...) — gene scoring

rc_lambda_values: the grid of concentration factors scanned by the depletion test — a coarser grid (linspace(1, 2, 8)) is used here for speed; revert to default for a more thorough analysis

model = LOCAT(
    adata=adata_locat,
    cell_embedding=get_embedding(adata_locat, n_dims=8),
    n_bootstrap_inits=10,
    show_progress=True,
    knn_mode="connectivity",
    reg_covar=1e-6,
)

locat_df = build_locat_df(model.gmm_scan(
    rc_lambda_values=np.linspace(1.0, 2.0, 8),
))

2026-04-06 10:22:03.752 | INFO     | locat.locat:background_pdf:438 - fitting background PDF
2026-04-06 10:22:03.752 | INFO     | locat.locat:background_n_components_init:208 - Estimating number of GMM components
estimating BIC for 2638 cells: 100%|██████████| 9/9 [00:06<00:00,  1.35it/s]
estimating BIC for 51 cells: 100%|██████████| 30/30 [00:00<00:00, 35.61it/s]
2026-04-06 10:22:11.501 | INFO     | locat.locat:background_pdf:450 - Using 5 components
fitting background: 100%|██████████| 10/10 [00:00<00:00, 71.69it/s]
null distribution parameters (perm. pseudo-genes):   0%|          | 0/7 [00:00<?, ?it/s]2026-04-06 10:22:11.781 | INFO     | locat.locat:min_dist:144 - recomputing min cell-cell distance
2026-04-06 10:22:11.782 | INFO     | locat.locat:cell_dist:136 - recomputing cell-cell distance
null distribution parameters (perm. pseudo-genes): 100%|██████████| 7/7 [00:05<00:00,  1.22it/s]
scanning genes: 100%|██████████| 8859/8859 [01:31<00:00, 96.45it/s] 

Results#

build_locat_df converts the raw LOCAT output into a tidy DataFrame with one row per gene. Genes are sorted by combined p-value so the most significant appear first.

Concentration (is a gene clustered in space?)

zscore: how much more concentrated the gene’s expressing cells are compared to the background GMM — higher means more clustered
concentration_pval: p-value for the concentration test
is_conc_sig: True if concentration_pval < 0.05

Depletion (is a gene absent from some regions?)

depletion_pval: p-value for the depletion test — tests whether the gene is significantly under-represented in at least one spatial region
max_deficit: the largest gap between the expected and observed fraction of expressing cells across all scanned regions — a direct measure of depletion strength
is_depl_sig: True if depletion_pval < 0.05

Combined

pval: combined p-value integrating concentration and depletion signals
is_joint_sig: True if pval < 0.05

All columns are described in the LocatResult documentation.

locat_df.head(10)

	gene_name	bic	zscore	sens_score	depletion_pval	concentration_pval	h_size	h_sens	pval	K_components	sample_size	max_deficit	is_conc_sig	is_depl_sig	is_joint_sig
gene
RNASE2	RNASE2	5.9579	7.328457	1.0	0.0	0.0	0.016529	0.0	0.000826	2	59.0	0.991961	True	True	True
RP5-887A10.1	RP5-887A10.1	-3208566051.172291	3.904195	1.0	0.0	0.000047	0.027396	0.0	0.00137	2	35.0	0.9801	True	True	True
ICAM4	ICAM4	8.699026	3.790885	1.0	0.0	0.000075	0.027396	0.0	0.00137	2	35.0	0.96951	True	True	True
BACE2	BACE2	-5889761859.259194	3.591428	1.0	0.0	0.000164	0.028983	0.0	0.001449	2	33.0	0.980948	True	True	True
RP11-428G5.5	RP11-428G5.5	-11592630.901236	3.46296	1.0	0.0	0.000267	0.030767	0.0	0.001538	2	31.0	0.983796	True	True	True
MYOF	MYOF	9.55608	3.367864	1.0	0.0	0.000379	0.030767	0.0	0.001538	2	31.0	0.994927	True	True	True
FXYD6	FXYD6	9.660164	3.364742	1.0	0.0	0.000383	0.030767	0.0	0.001538	2	31.0	0.993377	True	True	True
OSBPL10	OSBPL10	8.539526	3.326886	1.0	0.0	0.000439	0.031743	0.0	0.001587	2	30.0	0.972077	True	True	True
ARHGEF10L	ARHGEF10L	10.056512	3.044539	1.0	0.0	0.001165	0.032784	0.0	0.001639	2	29.0	0.993925	True	True	True
COBLL1	COBLL1	-199.530195	2.99084	1.0	0.0	0.001391	0.035084	0.0	0.001754	2	27.0	0.980676	True	True	True

Next, we count significant genes (p < 0.05) in each category.

summary = pd.Series({
    "genes_scored": int(locat_df.shape[0]),
    "conc_sig":     int(locat_df["is_conc_sig"].sum()),
    "depl_sig":     int(locat_df["is_depl_sig"].sum()),
    "joint_sig":    int(locat_df["is_joint_sig"].sum()),
    "conc_only":    int((locat_df["is_conc_sig"] & ~locat_df["is_depl_sig"]).sum()),
    "conc_depl":    int((locat_df["is_conc_sig"] &  locat_df["is_depl_sig"]).sum()),
})
display(summary.to_frame("count"))

	count
genes_scored	8754
conc_sig	1827
depl_sig	8625
joint_sig	541
conc_only	90
conc_depl	1737

Top genes#

A key feature of LOCAT is that it identifies genes that are not only concentrated in a specific region but also depleted elsewhere. Genes that are also depleted tend to be more specific markers of that region.

We select the top 6 genes from each category and visualize their expression on the UMAP.

Concentration-only genes are significantly enriched in a specific region but are not absent from others. These tend to mark a cell type or state without being strictly excluded elsewhere. We rank by concentration strength (zscore).

genes_conc_only = (locat_df[locat_df["is_conc_sig"] & ~locat_df["is_depl_sig"]]
    .sort_values(["zscore", "max_deficit", "concentration_pval"], ascending=[False, False, True])
    .index[:6])
locat_df.loc[genes_conc_only, ["zscore", "concentration_pval", "depletion_pval", "max_deficit", "sample_size"]]

	zscore	concentration_pval	depletion_pval	max_deficit	sample_size
gene
IL7R	12.4369	0.0	0.269713	0.073749	1062.0
CXCR4	10.344442	0.0	0.109393	0.120904	1585.0
LCK	10.005871	0.0	1.0	-0.038012	953.0
MAL	9.787817	0.0	1.0	-0.013983	353.0
CD27	9.176478	0.0	1.0	-0.15102	632.0
PRKCQ-AS1	8.890096	0.0	1.0	-0.117591	445.0

Concentration-plus-depletion genes are both enriched in one region and absent from others — the strongest signature of spatial localization. We rank by depletion strength (max_deficit), which measures how large the gap is between where the gene is expected and where it is actually expressed.

genes_conc_depl = (locat_df[locat_df["is_conc_sig"] & locat_df["is_depl_sig"]]
    .sort_values(["max_deficit", "depletion_pval", "concentration_pval"], ascending=[False, True, True])
    .index[:6])
locat_df.loc[genes_conc_depl, ["zscore", "concentration_pval", "depletion_pval", "max_deficit", "sample_size"]]

	zscore	concentration_pval	depletion_pval	max_deficit	sample_size
gene
VMO1	3.141256	0.000841	0.0	0.998259	29.0
MYOF	3.367864	0.000379	0.0	0.994927	31.0
TNNT1	3.126928	0.000883	0.0	0.994738	32.0
ARHGEF10L	3.044539	0.001165	0.0	0.993925	29.0
FXYD6	3.364742	0.000383	0.0	0.993377	31.0
RNASE2	7.328457	0.0	0.0	0.991961	59.0

fig = plot_grid(
    adata_locat,
    gene_lists=[genes_conc_only, genes_conc_depl],
    titles=["Top conc-only", "Top conc + depleted"],
)

../_images/bc9b6bf98228d5e59ce92bd40be0af153c78a1315645f9b4d8868605c17d4236.png

Exporting results#

The results table can be exported as a CSV and the figure as an SVG.

locat_df[["zscore", "concentration_pval", "depletion_pval", "pval", "max_deficit", "sample_size", "K_components"]]\
    .to_csv("PBMC_locat_results.csv")

fig.savefig("locat_top_genes.svg", bbox_inches="tight")