locat.preprocessing module

locat.preprocessing.DEFAULT_PSEUDO_PATTERNS = ['^Gm\\d+', 'Rik$', '^AC\\d+', '^AA\\d+', '^A[0-9]{6,}', '^Mir\\d+', '^Rpl\\d*-\\d+', '^Rps\\d*-\\d+', '^Linc']

Default regex patterns for pseudogene-like gene symbols.
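To illustrate what these patterns catch, here is a minimal sketch using only the standard library. It assumes the patterns are applied case-sensitively with re.search semantics, which the end-anchored 'Rik$' pattern suggests but this page does not state. The helper is_pseudo_like and the example symbols are hypothetical and not part of locat.

```python
import re

# Copied from the documented DEFAULT_PSEUDO_PATTERNS value (raw-string form).
DEFAULT_PSEUDO_PATTERNS = [
    r"^Gm\d+", r"Rik$", r"^AC\d+", r"^AA\d+", r"^A[0-9]{6,}",
    r"^Mir\d+", r"^Rpl\d*-\d+", r"^Rps\d*-\d+", r"^Linc",
]


def is_pseudo_like(symbol, patterns=DEFAULT_PSEUDO_PATTERNS):
    """Hypothetical helper: True if any pattern is found in the symbol.

    Assumes re.search (match anywhere, anchors honoured) and
    case-sensitive matching; the library may behave differently.
    """
    return any(re.search(p, symbol) for p in patterns)


# Illustrative symbols only.
symbols = ["Gm12345", "2310022B05Rik", "Mir21a", "Linc00152", "Actb", "Rpl13a"]
flagged = [s for s in symbols if is_pseudo_like(s)]
retained = [s for s in symbols if not is_pseudo_like(s)]
```

Under these assumptions, "Rpl13a" is not flagged, because '^Rpl\d*-\d+' requires a hyphen followed by digits, and "Actb" is not flagged, because '^AC\d+' needs an uppercase "C" followed by digits.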

locat.preprocessing.filter_genes(adata, pseudo_patterns=None, min_cell_frac=0.01)

Filter genes from an AnnData object.

Removes genes whose symbols match any pattern in pseudo_patterns, as well as genes expressed in less than a fraction min_cell_frac of cells.

Parameters:
adata:

AnnData object whose .var_names are gene symbols and .X is an expression matrix (dense or sparse).

pseudo_patterns:

List of regex patterns identifying pseudogene-like symbols. Defaults to DEFAULT_PSEUDO_PATTERNS.

min_cell_frac:

Minimum fraction of cells in which a gene must be expressed (X > 0) to be retained. Default is 0.01 (1%).

Returns:
AnnData

A filtered copy of adata.
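The two filtering criteria can be sketched on a plain NumPy matrix as follows. This is not the library's implementation: keep_mask is a hypothetical helper, the matrix is assumed dense with cells as rows and genes as columns, and pattern matching is assumed to use re.search.

```python
import re

import numpy as np


def keep_mask(var_names, X, pseudo_patterns, min_cell_frac=0.01):
    """Hypothetical helper: boolean mask over genes, True = retained."""
    X = np.asarray(X)
    # Fraction of cells (rows) in which each gene (column) has X > 0.
    expr_frac = (X > 0).mean(axis=0)
    # A gene passes the symbol filter if no pattern matches its name.
    not_pseudo = np.array(
        [not any(re.search(p, g) for p in pseudo_patterns) for g in var_names],
        dtype=bool,
    )
    return not_pseudo & (expr_frac >= min_cell_frac)


# Toy data: 200 cells, 3 genes (names are illustrative).
genes = ["Actb", "Gm12345", "Rare1"]
X = np.zeros((200, 3))
X[:, 0] = 1.0  # Actb: expressed in every cell
X[:, 1] = 1.0  # Gm12345: expressed everywhere, but matches a pattern
X[0, 2] = 1.0  # Rare1: expressed in 1 of 200 cells (0.5%)

mask = keep_mask(genes, X, [r"^Gm\d+"], min_cell_frac=0.01)
kept = [g for g, keep in zip(genes, mask) if keep]
```

With an actual AnnData object, a mask like this would typically be applied as adata[:, mask].copy(); filter_genes itself returns the filtered copy directly.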

locat.preprocessing.get_embedding(adata, key='X_pca', n_dims=None)

Extract a cell embedding from an AnnData object as a float64 array.

Parameters:
adata:

AnnData object with the embedding stored in obsm.

key:

Key in adata.obsm to extract (default: "X_pca").

n_dims:

Number of dimensions to keep. If None, all dimensions are returned.

Returns:
np.ndarray

2-D float64 array of shape (n_cells, n_dims). If n_dims is None, the second axis has as many columns as the stored embedding.
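The documented behaviour can be approximated with a short NumPy sketch. The helper embedding_from_obsm is hypothetical and takes a plain dict in place of adata.obsm. It also assumes that "keeping" n_dims dimensions means taking the leading columns, which this page does not specify.

```python
import numpy as np


def embedding_from_obsm(obsm, key="X_pca", n_dims=None):
    """Hypothetical helper mirroring the documented get_embedding contract."""
    # Convert to float64 regardless of the stored dtype.
    emb = np.asarray(obsm[key], dtype=np.float64)
    if n_dims is not None:
        # Assumption: keep the first n_dims columns.
        emb = emb[:, :n_dims]
    return emb


# Stand-in for adata.obsm: 5 cells, 10 components stored as float32.
rng = np.random.default_rng(0)
obsm = {"X_pca": rng.normal(size=(5, 10)).astype(np.float32)}

emb_all = embedding_from_obsm(obsm)              # all 10 dimensions
emb_top3 = embedding_from_obsm(obsm, n_dims=3)   # first 3 dimensions only
```

Converting to float64 up front avoids mixed-precision surprises in downstream distance or density computations when the embedding was stored as float32.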