
Creates synthetic multi-modal single-cell data following the generative model X = U * S * t(V) (a matrix product, written U %*% S %*% t(V) in R), where U is a sample-specific cell embedding matrix, S is a diagonal singular value matrix, and V is a sample-specific feature loading matrix. The number of modalities (1 to 3) is determined by the length of n_features. The simulation generates 20 samples in 2 conditions (10 per condition), with 10 latent cell types.
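As a rough illustration of the generative model described above, the following base-R sketch builds one low-rank matrix for a single sample. It is not the package's implementation: the matrix sizes and the plain Gaussian draws for U and V are assumptions made only for this example.

```r
# Illustrative sketch only: one low-rank expression matrix X = U S t(V).
# n_cells and n_feat are made-up sizes; Gaussian U and V are assumptions.
set.seed(1)
n_cells <- 50
n_feat  <- 30
r       <- 10                                   # latent rank

U <- matrix(rnorm(n_cells * r), nrow = n_cells, ncol = r)  # cell embeddings
S <- diag(seq(5, 1, length.out = r))                       # diagonal singular values
V <- matrix(rnorm(n_feat * r), nrow = n_feat, ncol = r)    # feature loadings

X <- U %*% S %*% t(V)                           # cells x features
dim(X)
```

Because S is diagonal, each latent dimension contributes to X in proportion to its singular value, so larger entries of S correspond to stronger latent factors.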

Usage

simulate_multimodal_data(
  simulation_type = "DC",
  n_features = c(1000, 800, 600),
  signal_prop = 0.2,
  fold_change = 2,
  seed = 1,
  rescale = TRUE,
  sample_noise_level = 0.5,
  feature_noise_level = 0.1,
  cell_noise_level = 0.1,
  nonlinear = FALSE,
  singular_values = NULL,
  add_batch = TRUE
)

Arguments

simulation_type

Character. "DC" for differential connectivity or "DE" for differential expression. Default: "DC".

n_features

Integer vector specifying the number of features per modality. Its length determines the number of modalities (1 to 3). Default: c(1000, 800, 600) (3 modalities).

signal_prop

Numeric between 0 and 1. Proportion of features affected by DC or DE signal in each modality. Default: 0.2.

fold_change

Numeric. Fold change applied to affected features in condition B. Only used when simulation_type = "DE". Default: 2.

seed

Integer. Random seed for reproducibility. Default: 1.

rescale

Logical. For DC simulation only: if TRUE (default), rescale mean expression of DC features in condition B to match condition A, ensuring a pure connectivity signal without abundance change.
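One way to picture the rescaling step is a per-feature multiplicative correction, sketched below in base R. This is an assumption about the mechanism, not the package's code: the toy matrices expr_A and expr_B and the use of a ratio of column means are hypothetical.

```r
# Illustrative sketch only: match per-feature means of condition B to condition A.
# expr_A / expr_B are toy, strictly positive matrices (cells x features).
set.seed(1)
expr_A <- matrix(rgamma(200, shape = 2), nrow = 50, ncol = 4)
expr_B <- matrix(rgamma(200, shape = 2, rate = 0.5), nrow = 50, ncol = 4)

# Assumed mechanism: scale each feature in B by mean(A) / mean(B)
scale_factor    <- colMeans(expr_A) / colMeans(expr_B)
expr_B_rescaled <- sweep(expr_B, 2, scale_factor, "*")

# Means now agree, while feature-feature correlations within B are untouched
isTRUE(all.equal(colMeans(expr_B_rescaled), colMeans(expr_A)))
isTRUE(all.equal(cor(expr_B_rescaled), cor(expr_B)))
```

Scaling a feature by a positive constant leaves its correlations with other features unchanged, which is why a correction of this kind removes an abundance difference without disturbing the connectivity signal.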

sample_noise_level

Numeric. Standard deviation of Gaussian noise added to each sample's feature loading matrix. Controls between-sample variability. Default: 0.5.

feature_noise_level

Numeric. Standard deviation of Gaussian noise around feature cluster centers in the loading space. Controls how tight feature clusters are. Default: 0.1.

cell_noise_level

Numeric. Standard deviation of Gaussian noise around cell type cluster centers in the cell embedding space. Controls cell type separation. Default: 0.1.
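The role of cell_noise_level can be sketched as Gaussian scatter around cell type centers in the latent space. The snippet below is a minimal illustration under assumed sizes and an assumed Gaussian draw for the centers; feature_noise_level plays the analogous role for feature clusters in the loading space.

```r
# Illustrative sketch only: cells scattered around cell type centers.
# n_cells and the Gaussian centers are assumptions for the example.
set.seed(1)
n_types <- 10
r       <- 10
n_cells <- 200
cell_noise_level <- 0.1

centers   <- matrix(rnorm(n_types * r), nrow = n_types, ncol = r)  # one row per cell type
cell_type <- sample(n_types, n_cells, replace = TRUE)              # true label per cell

noise <- matrix(rnorm(n_cells * r, sd = cell_noise_level), nrow = n_cells, ncol = r)
U     <- centers[cell_type, , drop = FALSE] + noise                # cells x r embedding

sd(U - centers[cell_type, , drop = FALSE])  # close to cell_noise_level
```

A smaller noise level keeps cells tightly packed around their center, so cell types are easier to separate; a larger one blurs the boundaries between types.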

nonlinear

Logical. If TRUE, apply a sigmoid transformation (1 / (1 + exp(-x))) to the expression matrices after the linear generative step. Simulates nonlinear gene regulation. Default: FALSE.
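For reference, the sigmoid quoted above can be written directly in base R. The helper name sigmoid is introduced here only for illustration; base R's plogis computes the same function.

```r
# Illustrative sketch only: the logistic sigmoid applied elementwise.
sigmoid <- function(x) 1 / (1 + exp(-x))   # hypothetical helper name

x <- matrix(c(-5, -1, 0, 1, 5, 10), nrow = 2)
y <- sigmoid(x)

range(y)                                   # all values fall strictly between 0 and 1
isTRUE(all.equal(y, plogis(x)))            # matches base R's logistic function
```

The transformation is monotone and bounded, so it preserves the ordering of values within a matrix while compressing large magnitudes toward 0 or 1.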

singular_values

Numeric vector of length r (latent rank, default 10) specifying the diagonal entries of the singular value matrix. If NULL (default), uses seq(5, 1, length.out = 10).
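A short base-R sketch of what the default and a custom singular_values vector look like. The variable names are chosen for this example; the only requirement taken from the description above is that the vector has one entry per latent dimension.

```r
# Illustrative sketch only: default vs. custom singular values for rank r = 10.
r <- 10
default_sv <- seq(5, 1, length.out = r)    # evenly spaced from 5 down to 1
custom_sv  <- seq(10, 1, length.out = r)   # steeper decay, as in the Examples

stopifnot(length(custom_sv) == r)          # one value per latent dimension
S_custom <- diag(custom_sv)                # r x r diagonal matrix
dim(S_custom)
```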

add_batch

Logical. If TRUE (default), add per-feature batch-specific scale and shift effects across 4 batches.
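A per-feature scale and shift effect of the kind described above can be sketched as follows. The distributions used for the batch parameters and the helper apply_batch are assumptions for illustration, not the package's actual choices.

```r
# Illustrative sketch only: per-feature, per-batch scale and shift.
# The uniform/Gaussian draws and apply_batch() are hypothetical.
set.seed(1)
n_batches <- 4
n_feat    <- 6
n_cells   <- 5

X <- matrix(rnorm(n_cells * n_feat), nrow = n_cells, ncol = n_feat)  # cells x features

batch_scale <- matrix(runif(n_batches * n_feat, 0.8, 1.2), nrow = n_batches)
batch_shift <- matrix(rnorm(n_batches * n_feat, sd = 0.1), nrow = n_batches)

# Multiply each feature by its batch-specific scale, then add its shift
apply_batch <- function(X, b) {
  scaled <- sweep(X, 2, batch_scale[b, ], "*")
  sweep(scaled, 2, batch_shift[b, ], "+")
}

X_b2 <- apply_batch(X, 2)   # the same cells as they would look in batch 2
```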

Value

A list with:

seurat_list

A named list of Seurat objects, one per modality (named "modality_1", "modality_2", etc.). Each uses assay "originalexp". Feature metadata includes feature_cluster (integer 1-10) and is_de (logical). Cell metadata includes sample_id, condition ("A" or "B"), batch, and true_cell_type.

raw_matrix_list

A named list with one element per modality. Each element is itself a list of 20 raw expression matrices (cells x features), one per sample.

sample_metadata

Data frame with columns sample_id, condition, and batch.

Details

Two simulation modes are supported:

  • DC (Differential Connectivity): Feature loadings in condition B are permuted across clusters for a fraction of features, rewiring their inter-feature relationships while optionally preserving mean expression.

  • DE (Differential Expression): A multiplicative fold change is applied to a fraction of features in condition B, altering abundance without changing connectivity.
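The contrast between the two modes can be sketched on a toy matrix. This is a simplified illustration, not the package's implementation: it ignores feature clusters, uses a cyclic swap of loading vectors as a stand-in for the permutation, and all sizes are made up.

```r
# Illustrative sketch only: DC rewires loadings, DE rescales abundance.
set.seed(1)
n_feat <- 20; r <- 3; n_cells <- 40
signal_prop <- 0.2
fold_change <- 2

U   <- matrix(rnorm(n_cells * r), nrow = n_cells, ncol = r)   # cell embeddings
V_A <- matrix(rnorm(n_feat * r), nrow = n_feat, ncol = r)     # loadings, condition A
X_A <- U %*% t(V_A)

affected <- sample(n_feat, size = round(signal_prop * n_feat))  # features carrying signal

# DC: affected features exchange loading vectors (cyclic swap as a stand-in)
V_B <- V_A
V_B[affected, ] <- V_A[c(affected[-1], affected[1]), ]
X_B_dc <- U %*% t(V_B)

# DE: affected features are multiplied by the fold change; loadings are untouched
X_B_de <- X_A
X_B_de[, affected] <- fold_change * X_A[, affected]

isTRUE(all.equal(cor(X_B_de), cor(X_A)))   # DE leaves feature correlations unchanged
```

In the DC branch the unaffected features are identical between conditions, while the affected ones change which other features they co-vary with. In the DE branch only the magnitude of the affected features changes.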

The returned Seurat objects include per-feature metadata (feature_cluster and is_de) accessible via seurat_obj[["originalexp"]][[]]$feature_cluster, which provides ground truth for benchmarking.

See also

run_MOSAIC to run the MOSAIC pipeline on the simulated data.

Examples

if (FALSE) { # \dontrun{
# --- Simulate 2-modality DC data with nonlinearity ---
sim <- simulate_multimodal_data(
  simulation_type = "DC",
  n_features = c(1000, 800),
  signal_prop = 0.2,
  seed = 42,
  nonlinear = TRUE
)

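# Ground truth for modality 1; this example reads the is_de metadata
# column even though the simulation type is "DC"
is_dc <- sim$seurat_list[[1]][["originalexp"]][[]]$is_de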

# Run MOSAIC
result <- run_MOSAIC(
  sim$seurat_list,
  assays = rep("originalexp", 2),
  sample_meta = "sample_id",
  condition_meta = "condition"
)

# --- Simulate 3-modality DE data ---
sim_de <- simulate_multimodal_data(
  simulation_type = "DE",
  n_features = c(1000, 800, 600),
  signal_prop = 0.3,
  fold_change = 3,
  seed = 123
)

# --- Simulate 1-modality with custom singular values ---
sim_1mod <- simulate_multimodal_data(
  n_features = c(500),
  singular_values = seq(10, 1, length.out = 10),
  seed = 1
)
} # }