Generate Simulated Multi-Study Factor Analysis Data — gendata_simu

Generate simulated data for multi-study factor analysis under different error distributions. The data follows a factor model with common factors (shared across studies) and study-specific factors (unique to each study), plus noise.

gendata_simu_multi(
  seed = 1,
  nvec = c(100, 300),
  p = 50,
  q = 3,
  qs = rep(2, length(nvec)),
  err.type = c("gaussian", "mvt", "exp", "t", "mixnorm", "pareto"),
  rho = c(1, 1),
  sigma2_eps = 0.1,
  nu = 1
)

Arguments

seed

Integer, default = 1. Random seed for reproducibility of simulated data.

nvec

Numeric vector (length >= 2), default = `c(100, 300)`. Sample size of each study (e.g., `c(150, 200)` for 2 studies with 150 and 200 samples).

p

Integer, default = 50. Number of variables (features) in the data.

q

Integer, default = 3. Number of common factors (shared across all studies).

qs

Numeric vector with length equal to `length(nvec)`, default = `rep(2, length(nvec))`. Number of study-specific factors for each study (e.g., `c(2,2)` for 2 studies each with 2 specific factors).

err.type

Character, default = "gaussian". Error distribution type, one of:

- "gaussian": Gaussian (normal) distribution;

- "mvt": Multivariate t-distribution;

- "exp": Exponential distribution (centered to mean 0);

- "t": Univariate t-distribution (independent across variables);

- "mixnorm": Mixture of two normal distributions;

- "pareto": Pareto distribution (centered to mean 0).
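The following base R sketch illustrates what some of these error types look like in practice. It is not the package's internal generator: the helper name `draw_noise`, the mixture weights, the default `nu = 3`, and the scaling by `sqrt(sigma2_eps)` are all assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): draw an n x p noise matrix
# for a few of the error types listed above.
draw_noise <- function(n, p,
                       err.type = c("gaussian", "exp", "t", "mixnorm"),
                       sigma2_eps = 0.1, nu = 3) {
  err.type <- match.arg(err.type)
  z <- switch(err.type,
    gaussian = rnorm(n * p),
    exp      = rexp(n * p, rate = 1) - 1,  # Exp(1) has mean 1, so subtract 1 to center
    t        = rt(n * p, df = nu),         # independent univariate t draws
    mixnorm  = {
      # Assumed mixture: 80% N(0, 1) and 20% N(0, 5^2)
      heavy <- rbinom(n * p, size = 1, prob = 0.2) == 1
      ifelse(heavy, rnorm(n * p, sd = 5), rnorm(n * p, sd = 1))
    }
  )
  # Assumed scaling: multiply the raw draws by sqrt(sigma2_eps)
  matrix(sqrt(sigma2_eps) * z, nrow = n, ncol = p)
}

set.seed(1)
eps <- draw_noise(2000, 5, err.type = "exp")
dim(eps)                  # 2000 rows, 5 columns
round(colMeans(eps), 2)   # column means close to 0 after centering
```

Centering matters for the skewed types ("exp", "pareto"): without it, the noise would shift the mean of the data away from `mu0`.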

rho

Numeric vector of length 2, default = `c(1, 1)`. Scaling factors for the loadings:

- `rho[1]`: common factor loadings (matrix `A0`);

- `rho[2]`: study-specific factor loadings (matrix list `Blist0`).

sigma2_eps

Numeric, default = 0.1. Variance of the error term (controls noise level).

nu

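Integer, default = 1. Degrees of freedom for the t-distribution, used only when `err.type` is "mvt" or "t" and ignored otherwise. Note that a t-distribution with 1 degree of freedom is a Cauchy distribution, which has no finite mean or variance, so the default gives very heavy-tailed noise.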
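To get a feel for how strongly `nu` controls tail weight, the short base R snippet below compares the 97.5% quantile of the t-distribution for a few degrees of freedom with the Gaussian quantile. It uses only `qt()` and `qnorm()` from the stats package and does not call the simulation function; the chosen values of `nu` are arbitrary.

```r
# 97.5% quantiles: t-distribution for several degrees of freedom vs. Gaussian
dfs <- c(1, 3, 10, 30)
tail_q <- c(qt(0.975, df = dfs), qnorm(0.975))
names(tail_q) <- c(paste0("t, nu = ", dfs), "gaussian")
round(tail_q, 3)
```

The quantile shrinks toward the Gaussian value as `nu` grows, so small `nu` is the setting to use when testing robustness to outliers.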

Value

A list containing the simulated data and true parameter values (for model evaluation):

Xlist: List of matrices. Each element is a data matrix (ns × p) for study s, where ns = `nvec[s]` (sample size of study s), p = number of variables.
mu0: Matrix (p × S). True means, with one row per variable and one column per study, where S = `length(nvec)` is the number of studies.
A0: Matrix (p × q). True common factor loadings (shared across all studies), constructed from the first q columns of an orthogonal matrix (`A1`) generated internally and scaled by `rho[1]`. This is the "ground truth" that modeling functions (e.g., MultiRFM) aim to estimate.
Blist0: List of matrices. Each element is a true study-specific factor loadings matrix (p × qs[s]) for study s. Constructed from orthogonal matrices (similar to `A0`) and scaled by `rho[2]`. Another "ground truth" for model evaluation.
Flist: List of matrices. Each element is a true common factor score matrix (ns × q) for study s, generated from a standard normal distribution. These are the latent common factor values used to generate `Xlist`.
Hlist: List of matrices. Each element is a true study-specific factor score matrix (ns × qs[s]) for study s, generated from a standard normal distribution. Latent specific factor values used to generate `Xlist`.
q: Integer. Number of common factors used for data generation (same as input `q`, for reference).
qs: Numeric vector. Number of study-specific factors used for data generation (same as input `qs`, for reference).

Details

The simulated data follows the multi-study factor model:

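$$X_s = 1_{n_s} \mu_{0s}^\top + F_s A_0^\top + H_s B_{0s}^\top + \varepsilon_s, \quad s = 1, \dots, S,$$

where $X_s$ is `Xlist[[s]]` (ns × p), $1_{n_s}$ is a vector of ns ones, $\mu_{0s}$ is column s of `mu0`, $F_s$ is `Flist[[s]]` (ns × q), $A_0$ is `A0` (p × q), $H_s$ is `Hlist[[s]]` (ns × qs[s]), $B_{0s}$ is `Blist0[[s]]` (p × qs[s]), and $\varepsilon_s$ is the ns × p noise matrix drawn according to `err.type`.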

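The true loading matrices (`A0`, `Blist0`) are built from columns of orthogonal matrices, which imposes the orthogonality constraints needed for identifiability. The mean matrix `mu0` is not subject to such a constraint.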
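The model above can be sketched in base R for a single study. This is a minimal illustration under stated assumptions, not the function's actual implementation: the loadings are taken from a QR decomposition of a random Gaussian matrix, the means are drawn from a standard normal, the noise is Gaussian, and no `rho` scaling is applied.

```r
set.seed(1)
n <- 100; p <- 50; q <- 3; qs <- 2
sigma2_eps <- 0.1

# Assumed construction: orthonormal p x (q + qs) basis from a QR decomposition
Q  <- qr.Q(qr(matrix(rnorm(p * (q + qs)), p, q + qs)))
A0 <- Q[, 1:q, drop = FALSE]                # common loadings, p x q
B0 <- Q[, (q + 1):(q + qs), drop = FALSE]   # study-specific loadings, p x qs

mu0 <- rnorm(p)                             # assumed: standard normal means
Fs  <- matrix(rnorm(n * q),  n, q)          # common factor scores, n x q
Hs  <- matrix(rnorm(n * qs), n, qs)         # study-specific factor scores, n x qs
eps <- matrix(rnorm(n * p, sd = sqrt(sigma2_eps)), n, p)  # Gaussian noise

# Data matrix: each row is mu0 + A0 f + B0 h + noise (n x p)
Xs <- matrix(mu0, n, p, byrow = TRUE) + Fs %*% t(A0) + Hs %*% t(B0) + eps

dim(Xs)
max(abs(crossprod(A0) - diag(q)))   # near 0: columns of A0 are orthonormal
max(abs(crossprod(A0, B0)))         # near 0: A0 and B0 are mutually orthogonal
```

Taking `A0` and `B0` from the same orthonormal basis keeps the common and study-specific loading spaces orthogonal, which is one simple way to keep the two sets of factors separately identifiable.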

Author

Wei Liu

Examples

# Example 1: Gaussian error (2 studies, 100/200 samples, 50 variables)
set.seed(123)
sim_data <- gendata_simu_multi(
  seed = 123,
  nvec = c(100, 200),
  p = 50,
  q = 3,          # 3 common factors
  qs = c(2, 2),   # 2 specific factors per study
  err.type = "gaussian",
  rho = c(1, 1),
  sigma2_eps = 0.1
)
str(sim_data)  # Check structure of simulated data
#> List of 8
#>  $ Xlist :List of 2
#>   ..$ : num [1:100, 1:50] 0.322 -0.524 1.629 -2.444 -1.823 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>   ..$ : num [1:200, 1:50] -1.086 0.464 -0.245 -1.001 -0.08 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>  $ mu0   : num [1:50, 1:2] -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
#>  $ A0    : num [1:50, 1:3] 0.496 -0.18 0.172 0.243 0.665 ...
#>  $ Blist0:List of 2
#>   ..$ : num [1:50, 1:2] 0.131 0.1408 0.1362 -0.0422 -0.4109 ...
#>   ..$ : num [1:50, 1:2] 0.2264 -0.0467 -0.4008 0.2853 0.0203 ...
#>  $ Flist :List of 2
#>   ..$ : num [1:100, 1:3] -0.6042 0.9134 1.6575 -0.0389 -0.3313 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>   ..$ : num [1:200, 1:3] -0.0731 -0.1409 -0.7031 0.2109 0.797 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>  $ Hlist :List of 2
#>   ..$ : num [1:100, 1:2] 0.25352 0.24352 0.71959 0.77273 -0.00852 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>   ..$ : num [1:200, 1:2] -1.9131 -0.3297 -0.9296 0.1346 0.0831 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : NULL
#>  $ q     : num 3
#>  $ qs    : num [1:2] 2 2

# Extract true parameters for model evaluation
true_A <- sim_data$A0        # True common loadings
true_B1 <- sim_data$Blist0[[1]]  # True specific loadings (study 1)
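Because factor loadings are identified only up to rotation, an estimate is usually compared with the true loadings through the column spaces they span rather than entry by entry. The sketch below computes a trace statistic for that purpose. The helper `trace_stat` is hypothetical (not part of the package), and a randomly generated orthonormal matrix stands in for `sim_data$A0` so that the snippet runs on its own.

```r
# Hypothetical helper: share of A_true explained by the column space of A_hat.
# Equals 1 when both span the same space and decreases toward 0 otherwise.
trace_stat <- function(A_hat, A_true) {
  # Projection of A_true onto the column space of A_hat
  proj <- A_hat %*% solve(crossprod(A_hat), crossprod(A_hat, A_true))
  sum(diag(crossprod(A_true, proj))) / sum(diag(crossprod(A_true)))
}

set.seed(1)
true_A  <- qr.Q(qr(matrix(rnorm(50 * 3), 50, 3)))   # stand-in for sim_data$A0
R       <- qr.Q(qr(matrix(rnorm(9), 3, 3)))         # arbitrary 3 x 3 orthogonal matrix
A_rot   <- true_A %*% R                             # same column space, different basis
A_noisy <- A_rot + matrix(rnorm(50 * 3, sd = 0.05), 50, 3)  # perturbed "estimate"

trace_stat(A_rot, true_A)     # 1 up to rounding error
trace_stat(A_noisy, true_A)   # below 1
```

In a real evaluation, `A_hat` would be the loading estimate returned by the fitted model and `A_true` would be `true_A` or `true_B1` from the simulated data.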