MMGFM: simulation1

This vignette introduces the usage of MMGFM for the analysis of high-dimensional multi-study multi-modality data with additional covariates, by comparison with other methods.

The package can be loaded with the command:

library(MMGFM)
#> Loading required package: irlba
#> Loading required package: Matrix
#> MMGFM :  We introduce a generalized factor model designed to jointly analyze high-dimensional multi-modality data from multiple studies by extracting study-shared and specified factors. Our factor models account for heterogeneous noises and overdispersion among modality variables with augmented  covariates. We propose an efficient and speedy variational estimation procedure for estimating model parameters, along with a novel criterion for selecting the optimal number of  factors. More details can be referred to Liu et al. (2024) <doi:10.48550/arXiv.2408.10542>.

Next, we load some functions for subsequent use.

source("https://raw.githubusercontent.com/feiyoung/MMGFM/refs/heads/main/simu_code/definedFunc.R")
#> Loading required package: doSNOW
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: snow
#> Loading required package: parallel
#> 
#> Attaching package: 'parallel'
#> The following objects are masked from 'package:snow':
#> 
#>     closeNode, clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
#>     parCapply, parLapply, parRapply, parSapply, recvData, recvOneData,
#>     sendData, splitIndices, stopCluster
#> GFM :  Generalized factor model is implemented for ultra-high dimensional data with mixed-type variables.
#> Two algorithms, variational EM and alternate maximization, are designed to implement the generalized factor model,
#> respectively. The factor matrix and loading matrix together with the number of factors can be well estimated.
#> This model can be employed in social and behavioral sciences, economy and finance, and  genomics,
#> to extract interpretable nonlinear factors. More details can be referred to
#> Wei Liu, Huazhen Lin, Shurong Zheng and Jin Liu. (2021) <doi:10.1080/01621459.2021.1999818>.   Check out our Package website (https://feiyoung.github.io/GFM/docs/index.html) for a more complete description of the methods and analyses

Generate the simulated data

First, we generate the simulated data from three data sources and three modalities, with each modality consisting of count variables.

N <- 100
q <- 3
qsvec  <- rep(2,3)
sigma_eps <- 1
datlist <- gendata_mmgfm(seed = 1,  nvec = c(300, 200, 100),  
                        pveclist = list('poisson'=c(50, 150, 200)),
                        q = q,  d= 3,qs = qsvec,  rho = 2, rho_z=0.5,
                        sigmavec=1,  sigma_eps=sigma_eps)
XList <- datlist$XList
max(unlist(XList))
#> [1] 76909083
print(str(XList))
#> List of 3
#>  $ :List of 1
#>   ..$ : int [1:300, 1:400] 0 2 1 0 4 12 48 296 911 36 ...
#>  $ :List of 1
#>   ..$ : int [1:200, 1:400] 0 50 11 25 127 1 13 485 0 0 ...
#>  $ :List of 1
#>   ..$ : int [1:100, 1:400] 32 1 48 50 1 602 2 0 40 166 ...
#> NULL
ZList <- datlist$ZList # covariates
print(head(ZList[[1]]))
#>      [,1]        [,2]       [,3]
#> [1,]    1  1.44901647  1.1020410
#> [2,]    1  0.67563588  1.3507231
#> [3,]    1  0.41854494  1.0426304
#> [4,]    1  1.03062242 -0.5053867
#> [5,]    1  0.50238638  1.3168351
#> [6,]    1 -0.03052584 -0.2719483
tauList <- datlist$tauList # offset term
numvarmat <- datlist$numvarmat

Fit the MMGFM model using the function MMGFM() in the R package MMGFM. Users can use ?MMGFM to see the details about this function

system.time({
  tic <- proc.time()
  reslist <- MMGFM(XList, ZList=ZList, numvarmat, q=q, qsvec = qsvec, init='MSFRVI')
  toc <- proc.time()
  time_MMGFM <- toc[3] - tic[3]
})
#> Data include 3 studies/sources
#> variables belong to 1 types: poisson
#> Initialization...
#> Initialization using MSFR...
#> Finish Initialization!
#> Initialize the paramters related to s!
#> Initialize the paramters not related to s!
#> iter = 2, ELBO= 11729754929.875647, dELBO=1.000000 
#> iter = 3, ELBO= 11729771481.147783, dELBO=0.000001 
#> iter = 4, ELBO= 11729781373.544777, dELBO=0.000001 
#> iter = 5, ELBO= 11729788256.960657, dELBO=0.000001 
#> iter = 6, ELBO= 11729793446.962976, dELBO=0.000000 
#> iter = 7, ELBO= 11729797561.288710, dELBO=0.000000 
#> iter = 8, ELBO= 11729800893.069698, dELBO=0.000000 
#> iter = 9, ELBO= 11729803634.156563, dELBO=0.000000 
#> iter = 10, ELBO= 11729805950.517712, dELBO=0.000000 
#> iter = 11, ELBO= 11729807974.774397, dELBO=0.000000 
#> iter = 12, ELBO= 11729809783.381897, dELBO=0.000000 
#> iter = 13, ELBO= 11729811396.493315, dELBO=0.000000 
#> iter = 14, ELBO= 11729812805.426003, dELBO=0.000000 
#> iter = 15, ELBO= 11729814005.241312, dELBO=0.000000 
#> iter = 16, ELBO= 11729815009.644564, dELBO=0.000000 
#> iter = 17, ELBO= 11729815847.666649, dELBO=0.000000 
#> iter = 18, ELBO= 11729816554.466454, dELBO=0.000000 
#> iter = 19, ELBO= 11729817164.732288, dELBO=0.000000 
#> iter = 20, ELBO= 11729817710.103460, dELBO=0.000000 
#> iter = 21, ELBO= 11729818218.928200, dELBO=0.000000 
#> iter = 22, ELBO= 11729818716.410471, dELBO=0.000000 
#> iter = 23, ELBO= 11729819223.703918, dELBO=0.000000 
#> iter = 24, ELBO= 11729819755.263786, dELBO=0.000000 
#> iter = 25, ELBO= 11729820315.107479, dELBO=0.000000 
#> iter = 26, ELBO= 11729820894.376608, dELBO=0.000000 
#> iter = 27, ELBO= 11729821473.087643, dELBO=0.000000 
#> iter = 28, ELBO= 11729822026.580221, dELBO=0.000000 
#> iter = 29, ELBO= 11729822533.493935, dELBO=0.000000 
#> iter = 30, ELBO= 11729822981.000059, dELBO=0.000000
#>    user  system elapsed 
#>    5.11    0.39    1.79

Check the increased property of the envidence lower bound function.

library(ggplot2)
library(scales)
dat_iter <- data.frame(iter=1:length(reslist$ELBO_seq), ELBO=reslist$ELBO_seq)
ggplot(data=dat_iter, aes(x=iter, y=ELBO)) + geom_line() + geom_point() + theme_bw(base_size = 20) +
  scale_y_continuous(labels = label_scientific(digits = 8))

We calculate the metrics to measure the estimation accuracy, where the mean trace statistic is used to measure the estimation accuracy of loading matrix and prediction accuracy of factor matrix, which is evaluated by the function measurefun() in the R package GFM, and the root of mean absolute error is adopted to measure the estimation error of beta.


methodNames <- c("MMGFM", "GFM", "MRRR", "MSFR", "MultiCOAP")
n_methods <- length(methodNames)
metricList <- list(F_tr = rep(NA, n_methods), 
                   H_tr = rep(NA,  n_methods), 
                   V_tr = rep(NA,  n_methods), 
                   A_tr = rep(NA, n_methods),
                   B_tr = rep(NA,  n_methods),
                   beta_norm=rep(NA,  n_methods),
                   time = rep(NA,  n_methods))
for(ii in seq_along(metricList)) names(metricList[[ii]]) <- methodNames
metricList$F_tr[1] <- meanTr(reslist$hF, datlist$F0List)
metricList$H_tr[1] <-meanTr(reslist$hH, datlist$H0List)
metricList$V_tr[1] <-meanTr(lapply(reslist$hv, function(x) Reduce(cbind,x) ), datlist$VList)
metricList$A_tr[1] <-metric_mean(AList=reslist$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[1] <- mean(ms_metric_mean(reslist$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[1] <-normvec(Reduce(cbind, reslist$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[1] <- reslist$time.use

Compare with other methods

We compare MMGFM with various prominent methods in the literature. They are (1) Generalized factor model (Liu et al. 2023) implemented in the R package GFM; (2) Multi-response reduced-rank Poisson regression model (MMMR, Luo et al. 2018) implemented in rrpack R package; and (3) the multi-study covariate-augmented overdispersed Poisson factor (MultiCOAP) model.

(1). First, we implemented the generalized factor model (GFM) and record the metrics that measure the estimation accuracy and computational cost.


res_gfm <- gfm_run(XList, numvarmat, q=q)
#> Starting the two-step method with varitional EM in the first step...
#> iter = 2, ELBO= 11729687952.473068, dELBO=1.000000 
#> iter = 3, ELBO= 11729712412.413713, dELBO=0.000002 
#> iter = 4, ELBO= 11729725556.903790, dELBO=0.000001 
#> iter = 5, ELBO= 11729733394.809568, dELBO=0.000001 
#> iter = 6, ELBO= 11729738466.541033, dELBO=0.000000 
#> iter = 7, ELBO= 11729741961.208078, dELBO=0.000000 
#> iter = 8, ELBO= 11729744487.668800, dELBO=0.000000 
#> iter = 9, ELBO= 11729746383.293159, dELBO=0.000000 
#> iter = 10, ELBO= 11729747847.959942, dELBO=0.000000 
#> iter = 11, ELBO= 11729749006.791277, dELBO=0.000000 
#> iter = 12, ELBO= 11729749941.723318, dELBO=0.000000 
#> iter = 13, ELBO= 11729750708.447094, dELBO=0.000000 
#> iter = 14, ELBO= 11729751346.019690, dELBO=0.000000 
#> iter = 15, ELBO= 11729751882.570513, dELBO=0.000000 
#> iter = 16, ELBO= 11729752338.830721, dELBO=0.000000 
#> iter = 17, ELBO= 11729752730.388096, dELBO=0.000000 
#> iter = 18, ELBO= 11729753069.170321, dELBO=0.000000 
#> iter = 19, ELBO= 11729753364.447330, dELBO=0.000000 
#> iter = 20, ELBO= 11729753623.518108, dELBO=0.000000 
#> iter = 21, ELBO= 11729753852.200689, dELBO=0.000000 
#> iter = 22, ELBO= 11729754055.179434, dELBO=0.000000 
#> iter = 23, ELBO= 11729754236.259623, dELBO=0.000000 
#> iter = 24, ELBO= 11729754398.560213, dELBO=0.000000 
#> iter = 25, ELBO= 11729754544.654888, dELBO=0.000000 
#> iter = 26, ELBO= 11729754676.683508, dELBO=0.000000 
#> iter = 27, ELBO= 11729754796.436195, dELBO=0.000000 
#> iter = 28, ELBO= 11729754905.421116, dELBO=0.000000 
#> iter = 29, ELBO= 11729755004.916752, dELBO=0.000000 
#> iter = 30, ELBO= 11729755096.012402, dELBO=0.000000
#> Finish the two-step method
metricList$F_tr[2] <- meanTr(res_gfm$hF, datlist$F0List)
metricList$A_tr[2] <-metric_mean(AList=res_gfm$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$time[2] <- res_gfm$time.use

(2). Then, we implemented Multi-response reduced-rank Poisson regression model (MMMR) and recorded the metrics. Here, we truncate values to ensure normal working of this function. Otherwise, it will produce error.

res_mrrr <- mrrr_run(XList, ZList, numvarmat, q, truncflag=TRUE, trunc=500)
#> Loading required package: rrpack
#> Warning: package 'rrpack' was built under R version 4.4.2
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653  
#> iter =  1  obj/Cnorm_diff =  49345653
metricList$F_tr[3] <- meanTr(res_mrrr$hF, datlist$F0List)
metricList$A_tr[3] <- metric_mean(AList=res_mrrr$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$beta_norm[3] <-normvec(res_mrrr$hbeta - Reduce(cbind,datlist$betaList))
metricList$time[3] <- res_mrrr$time.use

Thirdly, we implemented the zero-inflated Poisson factor model:

source("https://raw.githubusercontent.com/feiyoung/MMGFM/refs/heads/main/simu_code/MSFR_main_R_MSFR_V1.R")
## To produce results in limited time, here we set maxIter=5. Even Set maxIter=1e4, the result is also not good.
res_msfr <- MSFR_run(XList, ZList, numvarmat, q, qs=qsvec, maxIter=5, load.source=TRUE, log.transform=TRUE)
#> Loading required package: psych
#> 
#> Attaching package: 'psych'
#> The following object is masked _by_ '.GlobalEnv':
#> 
#>     tr
#> The following objects are masked from 'package:scales':
#> 
#>     alpha, rescale
#> The following objects are masked from 'package:ggplot2':
#> 
#>     %+%, alpha
#> Loading required namespace: GPArotation
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is  used
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is  used
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is  used
#> [1] 5
metricList$F_tr[4] <- meanTr(res_msfr$hF, datlist$F0List)
metricList$H_tr[4] <- meanTr(res_msfr$hH, datlist$H0List)
metricList$A_tr[4] <- metric_mean(AList=res_msfr$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[4] <- mean(ms_metric_mean(res_msfr$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[4] <- normvec(t(res_msfr$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[4] <- res_msfr$time.use

Fourthly, we also applied the the multi-study covariate-augmented overdispersed Poisson factor (MultiCOAP) model.

res_mcoap <- multicoap_run(XcList=XList, ZList,numvarmat, q, qsvec)
#> iter = 2, ELBO= 11729715485.275578, dELBO=6.462074 
#> iter = 3, ELBO= 11729747226.936024, dELBO=0.000003 
#> iter = 4, ELBO= 11729763609.716249, dELBO=0.000001 
#> iter = 5, ELBO= 11729773001.971945, dELBO=0.000001 
#> iter = 6, ELBO= 11729778867.128466, dELBO=0.000001 
#> iter = 7, ELBO= 11729782776.887302, dELBO=0.000000 
#> iter = 8, ELBO= 11729785519.242598, dELBO=0.000000 
#> iter = 9, ELBO= 11729787522.346491, dELBO=0.000000 
#> iter = 10, ELBO= 11729789034.309420, dELBO=0.000000 
#> iter = 11, ELBO= 11729790206.774778, dELBO=0.000000 
#> iter = 12, ELBO= 11729791136.669062, dELBO=0.000000 
#> iter = 13, ELBO= 11729791888.343462, dELBO=0.000000 
#> iter = 14, ELBO= 11729792505.929955, dELBO=0.000000 
#> iter = 15, ELBO= 11729793020.549742, dELBO=0.000000 
#> iter = 16, ELBO= 11729793454.684090, dELBO=0.000000 
#> iter = 17, ELBO= 11729793824.916128, dELBO=0.000000 
#> iter = 18, ELBO= 11729794143.706781, dELBO=0.000000 
#> iter = 19, ELBO= 11729794420.574137, dELBO=0.000000 
#> iter = 20, ELBO= 11729794662.892561, dELBO=0.000000 
#> iter = 21, ELBO= 11729672354.476299, dELBO=0.000010 
#> iter = 22, ELBO= 11729730057.782242, dELBO=0.000005 
#> iter = 23, ELBO= 11729778379.378492, dELBO=0.000004 
#> iter = 24, ELBO= 11729790503.051077, dELBO=0.000001 
#> iter = 25, ELBO= 11729794380.226938, dELBO=0.000000 
#> iter = 26, ELBO= 11729795686.215660, dELBO=0.000000 
#> iter = 27, ELBO= 11729796153.900805, dELBO=0.000000 
#> iter = 28, ELBO= 11729796344.934231, dELBO=0.000000 
#> iter = 29, ELBO= 11729796443.444094, dELBO=0.000000 
#> iter = 30, ELBO= 11729796509.738033, dELBO=0.000000
metricList$F_tr[5] <- meanTr(res_mcoap$hF, datlist$F0List)
metricList$H_tr[5] <-meanTr(res_mcoap$hH, datlist$H0List)
metricList$A_tr[5] <- metric_mean(AList=res_mcoap$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[5] <- mean(ms_metric_mean(res_mcoap$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[5] <- normvec(t(res_mcoap$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[5] <- res_mcoap$time.use

Comparison of performance

Next, we summarized the metrics for MMGFM and other compared methods in a data.frame object. We observed that MMGFM achieved much better estimation accuracy for the quantities of interest.

mat.metric <- round(Reduce(rbind, metricList),3)
row.names(mat.metric) <- names(metricList)

dat_metric <- as.data.frame(mat.metric)
DT::datatable(dat_metric)

Session Info

sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Tbilisi
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] MultiCOAP_1.1    psych_2.4.6.26   rrpack_0.1-13    scales_1.3.0    
#>  [5] ggplot2_3.5.1    GFM_1.2.1        doSNOW_1.0.20    snow_0.4-4      
#>  [9] iterators_1.0.14 foreach_1.5.2    MMGFM_1.1.0      irlba_2.3.5.1   
#> [13] Matrix_1.7-0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] GPArotation_2024.3-1 sass_0.4.9           utf8_1.2.4          
#>  [4] generics_0.1.3       shape_1.4.6.1        lattice_0.22-6      
#>  [7] digest_0.6.37        magrittr_2.0.3       evaluate_1.0.0      
#> [10] grid_4.4.1           fastmap_1.2.0        glmnet_4.1-8        
#> [13] jsonlite_1.8.9       survival_3.6-4       fansi_1.0.6         
#> [16] crosstalk_1.2.1      codetools_0.2-20     textshaping_0.4.0   
#> [19] jquerylib_0.1.4      mnormt_2.1.1         cli_3.6.3           
#> [22] rlang_1.1.4          splines_4.4.1        munsell_0.5.1       
#> [25] withr_3.0.1          cachem_1.1.0         yaml_2.3.10         
#> [28] tools_4.4.1          dplyr_1.1.4          colorspace_2.1-1    
#> [31] DT_0.33              vctrs_0.6.5          R6_2.5.1            
#> [34] lifecycle_1.0.4      fs_1.6.4             htmlwidgets_1.6.4   
#> [37] MASS_7.3-60.2        ragg_1.3.3           pkgconfig_2.0.3     
#> [40] desc_1.4.3           pkgdown_2.1.1        bslib_0.8.0         
#> [43] pillar_1.9.0         gtable_0.3.5         glue_1.7.0          
#> [46] Rcpp_1.0.13          statmod_1.5.0        systemfonts_1.1.0   
#> [49] highr_0.11           xfun_0.47            tibble_3.2.1        
#> [52] tidyselect_1.2.1     rstudioapi_0.16.0    knitr_1.48          
#> [55] farver_2.1.2         nlme_3.1-164         htmltools_0.5.8.1   
#> [58] labeling_0.4.3       rmarkdown_2.28       compiler_4.4.1

Wei Liu

2025-02-04

Generate the simulated data

Compare with other methods

Comparison of performance