simu_MMGFM.Rmd
This vignette introduces the usage of MMGFM for the analysis of high-dimensional multi-study multi-modality data with additional covariates, by comparison with other methods.
The package can be loaded with the command:
library(MMGFM)
#> Loading required package: irlba
#> Loading required package: Matrix
#> MMGFM : We introduce a generalized factor model designed to jointly analyze high-dimensional multi-modality data from multiple studies by extracting study-shared and specified factors. Our factor models account for heterogeneous noises and overdispersion among modality variables with augmented covariates. We propose an efficient and speedy variational estimation procedure for estimating model parameters, along with a novel criterion for selecting the optimal number of factors. More details can be referred to Liu et al. (2024) <doi:10.48550/arXiv.2408.10542>.
Next, we load some functions for subsequent use.
source("https://raw.githubusercontent.com/feiyoung/MMGFM/refs/heads/main/simu_code/definedFunc.R")
#> Loading required package: doSNOW
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: snow
#> Loading required package: parallel
#>
#> Attaching package: 'parallel'
#> The following objects are masked from 'package:snow':
#>
#> closeNode, clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
#> parCapply, parLapply, parRapply, parSapply, recvData, recvOneData,
#> sendData, splitIndices, stopCluster
#> GFM : Generalized factor model is implemented for ultra-high dimensional data with mixed-type variables.
#> Two algorithms, variational EM and alternate maximization, are designed to implement the generalized factor model,
#> respectively. The factor matrix and loading matrix together with the number of factors can be well estimated.
#> This model can be employed in social and behavioral sciences, economy and finance, and genomics,
#> to extract interpretable nonlinear factors. More details can be referred to
#> Wei Liu, Huazhen Lin, Shurong Zheng and Jin Liu. (2021) <doi:10.1080/01621459.2021.1999818>. Check out our Package website (https://feiyoung.github.io/GFM/docs/index.html) for a more complete description of the methods and analyses
First, we generate the simulated data from three data sources and three modalities, with each modality consisting of count variables.
N <- 100
q <- 3
qsvec <- rep(2,3)
sigma_eps <- 1
datlist <- gendata_mmgfm(seed = 1, nvec = c(300, 200, 100),
pveclist = list('poisson'=c(50, 150, 200)),
q = q, d= 3,qs = qsvec, rho = 2, rho_z=0.5,
sigmavec=1, sigma_eps=sigma_eps)
XList <- datlist$XList
max(unlist(XList))
#> [1] 76909083
print(str(XList))
#> List of 3
#> $ :List of 1
#> ..$ : int [1:300, 1:400] 0 2 1 0 4 12 48 296 911 36 ...
#> $ :List of 1
#> ..$ : int [1:200, 1:400] 0 50 11 25 127 1 13 485 0 0 ...
#> $ :List of 1
#> ..$ : int [1:100, 1:400] 32 1 48 50 1 602 2 0 40 166 ...
#> NULL
ZList <- datlist$ZList # covariates
print(head(ZList[[1]]))
#> [,1] [,2] [,3]
#> [1,] 1 1.44901647 1.1020410
#> [2,] 1 0.67563588 1.3507231
#> [3,] 1 0.41854494 1.0426304
#> [4,] 1 1.03062242 -0.5053867
#> [5,] 1 0.50238638 1.3168351
#> [6,] 1 -0.03052584 -0.2719483
tauList <- datlist$tauList # offset term
numvarmat <- datlist$numvarmat
Fit the MMGFM model using the function MMGFM()
in the R
package MMGFM
. Users can use ?MMGFM
to see the
details about this function
system.time({
tic <- proc.time()
reslist <- MMGFM(XList, ZList=ZList, numvarmat, q=q, qsvec = qsvec, init='MSFRVI')
toc <- proc.time()
time_MMGFM <- toc[3] - tic[3]
})
#> Data include 3 studies/sources
#> variables belong to 1 types: poisson
#> Initialization...
#> Initialization using MSFR...
#> Finish Initialization!
#> Initialize the paramters related to s!
#> Initialize the paramters not related to s!
#> iter = 2, ELBO= 11729754929.875647, dELBO=1.000000
#> iter = 3, ELBO= 11729771481.147783, dELBO=0.000001
#> iter = 4, ELBO= 11729781373.544777, dELBO=0.000001
#> iter = 5, ELBO= 11729788256.960657, dELBO=0.000001
#> iter = 6, ELBO= 11729793446.962976, dELBO=0.000000
#> iter = 7, ELBO= 11729797561.288710, dELBO=0.000000
#> iter = 8, ELBO= 11729800893.069698, dELBO=0.000000
#> iter = 9, ELBO= 11729803634.156563, dELBO=0.000000
#> iter = 10, ELBO= 11729805950.517712, dELBO=0.000000
#> iter = 11, ELBO= 11729807974.774397, dELBO=0.000000
#> iter = 12, ELBO= 11729809783.381897, dELBO=0.000000
#> iter = 13, ELBO= 11729811396.493315, dELBO=0.000000
#> iter = 14, ELBO= 11729812805.426003, dELBO=0.000000
#> iter = 15, ELBO= 11729814005.241312, dELBO=0.000000
#> iter = 16, ELBO= 11729815009.644564, dELBO=0.000000
#> iter = 17, ELBO= 11729815847.666649, dELBO=0.000000
#> iter = 18, ELBO= 11729816554.466454, dELBO=0.000000
#> iter = 19, ELBO= 11729817164.732288, dELBO=0.000000
#> iter = 20, ELBO= 11729817710.103460, dELBO=0.000000
#> iter = 21, ELBO= 11729818218.928200, dELBO=0.000000
#> iter = 22, ELBO= 11729818716.410471, dELBO=0.000000
#> iter = 23, ELBO= 11729819223.703918, dELBO=0.000000
#> iter = 24, ELBO= 11729819755.263786, dELBO=0.000000
#> iter = 25, ELBO= 11729820315.107479, dELBO=0.000000
#> iter = 26, ELBO= 11729820894.376608, dELBO=0.000000
#> iter = 27, ELBO= 11729821473.087643, dELBO=0.000000
#> iter = 28, ELBO= 11729822026.580221, dELBO=0.000000
#> iter = 29, ELBO= 11729822533.493935, dELBO=0.000000
#> iter = 30, ELBO= 11729822981.000059, dELBO=0.000000
#> user system elapsed
#> 5.11 0.39 1.79
Check the increased property of the envidence lower bound function.
library(ggplot2)
library(scales)
dat_iter <- data.frame(iter=1:length(reslist$ELBO_seq), ELBO=reslist$ELBO_seq)
ggplot(data=dat_iter, aes(x=iter, y=ELBO)) + geom_line() + geom_point() + theme_bw(base_size = 20) +
scale_y_continuous(labels = label_scientific(digits = 8))
We calculate the metrics to measure the estimation accuracy, where
the mean trace statistic is used to measure the estimation accuracy of
loading matrix and prediction accuracy of factor matrix, which is
evaluated by the function measurefun()
in the R package
GFM
, and the root of mean absolute error is adopted to
measure the estimation error of beta.
methodNames <- c("MMGFM", "GFM", "MRRR", "MSFR", "MultiCOAP")
n_methods <- length(methodNames)
metricList <- list(F_tr = rep(NA, n_methods),
H_tr = rep(NA, n_methods),
V_tr = rep(NA, n_methods),
A_tr = rep(NA, n_methods),
B_tr = rep(NA, n_methods),
beta_norm=rep(NA, n_methods),
time = rep(NA, n_methods))
for(ii in seq_along(metricList)) names(metricList[[ii]]) <- methodNames
metricList$F_tr[1] <- meanTr(reslist$hF, datlist$F0List)
metricList$H_tr[1] <-meanTr(reslist$hH, datlist$H0List)
metricList$V_tr[1] <-meanTr(lapply(reslist$hv, function(x) Reduce(cbind,x) ), datlist$VList)
metricList$A_tr[1] <-metric_mean(AList=reslist$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[1] <- mean(ms_metric_mean(reslist$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[1] <-normvec(Reduce(cbind, reslist$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[1] <- reslist$time.use
We compare MMGFM
with various prominent methods in the
literature. They are (1) Generalized factor model (Liu et al. 2023)
implemented in the R package GFM
; (2) Multi-response
reduced-rank Poisson regression model (MMMR, Luo et al. 2018)
implemented in rrpack
R package; and (3) the multi-study
covariate-augmented overdispersed Poisson factor (MultiCOAP) model.
(1). First, we implemented the generalized factor model (GFM) and record the metrics that measure the estimation accuracy and computational cost.
res_gfm <- gfm_run(XList, numvarmat, q=q)
#> Starting the two-step method with varitional EM in the first step...
#> iter = 2, ELBO= 11729687952.473068, dELBO=1.000000
#> iter = 3, ELBO= 11729712412.413713, dELBO=0.000002
#> iter = 4, ELBO= 11729725556.903790, dELBO=0.000001
#> iter = 5, ELBO= 11729733394.809568, dELBO=0.000001
#> iter = 6, ELBO= 11729738466.541033, dELBO=0.000000
#> iter = 7, ELBO= 11729741961.208078, dELBO=0.000000
#> iter = 8, ELBO= 11729744487.668800, dELBO=0.000000
#> iter = 9, ELBO= 11729746383.293159, dELBO=0.000000
#> iter = 10, ELBO= 11729747847.959942, dELBO=0.000000
#> iter = 11, ELBO= 11729749006.791277, dELBO=0.000000
#> iter = 12, ELBO= 11729749941.723318, dELBO=0.000000
#> iter = 13, ELBO= 11729750708.447094, dELBO=0.000000
#> iter = 14, ELBO= 11729751346.019690, dELBO=0.000000
#> iter = 15, ELBO= 11729751882.570513, dELBO=0.000000
#> iter = 16, ELBO= 11729752338.830721, dELBO=0.000000
#> iter = 17, ELBO= 11729752730.388096, dELBO=0.000000
#> iter = 18, ELBO= 11729753069.170321, dELBO=0.000000
#> iter = 19, ELBO= 11729753364.447330, dELBO=0.000000
#> iter = 20, ELBO= 11729753623.518108, dELBO=0.000000
#> iter = 21, ELBO= 11729753852.200689, dELBO=0.000000
#> iter = 22, ELBO= 11729754055.179434, dELBO=0.000000
#> iter = 23, ELBO= 11729754236.259623, dELBO=0.000000
#> iter = 24, ELBO= 11729754398.560213, dELBO=0.000000
#> iter = 25, ELBO= 11729754544.654888, dELBO=0.000000
#> iter = 26, ELBO= 11729754676.683508, dELBO=0.000000
#> iter = 27, ELBO= 11729754796.436195, dELBO=0.000000
#> iter = 28, ELBO= 11729754905.421116, dELBO=0.000000
#> iter = 29, ELBO= 11729755004.916752, dELBO=0.000000
#> iter = 30, ELBO= 11729755096.012402, dELBO=0.000000
#> Finish the two-step method
metricList$F_tr[2] <- meanTr(res_gfm$hF, datlist$F0List)
metricList$A_tr[2] <-metric_mean(AList=res_gfm$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$time[2] <- res_gfm$time.use
(2). Then, we implemented Multi-response reduced-rank Poisson regression model (MMMR) and recorded the metrics. Here, we truncate values to ensure normal working of this function. Otherwise, it will produce error.
res_mrrr <- mrrr_run(XList, ZList, numvarmat, q, truncflag=TRUE, trunc=500)
#> Loading required package: rrpack
#> Warning: package 'rrpack' was built under R version 4.4.2
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
#> iter = 1 obj/Cnorm_diff = 49345653
metricList$F_tr[3] <- meanTr(res_mrrr$hF, datlist$F0List)
metricList$A_tr[3] <- metric_mean(AList=res_mrrr$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$beta_norm[3] <-normvec(res_mrrr$hbeta - Reduce(cbind,datlist$betaList))
metricList$time[3] <- res_mrrr$time.use
source("https://raw.githubusercontent.com/feiyoung/MMGFM/refs/heads/main/simu_code/MSFR_main_R_MSFR_V1.R")
## To produce results in limited time, here we set maxIter=5. Even Set maxIter=1e4, the result is also not good.
res_msfr <- MSFR_run(XList, ZList, numvarmat, q, qs=qsvec, maxIter=5, load.source=TRUE, log.transform=TRUE)
#> Loading required package: psych
#>
#> Attaching package: 'psych'
#> The following object is masked _by_ '.GlobalEnv':
#>
#> tr
#> The following objects are masked from 'package:scales':
#>
#> alpha, rescale
#> The following objects are masked from 'package:ggplot2':
#>
#> %+%, alpha
#> Loading required namespace: GPArotation
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is used
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is used
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> In smc, smcs < 0 were set to .0
#> Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined.
#> Chi square is based upon observed residuals.
#> The determinant of the smoothed correlation was zero.
#> This means the objective function is not defined for the null model either.
#> The Chi square is thus based upon observed correlations.
#> In factor.scores, the correlation matrix is singular, the pseudo inverse is used
#> [1] 5
metricList$F_tr[4] <- meanTr(res_msfr$hF, datlist$F0List)
metricList$H_tr[4] <- meanTr(res_msfr$hH, datlist$H0List)
metricList$A_tr[4] <- metric_mean(AList=res_msfr$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[4] <- mean(ms_metric_mean(res_msfr$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[4] <- normvec(t(res_msfr$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[4] <- res_msfr$time.use
res_mcoap <- multicoap_run(XcList=XList, ZList,numvarmat, q, qsvec)
#> iter = 2, ELBO= 11729715485.275578, dELBO=6.462074
#> iter = 3, ELBO= 11729747226.936024, dELBO=0.000003
#> iter = 4, ELBO= 11729763609.716249, dELBO=0.000001
#> iter = 5, ELBO= 11729773001.971945, dELBO=0.000001
#> iter = 6, ELBO= 11729778867.128466, dELBO=0.000001
#> iter = 7, ELBO= 11729782776.887302, dELBO=0.000000
#> iter = 8, ELBO= 11729785519.242598, dELBO=0.000000
#> iter = 9, ELBO= 11729787522.346491, dELBO=0.000000
#> iter = 10, ELBO= 11729789034.309420, dELBO=0.000000
#> iter = 11, ELBO= 11729790206.774778, dELBO=0.000000
#> iter = 12, ELBO= 11729791136.669062, dELBO=0.000000
#> iter = 13, ELBO= 11729791888.343462, dELBO=0.000000
#> iter = 14, ELBO= 11729792505.929955, dELBO=0.000000
#> iter = 15, ELBO= 11729793020.549742, dELBO=0.000000
#> iter = 16, ELBO= 11729793454.684090, dELBO=0.000000
#> iter = 17, ELBO= 11729793824.916128, dELBO=0.000000
#> iter = 18, ELBO= 11729794143.706781, dELBO=0.000000
#> iter = 19, ELBO= 11729794420.574137, dELBO=0.000000
#> iter = 20, ELBO= 11729794662.892561, dELBO=0.000000
#> iter = 21, ELBO= 11729672354.476299, dELBO=0.000010
#> iter = 22, ELBO= 11729730057.782242, dELBO=0.000005
#> iter = 23, ELBO= 11729778379.378492, dELBO=0.000004
#> iter = 24, ELBO= 11729790503.051077, dELBO=0.000001
#> iter = 25, ELBO= 11729794380.226938, dELBO=0.000000
#> iter = 26, ELBO= 11729795686.215660, dELBO=0.000000
#> iter = 27, ELBO= 11729796153.900805, dELBO=0.000000
#> iter = 28, ELBO= 11729796344.934231, dELBO=0.000000
#> iter = 29, ELBO= 11729796443.444094, dELBO=0.000000
#> iter = 30, ELBO= 11729796509.738033, dELBO=0.000000
metricList$F_tr[5] <- meanTr(res_mcoap$hF, datlist$F0List)
metricList$H_tr[5] <-meanTr(res_mcoap$hH, datlist$H0List)
metricList$A_tr[5] <- metric_mean(AList=res_mcoap$hA, datlist$A0List, align='unaligned', numvarmat = numvarmat)
metricList$B_tr[5] <- mean(ms_metric_mean(res_mcoap$hB, datlist$B0List, align='unaligned', numvarmat = numvarmat))
metricList$beta_norm[5] <- normvec(t(res_mcoap$hbeta)- Reduce(cbind,datlist$betaList))
metricList$time[5] <- res_mcoap$time.use
Next, we summarized the metrics for MMGFM and other compared methods in a data.frame object. We observed that MMGFM achieved much better estimation accuracy for the quantities of interest.
mat.metric <- round(Reduce(rbind, metricList),3)
row.names(mat.metric) <- names(metricList)
dat_metric <- as.data.frame(mat.metric)
DT::datatable(dat_metric)
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.utf8
#>
#> time zone: Asia/Tbilisi
#> tzcode source: internal
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MultiCOAP_1.1 psych_2.4.6.26 rrpack_0.1-13 scales_1.3.0
#> [5] ggplot2_3.5.1 GFM_1.2.1 doSNOW_1.0.20 snow_0.4-4
#> [9] iterators_1.0.14 foreach_1.5.2 MMGFM_1.1.0 irlba_2.3.5.1
#> [13] Matrix_1.7-0
#>
#> loaded via a namespace (and not attached):
#> [1] GPArotation_2024.3-1 sass_0.4.9 utf8_1.2.4
#> [4] generics_0.1.3 shape_1.4.6.1 lattice_0.22-6
#> [7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.0
#> [10] grid_4.4.1 fastmap_1.2.0 glmnet_4.1-8
#> [13] jsonlite_1.8.9 survival_3.6-4 fansi_1.0.6
#> [16] crosstalk_1.2.1 codetools_0.2-20 textshaping_0.4.0
#> [19] jquerylib_0.1.4 mnormt_2.1.1 cli_3.6.3
#> [22] rlang_1.1.4 splines_4.4.1 munsell_0.5.1
#> [25] withr_3.0.1 cachem_1.1.0 yaml_2.3.10
#> [28] tools_4.4.1 dplyr_1.1.4 colorspace_2.1-1
#> [31] DT_0.33 vctrs_0.6.5 R6_2.5.1
#> [34] lifecycle_1.0.4 fs_1.6.4 htmlwidgets_1.6.4
#> [37] MASS_7.3-60.2 ragg_1.3.3 pkgconfig_2.0.3
#> [40] desc_1.4.3 pkgdown_2.1.1 bslib_0.8.0
#> [43] pillar_1.9.0 gtable_0.3.5 glue_1.7.0
#> [46] Rcpp_1.0.13 statmod_1.5.0 systemfonts_1.1.0
#> [49] highr_0.11 xfun_0.47 tibble_3.2.1
#> [52] tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.48
#> [55] farver_2.1.2 nlme_3.1-164 htmltools_0.5.8.1
#> [58] labeling_0.4.3 rmarkdown_2.28 compiler_4.4.1