Joint dimension reduction and spatial clustering

Joint dimension reduction and spatial clustering for scRNA-seq and spatial transcriptomics data

DR.SC_fit(X, K, Adj_sp=NULL, q=15,
             error.heter= TRUE, beta_grid=seq(0.5, 5, by=0.5),
             maxIter=25, epsLogLik=1e-5, verbose=FALSE, maxIter_ICM=6,
             wpca.int=FALSE, int.model="EEE", approxPCA=FALSE, coreNum = 5)

Arguments

X: a sparse matrix of class dgCMatrix or a dense matrix, specifying the log-normalized gene expression matrix used by the DR-SC model.
K: a positive integer, either a scalar or a vector, specifying the number of clusters (or the candidate numbers of clusters) used in model fitting.
Adj_sp: an optional sparse matrix of class dgCMatrix, specifying the adjacency matrix used by the DR-SC model. This interface is provided for users who want to define the adjacency matrix themselves.
q: a positive integer, specifying the number of latent features to extract; the default is 15. The choice of q is a trade-off between model complexity and fit to the data, and depends on the goals of the analysis and the structure of the data. A higher value gives a more complex model with more parameters, which may lead to overfitting and poor generalization. A lower value gives a simpler model with fewer parameters, which may lead to underfitting and a poorer fit to the data.
error.heter: an optional logical value, whether to use heterogeneous errors in the DR-SC model; the default is TRUE. If error.heter=FALSE, homogeneous errors are used in the probabilistic PCA model in DR-SC.
beta_grid: an optional vector of positive values, the candidate set for the smoothing parameter, which is searched by grid search.
maxIter: an optional positive integer, the maximum number of EM iterations.
epsLogLik: an optional positive value, the tolerance for the relative change of the observed pseudo log-likelihood; the default is 1e-5.
verbose: an optional logical value, whether to print information about the ICM-EM algorithm.
maxIter_ICM: an optional positive integer, the maximum number of ICM iterations.
wpca.int: an optional logical value, whether to use weighted PCA to obtain the initial values of the loadings and other parameters; the default is FALSE, which means ordinary PCA is used.
int.model: an optional string, specifying which Gaussian mixture model is used to evaluate the initial values for DR-SC; the default is "EEE". See Mclust for the other model names.
approxPCA: an optional logical value, whether to use approximate PCA to speed up the computation of the initial values.
coreNum: an optional positive integer, the number of threads used in parallel computing; the default is 5. If K has length one, coreNum is automatically set to 1.
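
For the Adj_sp argument, the following is a minimal sketch of how a user-defined adjacency matrix of class dgCMatrix could be built with the Matrix package. The helper lattice_adj is hypothetical (it is not part of DR.SC), and both the 4-neighbor rule and the column-by-column spot numbering are assumptions made for illustration. In practice the rows and columns of the matrix presumably need to follow the same spot order as the rows of X.

```r
library(Matrix)

# Hypothetical helper (not part of DR.SC): 4-neighbor adjacency on a
# height x width lattice, with spots numbered column by column.
lattice_adj <- function(height, width) {
  n <- height * width
  idx <- matrix(seq_len(n), nrow = height, ncol = width)
  # Pairs of vertically adjacent spots (same column, consecutive rows).
  up    <- as.vector(idx[-height, , drop = FALSE])
  down  <- as.vector(idx[-1, , drop = FALSE])
  # Pairs of horizontally adjacent spots (same row, consecutive columns).
  left  <- as.vector(idx[, -width, drop = FALSE])
  right <- as.vector(idx[, -1, drop = FALSE])
  from <- c(up, left)
  to   <- c(down, right)
  # Enter every edge in both directions so the matrix is symmetric.
  sparseMatrix(i = c(from, to), j = c(to, from), x = 1, dims = c(n, n))
}

Adj_custom <- lattice_adj(height = 10, width = 10)
class(Adj_custom)             # sparse column-compressed matrix class
range(rowSums(Adj_custom))    # fewest and most neighbors per spot
```

A matrix built this way could then be passed as Adj_sp in place of the one returned by getAdj, for example when the spot layout is not one of the platforms that getAdj handles.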

Value

DR.SC_fit returns a list of class "drscObject" with the following three components:

Objdrsc: a list of model fitting results, with one element for each value in K.
out_param: a numeric matrix used for model selection by MBIC.
K_set: a scalar or vector equal to the input argument K.

In addition, each element of "Objdrsc" is a list with the following components:

cluster: inferred class labels.
hZ: extracted latent features.
beta: estimated smoothing parameter.
Mu: mean vectors of the mixture components.
Sigma: covariance matrices of the mixture components.
W: estimated loading matrix.
Lam_vec: estimated error variances in the probabilistic PCA model.
loglik: pseudo observed log-likelihood.
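
As an illustration of the layout described above, the sketch below builds a stand-in list and shows one way to pull out the fit for a given K. The object drsc_mock and the helper mock_fit are hypothetical and hold placeholder values, not real DR.SC output. The sketch also assumes that the elements of Objdrsc are stored in the same order as K_set and that hZ has one row per spot and q columns; neither is stated explicitly in this page.

```r
set.seed(1)

# Hypothetical stand-in that mimics only the documented layout; the values
# are placeholders, not output of DR.SC_fit.
mock_fit <- function(n, q, K) {
  list(cluster = sample.int(K, n, replace = TRUE),  # placeholder labels
       hZ      = matrix(0, nrow = n, ncol = q))     # placeholder features
}

K_set <- 3:5
drsc_mock <- list(
  Objdrsc = lapply(K_set, function(k) mock_fit(n = 100, q = 15, K = k)),
  K_set   = K_set
)

# Select the fit for K = 4 by position in K_set (ordering is an assumption).
fit_k4 <- drsc_mock$Objdrsc[[match(4, drsc_mock$K_set)]]
dim(fit_k4$hZ)          # assumed layout: spots x q
table(fit_k4$cluster)   # cluster sizes
```

With a real "drscObject" the same indexing pattern would apply to the object returned by DR.SC_fit, provided the ordering assumption holds.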

References

Wei Liu, Xu Liao, Yi Yang, Huazhen Lin, Joe Yeong, Xiang Zhou, Xingjie Shi & Jin Liu (2022). Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data, Nucleic Acids Research, gkac219.

Author

Wei Liu

Examples

## Generate spatial transcriptomics data with a lattice neighborhood, i.e., the ST platform.
library(DR.SC)
seu <- gendata_RNAExp(height=10, width=10, p=50, K=4)
library(Seurat)
seu <- NormalizeData(seu, verbose=FALSE)
# choose 40 highly variable features using FindVariableFeatures in Seurat
# seu <- FindVariableFeatures(seu, nfeatures = 40)
# or choose 40 spatially variable features using FindSVGs in DR.SC
seu <- FindSVGs(seu, nfeatures = 40, verbose=FALSE)
# build the adjacency matrix explicitly so it can be passed as Adj_sp
Adj_sp <- getAdj(seu, platform = 'ST')
#> Neighbors were identified for 100 out of 100 spots.
# the slot access below assumes a Seurat v4-style Assay object
var.features <- seu@assays$RNA@var.features
X <- Matrix::t(seu[["RNA"]]@data[var.features,])
# maxIter = 2 is used only for illustration; users can keep the default.
drscList <- DR.SC_fit(X,Adj_sp=Adj_sp, K=4, maxIter=2, verbose=TRUE)
#> Fit DR-SC model...
#> -------------------Calculate inital values-------------
#> Using accurate PCA to obtain initial values
#> -------------------Finish computing inital values------------- 
#> -------------------Starting  ICM-EM algortihm-------------
#> iter = 2, loglik= -1293.974640, dloglik=0.999999 
#> -------------------Complete!-------------
#> elasped time is :0.06
#> Finish DR-SC model fitting
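
The epsLogLik argument is described as a tolerance on the relative change of the pseudo log-likelihood. The following base R sketch shows one common form of such a stopping rule. It is an assumption for illustration only: the exact formula used inside DR.SC_fit is not given in this page, the helper rel_change is hypothetical, and the log-likelihood trace is made up rather than taken from a DR-SC run.

```r
# Hypothetical helper: relative change between successive log-likelihoods.
rel_change <- function(loglik_new, loglik_old) {
  abs(loglik_new - loglik_old) / abs(loglik_old)
}

# Made-up log-likelihood trace for illustration; not DR.SC output.
loglik_trace <- c(-2000, -1400, -1300, -1299.99)
eps <- 1e-5

# Relative change at iterations 2, 3, ... compared with the previous one.
changes <- rel_change(loglik_trace[-1], loglik_trace[-length(loglik_trace)])

# First iteration whose relative change falls below the tolerance.
converged_at <- which(changes < eps)[1] + 1
converged_at
```

Under a rule of this kind, a small maxIter such as the value 2 used in the example stops the algorithm long before the tolerance is reached, which is why it is suitable only for illustration.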