## Abstract

The advancement in technologies and the growth of available single-cell datasets motivate integrative analysis of multiple single-cell genomic datasets. Integrative analysis of multimodal single-cell datasets combines complementary information offered by single-omic datasets and can offer deeper insights on complex biological process. Clustering methods that identify the unknown cell types are among the first few steps in the analysis of single-cell datasets, and they are important for downstream analysis built upon the identified cell types. We propose scAMACE for the integrative analysis and clustering of single-cell data on chromatin accessibility, gene expression and methylation. We demonstrate that cell types are better identified and characterized through analyzing the three data types jointly. We develop an efficient expectation-maximization (EM) algorithm to perform statistical inference, and evaluate our methods on both simulation study and real data applications. We also provide the GPU implementation of scAMACE, making it scalable to large datasets. The software and datasets are available at https://github.com/cuhklinlab/scAMACE_py (python implementation) and https://github.com/cuhklinlab/scAMACE (R implementation).

## 1 Introduction

Recent developments in single-cell technologies enable multiple measurements of different genomic features (Lahnemann *et al*., 2020). Sequencing technologies include single-cell RNA sequencing (scRNA-seq) which measures transcription, single-cell ATAC sequencing (scATAC-seq) and the assay based on combinatorial indexing (sci-ATAC-seq) (Cusanovich *et al*., 2018b) that measure chromatin accessibility, and single-nucleus methylcytosine sequencing (snmC-seq) (Luo *et al*., 2017) which meansures methylome at the single cell resolution. High technical variation is presented in single-cell datasets due to the limited amount of genomic materials and the experimental procedures to amplify the signals (Lahnemann *et al*., 2020).

Because cell types are usually unknown beforehand, clustering methods are needed to identify the cell types. Majority of existing clustering algorithms only take one single dataset as input. Beside the widely used K-Means clustering algorithm, hierarchical clustering (Ward, 1963) forms hierarchical groups of mutually exclusive subsets on the basis of their similarity with respect to specified characteristics by considering the union of all possible pairs and accepting the union with which an optimal value of the objective function is associated. Spectral Clustering (Ng *et al*., 2001) uses the top *k* eigenvectors of a matrix derived from the distance between points simultaneously for clustering. Several algorithms are developed specifically for scRNA-seq data. SC3 (Kiselev *et al*., 2017) combines multiple clustering outcomes through a consensus approach. SIMLR (Wang *et al*., 2017) learns a distance metric by multiple kernels and clusters with affinity propagation. CIDR (Lin *et al*., 2017) imputes the gene expression profiles, calculates the dissimilarity based on the imputed gene expression profiles for every pair of single cells, performs principal coordinate analysis using the dissimilarity matrix, and finally performs clustering using the first few principal coordinates. SOUP (Zhu *et al*., 2019) semi-softly classifies both pure and intermediate cell types: it first identifies the set of pure cells by special block structure and estimates a membership matrix, then estimates soft membership for the other cells. For the analysis of single-cell chromatin accessibility data, scABC (Zamanighomi *et al*., 2018) first weights cells and applies weighted K-medoids clustering, then calculate landmarks for each cluster, and finally clusters the cells by assignment to the closest landmark based on Spearman correlation. Cusanovich (Cusanovich *et al*., 2018a) makes use of singular value decomposition on TF-IDF transformed matrix and density peak clustering algorithm. cisTopic (Bravo Gonzá lez-Blas *et al*., 2019) uses latent Dirichlet allocation with a collapsed Gibbs sampler to iteratively optimize the region-topic distribution and the topic-cell distribution. SCALE (Xiong *et al*., 2019) combines the variational autoencoder framework with the Gaussian Mixture Model which extracts latent features that characterize the distributions of input scATAC-seq data, and then uses the latent features to cluster cell mixtures into subpopulations. Clustering methods are also developed for single-cell methylation data. BPRMeth (Kapourani and Sanguinetti, 2016) uses probabilistic machine learning to extract higher order features across a defined region and to cluster promoter-proximal regions by Binomial distributed probit regression (BPR) and mixture modeling. PDclust (Hui *et al*., 2018) leverages the methylation state of individual CpGs to obtain pairwise dissimilarity (PD) values, and calculates Euclidean distances between each pair of cells using their PD values and performed hierarchical clustering. Melissa (Kapourani and Sanguinetti, 2019) implements a Bayesian hierarchical model that jointly learns the methylation profiles of genomic regions of interest and clusters cells based on their genome-wide methylation patterns. pCSM (Yin *et al*., 2019) implements a semi-reference-free procedure to perform virtual methylome dissection using the nonnegative matrix factorization algorithm. It first determines putative cell-type-specific methylated loci and then clusters the loci into groups based on their correlations in methylation profiles.

Studies based on single-omic data provide only a partial landscape of the entire cellular heterogeneity (Ma *et al*., 2020). High technical noise and the growth of available datasets measuring different genomic features encourage integrative analysis (Lahnemann *et al*., 2020). By combining complementary information from multiple datasets, the cell types may be better seperated and characterized (Corces *et al*., 2016; Duren *et al*., 2017). The integrative analysis of gene expression and chromatin activity may better define cell types and lineages, especially in complex tissues (Duren *et al*., 2018). Seurat V3 (Stuart *et al*., 2019) uses Canonical Correlation Analysis (CCA) to reduce the dimension of the datasets. It identifies the pairwise correspondences of single cells across datasets, termed ‘anchors’, and then transfers labels from a reference dataset onto a query dataset. coupleNMF (Duren *et al*., 2018) is based on the coupling of two non-negative matrix factorizations, where a ‘soft’ clustering can be obtained following the matrix factorizations. It enables integrative analysis of scRNA-seq and scATAC-seq data. LIGER (Welch *et al*., 2019) integrates multimodal datasets via integrative non-negative matrix factorization (iNMF) to learn a low-dimensional space defined by dataset-specific factors and shared factors across datasets, and then build a neighborhood graph based on the shared factors to identify joint clusters by performing community detection on this graph. scACE (Lin *et al*., 2020) is a model-based approach that jointly analyzes single-cell chromatin accessibility and scRNA-Seq data, and it quantifies the uncertainty of cluster assignments. MAESTRO (Wang *et al*., 2020) integrates scRNA-seq and scATAC-seq data from multiple platforms. It also provides comprehensive functions for pre-processing, alignment, quality control, and quantification of expression and accessibility. coupleCoC (Zeng *et al*., 2020) performs co-clustering of the cells and the features simultaneously in the source data and the target data, and it also matches the cell clusters between the source data and the target data through minimizing the distribution divergence. scMC (Zhang and Nie, 2021) integrates multiple scRNA-Seq datasets or multiple scATAC-Seq datasets, where it learns biological variation via variance analysis to subtract technical variation inferred in an unsupervised manner. The three data types, including gene expression, chromatin accessibility and methylation, have distinct characteristics and complex relationships with each other. The aforementioned methods for integrative analysis are not designed to integrate all three data types. Moreover, these methods (except scACE) do not provide statistical inference on the cluster assignments, which may be important when there are cells at the intermediate stages during development.

In this work, we extend scACE (Lin *et al*., 2020) to scAMACE (integrative Analysis of single-cell Methylation, chromatin ACcessibility, and gene Expression). scAMACE considers the biological and technical variabilities when integrating multiple data types, and it can provide statistical inference on the assignment of clusters. We reason that by combining complementary biological information from multiple data types, better cell type seperation can be achieved. We present our model in Section 2, and statistical inference using the Expectation-Maximization (EM) algorithm in Section 3. Simulation study and real data applications are presented in Sections 4 and 5, respectively. The conclusion is presented in Section 6.

## 2 Methods

An overview of scAMACE is presented in Fig. 1.

### 2.1 Model for scRNA-Seq data

The model specification for scRNA-Seq data is as the following.

We assume that there are *K* cell clusters in total, the random variable *z*_{lk} denotes whether cell *l* belongs to cluster *k* ∈ {1, …, *K*}, and *z*_{l·} follows categorical distribution with probability for cluster *k*.

denotes the probability that gene *g* is active in cluster *k. u*_{lg} is a binary latent variable representing whether gene *g* is active in cell *l* and *u*_{lg} = 1 represents that it is active. *v*_{lg} denotes whether gene *g* is expressed in cell *l* and *v*_{lg} = 1 represents that it is expressed.

When gene *g* is active in cell *l* (*u*_{lg} = 1), the probability that gene *g* is expressed in cell *l* (*v*_{lg} = 1) is *π*_{l1}, while the probability that gene *g* is expressed is *π*_{l0} if the gene is not active (*u*_{lg} = 0). Since genes are more likely to be expressed when they are active, we assume that *π*_{l1} ≥ *π*_{l0} and the prior distributions of *π*_{l1} and *π*_{l0} are assumed to be flat.

Let *y*_{lg} denote the observed gene expression for gene *g* in cell *l* (after normalization to account for sequencing depth and gene length), and we assume that *y*_{lg} | *v*_{lg} follows a mixture distribution, where *g*_{1}(.) and *g*_{0}(.) are density functions of the expression level conditional on *v*_{lg}.

### 2.2 Model for single-cell chromatin accessibility (scCAS) data

The model specification for scCAS data is as the following.

The random variables and *u*_{ig} have similar interpretations to their corresponding variables in the model for scRNA-Seq data. We use a different notation *i* to represent that the cells in the scCAS data are different from the cells in the scRNA-Seq data.

*x*_{ig} denotes the observed gene score for gene *g* in cell *i*. The gene score summarizes the accessibility of the regions around the gene body (Cusanovich *et al*., 2018a). We model it by a mixture distribution with density functions *f*_{1}(.), *f*_{0}(.), and binary latent variable *o*_{ig}. *o*_{ig} = 1, and 0 represent the mixture components with high (*f*_{1}) and low (*f*_{0}) gene scores, respectively. Accessibility tends to be positively associated with activity of the gene. We model this positive relationship by the distribution *o*_{ig}|*u*_{ig}. When gene *g* is active in cell *i* (*u*_{ig} = 1), the probability that it has high gene score (*o*_{ig} = 1) is *π*_{i1}; When gene *g* is inactive in cell i (*u*_{ig} = 0), the probability that it has high gene score (*o*_{ig} = 1) is *π*_{i0}. We assume that *π*_{i1} ≥ *π*_{i0} to represent the positive relationship. In practice, we found that fixing *π*_{i0} = 0 leads to good real data performance, and we set *π*_{i0} = 0 by default. The prior distribution *π*_{i1} ∼ *Beta*(*α* = 1, *β* = 1). In real data example 1, the observed data is promoter accessibility and we use the same model as that for gene score.

We assume that follows Beta distribution with mean and precision *ϕ*^{acc}. The variable is connected with in scRNA-Seq data through the logit function: . Details on the specification of *f* (·) are presented in Section 2.6.

### 2.3 Model for single-cell methylation data

The model specification for sc-methylation data is as the following.

The random variables , and *u*_{dg} have similar interpretations to their corresponding variables in the model for scRNA-Seq data. We use a different notation *d* to represent that the cells in the sc-methylation data are different from the cells in the scRNA-Seq data.

The binary random variable *m*_{dg} denotes whether gene *g* is methylated in cell *d*, and *m*_{dg} = 1 represents that it is methylated. Methylation of a gene (promoter methylation/gene body methylation) tends to be negatively associated with activity of the gene, and we model this negative relationship with the model *m*_{dg} | *u*_{dg}: when the gene *g* is active in cell *d* (*u*_{dg} = 1), it is less likely to be methylated (*m*_{dg} = 1), as we assume that *π*_{d1} ≤ *π*_{d0}.

*t*_{dg} denotes the observed methylation level for gene *g* in cell *d*, and we assume that *t*_{dg} | *m*_{dg} follows a mixture distribution, where *h*_{1}(.) and *h*_{0}(.) are density functions conditional on *m*_{dg}. The technologies/features differ for the two real data applications to be presented: promoter methylation for the gene (Pott, 2017), and gene body methylation at non-CG sites (Luo *et al*., 2017).

Similar to scCAS data, we connect , which is the mean of , and through the logit function: . Details on specification of *g*(·) are presented in Section 2.6.

### 2.4 More on model specification

Methylation and chromatin accessibility regulate gene expression biologically. Our model is specified in the reverse order, so gene expression plays a central role. This is because scRNA-Seq data is usually less noisy compared with scCAS data and sc-methylation data, the model specified this way will improve the clustering performance of scCAS data and sc-methylation data, without sacrificing much the clustering performance of scRNA-Seq data.

### 2.5 Prior specifications

We assume the following priors for .

The prior specification *Beta* (*α* = 2, *β* = 2) improves the stability of the EM algorithm in Section 3 over uniform distritbution.

### 2.6 Determination of and *ϕ*^{met}

We assume that and . The parameters {*η, γ, τ, δ, θ, ϕ*^{acc}, *ϕ*^{met}} are estimated empirically from the datasets. We first set the number of clusters *K* = 1 and use the model to estimate and separately without considering the links on ** ω** across the three datasets, and then fix and to estimate {

*η, γ, τ, δ, θ, ϕ*

^{acc},

*ϕ*

^{met}} by beta regression (Silvia and Francisco, 2004). The rationale for fixing

*K*= 1 to estimating the parameters in the functions

*f*(.) and

*g*(.) that link the three modalities is that the majority of the features may not change much across the cell types. We fix when implementing the EM algorithm in Section 3. Estimating {

*η, γ, τ, δ, θ, ϕ*

^{acc},

*ϕ*

^{met}} separately from the EM algorithm improves computational efficiency and avoids problematic local modes. Distributions of v.s. and v.s. for the two real data applications are presented in Supplementary Materials Figures S.4 and S.5, we can see from Figures S.4 and S.5 that the linear and quadratic models capture the trends on how and changes with .

### 2.7 The mixture components

For scCAS data, we apply *f*_{1}(*x*) = 0, *f*_{0}(*x*) = 1 if *x* = 0 and *f*_{1}(*x*) = 1, *f*_{0}(*x*) = 0 if *x >* 0, due to the sparsity of the data matrix.

For scRNA-Seq data, we first normalize read counts to TPM (transcripts per million) or FPKM (fragments per kilobase of exon model per million reads mapped) to account for sequencing depth and gene length, then fit a two-component gamma mixture model for the nonzero entries, through pooling ln(TPM+1) or ln(FPKM+1) over all the samples, and then the remaining zero entries are merged with the mixture component that has a smaller mean. The log transformation takes into account the very large values in the data matrix.

sc-methylation data represents the proportion of methylated sites within a given genomic interval, where the entries in the data matrix take values between 0 and 1. In the two real data applications, majority of entries in the data matrix take small values, for the methylation data in each cell, we first divide the entries by (1 — entries) to map them into [0, ∞). We then normalize the entries by dividing the median of non-zero entries in each cell, and then take square of the entries to boost the signals. This transformation helps to align the three modalities and it improves the clustering results significantly (Supplementary Materials Tables S.5 and S.6). Because the transformed entries represent the relative evidence of the methylation status, we input the transformed entries directly as the ratio in the EM algorithm. Histograms for the distributions of the sc-methylation data are presented in the Supplementary Materials Figure S.1.

### 2.8 Feature selection

scRNA-Seq data is usually the least noisy data type, compared with scCAS and sc-methylation data. We use scRNA-Seq data for feature selection before implementing scAMACE. We first cluster scRNA-Seq data with SC3 and then use the cluster assignments to select top 1,000 features with large mean shift across different clusters. More specifically, denote the data matrix as *X*_{n×p} (*x*_{ij} denotes the observation for the *i*-th cell and *j*-th feature), the cluster assignments as *L*_{n×1} (*l*_{i} = *k* denotes that the *i*-th cell belongs to the *k*-th cluster) and total number of clusters as *K*. For feature *j*, we first calculate the difference between the mean of the cells within one cell type and the mean of cells in other cell types; the differences are represented as ** D**(

*j*) = (

*d*

_{1j}, …,

*d*

_{Kj}), where . We take the maximum entry in

*D*(

*j*):

*m*(

*j*) = max

_{k}

**(**

*D**j*). When

*m*(

*j*) is large, it represents that feature

*j*has high expression in one cluster, compared with all other clusters. Finally, we select the top 1,000 features with highest values in

*m*(

*j*).

### 2.9 Determination of the number of clusters *K*

We determine the number of clusters *K* for the three single-cell datasets seperately before we apply scAMACE. We first run K-Means for each *K* and calculate the average silhouette width of observations (Kaufman and Rousseeuw, 1990). Silhouette width measures how well an observation has been classified. For each observation *i*, the silhouette value *s*(*i*) is calculated as follows. First denote by *A* the cluster to which observation *i* has been assigned and then calculate

*a*(*i*) = average Euclidean distance of *i* to all other objects of *A*.

Now consider any cluster *C* different from *A* and define

*d*(*i, C*) = average Euclidean distance of *i* to all objects of *C*.

Then . When cluster *A* contains only a single observation, we simply set *s*(*i*) = 0. The average of *s*(*i*) for *i* = 1, 2, …, *n* is denoted by , and it is called the average silhouette width for the entire data set. is used for the selection of *K*. Higher value in indicates better clustering outcome. We select *K* that has the maximum average silhouette width. Details for selecting *K* in the two real data applications are presented in Supplementary Materials Figures S.2 and S.3. When the similarity of the cell types is high, the Silhoutte method may choose a smaller *K* than the number of cell types (Figure S.3), and we may choose a larger *K* instead.

## 3 Statistical inference: EM algorithm

Given the observed scCAS data ** X**, scRNA-Seq data

**, and sc-methylation data**

*Y***, we treat the latent variables**

*T***Γ**= {

**,**

*Z, U***,**

*O, V***} as missing data, and use the Expectation-Maximization (EM) algorithm to estimate the parameters**

*M***Φ**={

*ψ*^{acc},

*ω*^{acc},

*π*_{i},

*ψ*^{rna},

*ω*^{rna},

*π*_{l},

*ψ*^{met},

*ω*^{met},

*π*_{d}}. The Q-function is

**(**

*Q***Φ**|

**Φ**

_{old}) = 𝔼

_{old}(

*ln*(

**(**

*P***Φ, Γ**|

*obs*.))), where the expectation is over

**Γ**under distribution

**(**

*P***Γ**|

**Φ**

^{old},

*obs*.).

In the M-step, we maximize ** Q**(

**Φ**|

**Φ**

_{old}) with respect to

**Φ**and update parameters as follows.

We use grid search to update because its optimal value does not have an explicit form.

We iterate between E-step and M-step until converge. 𝔼 (*Z*_{i.}), 𝔼 (*Z*_{l.}) and 𝔼 (*Z*_{d.}) in the last iteration are used for clustering. Details for the derivations are presented in the Supplementary Materials.

## 4 Simulation studies

To validate scAMACE, we generated three different types of simulated data ** x, y** and

**following the model assumption. In the simulated data, the sample sizes**

*t**n*

_{x}= 900,

*n*

_{y}= 1100, and

*n*

_{t}= 1000. The number of features

*p*= 1000. The number of clusters

*K*

_{x}=

*K*

_{y}=

*K*

_{t}= 3, and ., and

*ϕ*

^{t}= 10. The detailed simulation scheme is presented in the Supplementary Materials.

For the first data type, x, we set *f*_{1}(*x*) = 0 if *x* = 0, and *f*_{0}(*x*) = 0 if *x* = 1. We fit a two-component gamma mixture model for ** y** using ‘gammamixEM’ in R (Young

*et al*., 2019) and beta mixture model for

**using ‘betamix’ in R (Cribari-Neto and Zeileis, 2010; Grun**

*t**et al*., 2012) to estimate the mixture densities. We apply the method in Section 2.6 to estimate parameters in and . We then implement scAMACE using the estimated densities and .

We use purity, rand index, adjusted rand index and normalized mutual information to evaluate the clustering results. We implement scAMACE either on the three data types seperately (‘scAMACE (seperate)’) without borrowing information or jointly (‘scAMACE (joint)’). Table 1 presents the simulation results. We also compared scAMACE with other exsiting methods under four additional simulation schemes: imbalanced datasets where the number of cells varies across the three datasets (Supplementary Materials Table S.1), unequal numbers of clusters in the three datasets (Supplementary Materials Table S.2), imbalanced cluster sizes (Supplementary Materials Table S.3) and smaller number of features (Supplementary Materials Table S.4). scAMACE performs the best compared with the other methods in the above simulation settings. This is likely due to integration of information from all three data sets.

In the following two real data applications, we apply methods mentioned in Section 2.7 instead of fitting a beta mixture model to sc-methylation data.

## 5 Application to real data

### 5.1 Application 1: K562 and GM12878 scRNA-Seq, scATAC-Seq and sc-methylation data

We evaluate scAMACE by jointly clustering scRNA-Seq, scATAC-Seq and sc-methylation data generated from two cell types, K562 and GM12878 (Buenrostro *et al*., 2015; Li *et al*., 2017; Pott, 2017). We set *K* = 2, and use the true cell labels as a benchmark to evaluate the performance of the clustering methods. Table S.5 presents the clustering results. scAMACE performs well in seperating the cell types. scRNA-Seq is perfectly seperated, while there are only three cells that are not classified correctly in the sc-methylation dataset and eleven misclassifications in the scATAC-Seq dataset. In addition, the two cell types are correctly matched across the three datasets. Compared with the clustering results given by implementing scAMACE seperately on the three datasets, jointly clustering the three datasets improves the overall clustering performance, especially for scATAC-Seq data, which is likely due to the integration of information across the three datasets.

We compared scAMACE with Seurat V3 (Stuart *et al*., 2019), LIGER (Welch *et al*., 2019) and scMC (Zhang and Nie, 2021), which are methods for integrative analysis of single-cell data. Examples were presented in Seurat V3 (Stuart *et al*., 2019) where scRNA-Seq and scATAC-Seq data were integrated. So we implemented Seurat V3 to integrate these two data types. Seurat V3 did not perform well for scATAC-Seq data (Table S.5). Seurat V3 is not applicable to integrate sc-methylation data with the other two datasets. Examples were presented in LIGER (Welch *et al*., 2019) where scRNA-Seq data and sc-methylation data were integrated. So we implemented LIGER to integrate these two data types. LIGER did not perform well on sc-methylation data (Table S.5). We also implemented LIGER to integrate all three datasets, and LIGER still did not perform well on sc-methylation data (Supplementary Materials Table S.7 and S.8), this may be due to the small sample size in sc-methylation data. scMC (Zhang and Nie, 2021) was developed for the integrative analysis of multiple single-cell datasets with the same data type. Since the features in scATAC-Seq data, scRNA-Seq data and sc-methylation data are linked, scMC can be implemented in principle. scMC did not perform well on scATAC-Seq data and sc-methylation data (Supplementary Materials Table S.7 and S.8). This may be due to the fact that the characteristics of different data types are very different, and ignoring the difference leads to suboptimal performance.

### 5.2 Application 2: Mouse neocortex scRNA-Seq, sci-ATAC-Seq and sc-methylation data

In this example, we evaluate scAMACE for the joint analysis of single-cell datasets where the cell types are different across the datasets.

We collected single-cell datasets generated from mouse neocortex. There are five cell types in scRNA-Seq data (Tasic *et al*., 2018), including astrocytes (Astro), glutamatergic neurons in layer 4 (L4), corti-cothalamic glutamatergic neurons in layer 6 (L6 CT), oligodendrocytes (Oligo) and Pvalb+ GABAergic neurons (Pvalb). There are three cell types in sci-ATAC-Seq data (Cusanovich *et al*., 2018b), including astrocytes (Astro), excitatory neurons CPN (Ex. neurons CPN), and oligodendrocytes (Oligo). There are three cell types in sc-methylation dataset (Luo *et al*., 2017), including excitatory neurons in layer 4 (L4), excitatory neurons in layer 6 (labeled as L6-2 in (Luo *et al*., 2017)), and Pvalb+ GABAergic neurons (Pvalb). In the three datasets, the optimal numbers of clusters chosen by the Silhoutte method, , tend to be smaller than the numbers of cell types, which is likely due to the similarity of the neuronal subtypes. We set K=5 when we implement scAMACE, instead of the value given by the Silhoutte method. The true cell labels are used as a benchmark for evaluating the performance of the clustering methods.

The clustering results are presented in Table S.6. Even though *K* is larger than the number of cell types in sci-ATAC-Seq data and sc-methylation data, scAMACE still determines the correct number of cell types in sci-ATAC-Seq data. Although the cells in sc-methylation data fall into four clusters, there are only seven cells in cluster 4. Cell types in all three datasets are well seperated. Astrocytes and oligodendrocytes are matched across scRNA-Seq data and sci-ATAC-Seq data. Excitatory neurons CPN in sci-ATAC-Seq data are matched with glutamatergic neurons in layer 4 in the scRNA-Seq data. We note that most excitatory neurons are glutamatergic neurons. Excitatory neurons in layers 4 and 6, and Pvalb+ GABAergic neurons are matched between scRNA-Seq data and sc-methylation data.

Compared with implementing scAMACE on the three datasets separately, the joint analysis leads to improvement in clustering, especially for sc-methylation dataset. This is likely because the joint model borrows information across the three datasets. Similar to application 1, we implemented Seurat V3 to integrate scRNA-Seq and sci-ATAC-Seq data. Seurat V3 (Stuart *et al*., 2019) does not perform well on sci-ATAC-Seq data (Table S.6). We implemented LIGER (Welch *et al*., 2019) to integrate scRNA-Seq and sc-methylation data. LIGER does not seperate excitatory neurons in layer 4 and layer 6 in sc-methylation data (Table S.6). We also integrated all three datasets by LIGER (Welch *et al*., 2019) and scMC (Zhang and Nie, 2021). LIGER and scMC did not perform well (Supplementary Materials Table S.9 and S.10). Overall, scAMACE performed the best compared with the other methods.

### 5.3 Computational cost

LIGER, Seurat V3 and scMC only provide the versions that are implemented on CPU, while scAMACE can be implemented on both CPU and GPU. We summarized the computational time for scAMACE (CPU version and GPU version in python), LIGER (Welch *et al*., 2019), Seurat V3 (Stuart *et al*., 2019) and scMC (Zhang and Nie, 2021) (Supplementary Materials Tables S.11, S.12 and S.13). We implemented scAMACE, LIGER and scMC to cluster the three types of data simultaneously, and we implemented Seurat V3 to cluster scCAS data and scRNA-Seq data.

On real data application 2 (∼ 8,000 cells), the computational time for scAMACE are 418.858 seconds on one 3.4GHz Intel Xeon Gold CPU and 69.652 seconds on one 3.1GHz Dual Intel Xeon Gold GPU. Compared with LIGER (80.389 seconds on one 3.4GHz Intel Xeon Gold CPU), scMC (372.323 seconds on one 3.4GHz Intel Xeon Gold CPU) and Seurat V3 (116.688 seconds for scRNA-Seq and sci-ATAC-Seq data on one 3.4GHz Intel Xeon Gold CPU), scAMACE has competitive computational speed, especially the GPU version.

Next, we generated a dataset with sample size=30,000 (*n*_{acc} = *n*_{rna} = *n*_{met} = 10, 000) by sampling the cells with replacement from real data application 2. The computational time for scAMACE are 1534.631 seconds on one 3.4GHz Intel Xeon Gold CPU and 250.089 seconds on one 3.1GHz Dual Intel Xeon Gold GPU. Compared with LIGER (555.574 seconds on one 3.4GHz Intel Xeon Gold CPU), scMC (3667.878 seconds on one 3.4GHz Intel Xeon Gold CPU) and Seurat V3 (290.640 seconds for scRNA-Seq and sci-ATAC-Seq data on one 3.4GHz Intel Xeon Gold CPU), scAMACE has competitive computational speed on datasets with larger scale.

## 6 Conclusion

Unsupervised methods including dimension reduction and clustering are essential to the analysis of single-cell genomic data as the cell types are usually unknown. We have developed scAMACE, a model-based approach for integratively clustering single-cell data on chromatin accessibility, gene expression and methylation. scAMACE provides statistical inference of cluster assignments and achieves better cell type seperation combining biological information across different types of genomic features. In the two real data applications, the scRNA-Seq data are generated from the SMART-Seq platform (Li *et al*., 2017; Tasic *et al*., 2018). To implement scAMACE on UMI-based scRNA-Seq data (10x data), we may need to modify the distributions of the mixture components *g*_{0}(·) and *g*_{1}(·). The cells in our real data examples are differentiated and mature cells. In the future, we will investigate the performance of scAMACE on immature cells undergoing differentiation.

## Funding

This work has been supported by the Chinese University of Hong Kong direct grants (4053360, 4053423), the Chinese University of Hong Kong startup grant (4930181), the Chinese University of Hong Kong’s Project Impact Enhancement Fund (PIEF) and Science Faculty’s Collaborative Research Impact Match-ing Scheme (CRIMS), and Hong Kong Research Grant Council (ECS 24301419, GRF 14301120).

## SUPPLEMENTARY MATERIALS

### S.1 Supplementary Text

#### S.1.1 Joint likelihood

scCAS data

scRNA-Seq data

sc-methylation data

#### S.1.2 Q-function

Let **Γ** denote the missing data, and let **Φ** denote the parameters. the Q-function is ** Q**(

**Φ**|

**Φ**

_{old}) = 𝔼

_{old}(

*ln*(

**(**

*P***Φ, Γ**|

*obs*.))), where the expectation is over

**Γ**under distribution

**(**

*P***Γ**|

**Φ**

^{old},

*obs*.) :=

*P*_{old}(

**Γ**). where , and

**is a constant that does not depend on the parameters.**

*C*#### S.1.3 Expectations in E-Step

scCAS data

scRNA-Seq data

sc-methylation data

#### S.1.4 Simulation scheme

We generated three different types of simulated data ** x, y** and

**following the model assumption. In the simulated data, the sample sizes**

*t**n*

_{x}= 900,

*n*

_{y}= 1100, and

*n*

_{t}= 1000. The number of features

*p*= 1000. The numbers of clusters

*K*

_{x}=

*K*

_{y}=

*K*

_{t}= 3. ,

*Φ*

^{x}= 10 and

*Φ*

^{t}= 10. The followings are the simulation scheme:

##### A. Generate *ω*^{y}

*ω*

For *g* = 1, …, 150:

We set *ω* = 0.8.

For *g* = 151, …, 1000:

To summarize, we set the first 150 features to be differential and for the remaining 151, …, *p* features, we set to be the same across different clusters *k*.

##### B. Generate *ω*^{x} and *ω*^{t}

*ω*

*ω*

For *g* = 1, …, 150:

For *g* = 151, …, 1000:

For *g* = 1, …, 150:

For *g* = 151, …, 1000:

##### C. Generate *z*^{x}, *z*^{y} and *z*^{t}

*z*

*z*

*z*

The cluster labels are generated with equal probability, .

##### D. Data type 1: *x*

*x*

Generate

*u*^{x}. We generate*u*_{ig}from*Bernoulli*if*z*_{ik}= 1.Generate

*o*^{x}. We generate*o*_{ig}from*Bernoulli*(*π*_{i1}) if*u*_{ig}= 1, and set*o*_{ig}= 0 if*u*_{ig}= 0. We set*π*_{i1}= 0.2 for*i*= 1, · · ·,*n*_{x}.Generate

. We generate*x**x*_{ig}= 1 if*o*_{ig}= 1, and generate*x*_{ig}= 0 if*o*_{ig}= 0.

##### E. Data type 2: *y*

*y*

Generate

*u*^{y}. We generate*u*_{lg}from*Bernoulli*if*z*_{lk}= 1.Generate

*v*^{y}. We generate*v*_{lg}from*Bernoulli*(*π*_{l1}) if*u*_{lg}= 1, and from*Bernoulli*(*π*_{l0}) if*u*_{lg}= 0. We set*π*_{l1}= 0.7,*π*_{l0}= 0.3 for*l*= 1, · · ·,*n*_{y}.Generate

. We generate*y**y*_{lg}from*Gamma*(*shape*= 7,*scale*= 0.5) if*v*_{lg}= 1, and generate*y*_{lg}from*Gamma*(*shape*= 1,*scale*= 1) if*v*_{lg}= 0.

##### F. Data type 3: *t*

Generate

*u*^{t}. We generate*u*_{dg}from*Bernoulli*if*z*_{dk}= 1.Generate

*m*^{t}. We generate*m*_{dg}from*Bernoulli*(*π*_{d1}) if*u*_{dg}= 1, and from*Bernoulli*(*π*_{d0}) if*u*_{dg}= 0. We set*π*_{d1}= 0.4,*π*_{d0}= 0.7 for*d*= 1, · · ·,*n*_{t}.Generate

. We generate*t**t*_{dg}from*Beta*(*α*= 0.5,*β*= 0.5) if*m*_{dg}= 1, and generate*t*_{dg}from*Beta*(*α*= 1,*β*= 10) if*m*_{dg}= 0.

We set different parameter values for the four additional simulation settings mentioned in Section 4 as following:

Simulation setting 1: imbalanced dataset, where the numbers of cells,

*n*_{x},*n*_{y}, and*n*_{t}are different across modalities (Table S.1). Data is generated as described above, but we set*n*_{x}= 1000,*n*_{y}= 2000, and*n*_{t}= 500.Simulation setting 2: unequal number of clusters, where the numbers of clusters in the three modalities,

*K*_{x},*K*_{y},*K*_{t}are different (Table S.2). Data is generated as described above, but we applied following scheme to generate*ω*^{y}and*ψ*_{x},*ψ*_{y},*ψ*_{t}so that*K*_{x}= 3,*K*_{y}= 7,*K*_{t}= 5.

Generate

*ω*^{y}For

*g*= 1, …, 350:For

*g*= 351, …, 1000:Generate

*ω*^{x}and*ω*^{t}.For

*g*= 1, …, 350:For

*g*= 351, …, 1000:For

*g*= 1, …, 350:For

*g*= 351, …, 1000:Generate

*z*^{x},*z*^{y}and*z*^{t}. The cluster labels are generated by , and so that*K*_{x}= 3,*K*_{y}= 7,*K*_{t}= 5.

The remaining steps are the same as described above.

(3) Simulation setting 3: imbalanced cluster sizes, where the proportions of different cell types in the three modalities,

*ψ*_{x},*ψ*_{y},*ψ*_{t}are different (Table S.3). Data is generated as described above, but we set*ψ*_{x}= (0.3, 0.1, 0.6),*ψ*_{y}= (0.6, 0.3, 0.1),*ψ*_{t}= (0.6, 0.1, 0.3). There are rare cell types (10% of the cells) in the three modalities.(4) Simulation setting 4: smaller number of features (Table S.4). Data is generated as described above, but we set

*p*= 500.

### S.2 Supplementary Figures

#### S.2.1 Histograms for two real data applications

#### S.2.2 Determination of number of clusters *K* for real data applications

We applied the Silhouette method (Kaufman and Rousseeuw, 1990) mentioned in Section 2.9 on the two real data applications to determine *K* before we apply scAMACE. The result for real application 1 is presented in Figure S.2: is chosen for the three single-cell datasets, where the true number of cell types is 2. The results for real data application 2 are presented in Figure S.3. There are five cell types in scRNA-Seq data (Tasic *et al*., 2018), including astrocytes, oligodendrocytes, and three subtypes of neurons. There are three cell types in sci-ATAC-Seq data (Cusanovich *et al*., 2018b), including astrocytes, oligodendrocytes, and excitatory neurons CPN. And there are three cell types in sc-methylation dataset (Luo *et al*., 2017), including three subtypes of neurons. In the three datasets, the optimal numbers of clusters chosen by the Silhoutte method tend to be smaller than the numbers of cell types, which is likely due to the similarity of the neuronal subtypes. We chose *K* = 5 when we implement scAMACE, instead of the suggested by the Silhoutte method.

#### S.2.3 Empirical distribution of *ω*

To assess whether the linear and quadratic models work well, we performed the following steps. We first obtain as described in Section 2.6 by setting *K* = 1 and fitting the model on the three modalities separately. We plotted the distributions of v.s. and v.s. for the two real data applications in the left panels in Figures S.4 and S.5, respectively. For better visualization on how and change with , we plotted the boxplots of and in different ranges of . We estimated {*η, γ, τ, δ, θ, ϕ*^{acc}, *ϕ*^{met}} by beta regression (Silvia and Francisco, 2004) using . Using and , we then generated and by random sampling following the quadratic and linear models. The distributions of the generated and are plotted in the right panels in Figures S.4 and S.5, respectively. In comparison of the generated values (left panels in Figures S.4 and S.5) with the estimated and (right panels in Figures S.4 and S.5), we can see that the linear and quadratic models capture the trends on how and changes with .

### S.3 Supplementary Tables

## Acknowledgements

We would like to thank Jinwen Yang, Wenyu Zhang and Pengcheng Zeng for the helpful discussions.