Data engineering and feature selection
Mining functional annotation databases in conjunction with extensive literature searches allowed us to identify 92 proteins involved in RNA methylation (Supplementary Data 1). These were either methyl-writers (known RNA methyltransferases6 and their partner proteins in protein complexes), or enzymes previously annotated as putative RNA methyltransferases (see Methods). Genes encoding for these proteins constituted our positive class (Class 1) in machine learning analyses. To frame our predictive modelling as a binary classification problem, we assembled multiple stratified training and test datasets by randomly sampling a number of genes equal to our positive set from the remaining genome, ensuring that all genes of our initial dataset were sampled exactly once (Fig. 1). Our rationale was that this would allow machine learning models to be trained and tested across a diverse range of other gene functions, instead of just choosing one function for the negative set. In addition, this approach alleviates any putative bias that may arise from sampling a single negative set of genes from the human genome.
We initially pooled 50,176 features collected from publicly available and previously curated transcriptomic, proteomic, functional annotation, structural and physical interaction datasets (Supplementary Data 2). To identify features that were informative for classification and thereby useful for predicting genes associated with RNA methylation, we performed feature selection prior to model training, followed by feature ranking after training and cross-validation. To reduce the feature-to-sample ratio, first we eliminated features with excessive missing data in the training dataset. Second, we removed features with low variance, which resulted in a drastic dimensionality reduction to 1,505 features for the final dataset. Selected features used for classification were drawn from BioGPS16 (35), Gene Ontology17 (GO: 59), GTEx18 (1,114), Human Protein Atlas19 (HPA: 107), InterPro (1), Pathway Commons (PathCommons: 150) and TISSUES20 (40) datasets.
During model training and cross-validation, we computed feature importance by using the GB importance measure as averaged across all training sets. The 50 most informative features and their relative importance in classification are shown in Supplementary Fig. 1. The features with the highest importance for the full feature set were mainly GO terms, such as GO:0032259, GO:0016740, GO:0003723, GO:0008168 and GO:0016070, all corresponding to methylation, transferase/methyltransferase activity and RNA metabolic processes. Equally, the InterPro domain IPR029063, which represents the S-adenosyl-L-methionine-dependent methyltransferase superfamily was ranked among the top 50 most informative features (Supplementary Fig. 1a). Although anticipated, the fact that the classifiers seemed to rely on RNA and methylation-related annotation features provides support that the models learn to classify genes with a strong link to RNA methylation processes.
Although GO annotations are informative, they may equally bias gene prediction towards pre-existing functional annotations. We assembled thus a second feature set of reduced dimensionality, by excluding GO and InterPro data types. When classifiers were trained on this reduced feature set, the most informative types of features were mainly GTEx expression profiles (Supplementary Fig. 1b). The GTEx project aims to provide a comprehensive public resource of tissue-specific gene expression and regulation, so far including samples from 54 non-diseased tissues across nearly 1000 individuals18. Tissue sample expression data as integrated into Harmonizome and thus sampled here, consist of one-hot-encoded sets of genes with high or low expression in each tissue sample relative to other tissue samples from the GTEx tissue expression profiles dataset.
A possible interpretation of the high ranking of such GTEx expression profile features is that under specific biological conditions, i.e., in certain tissues, RNA methylation genes tend to be collectively down- or up-regulated as compared to other processes. Alternatively, a high ranking of GTEx features may be due to the high proportion of GTEx features in the feature set and noise originating from the high dimensionality of the training dataset with respect to the feature-to-sample ratio. To investigate this further, we calculated the relative frequency of GTEx features in the top hundred most informative features across models from all training sets (Supplementary Data 3). Notably, certain samples taken from the areas of blood, heart, pancreas, and brain were retrieved as informative by more than a hundred models.
We selected five machine learning classifiers (LR, GNB, SVM, RF and GB) and trained each on training sets from the full and the reduced feature set, creating an ensemble of models per classifier and feature set. Overall, all five model ensembles showed very similar performance based on cross-validation (Table 1). Among classifiers trained using the full feature set, GB and RF models showed the highest average accuracy at 0.875 and 0.870, respectively, as well as a similarly high average precision of 0.895 and 0.870, respectively. The GB ensemble followed by that the RF models also yielded the highest AUROC score, with an average AUC estimated at 0.938 and 0.937, respectively.
The performance of the five classifiers for the reduced feature set without GO/InterPro annotations was diminished compared to the full dataset (Table 1). The model ensembles of SVM and RF outperformed the remaining three ensembles across almost all metrics. SVM models performed the best on the reduced feature set based on cross-validation, with an average prediction accuracy of 0.812, precision of 0.822 and AUROC of 0.864.
To evaluate results from different models and feature sets, we first compared the distribution of probability scores for all human genes, as predicted by models trained on the full feature set (Fig. 2a) and models derived from the reduced feature set (Fig. 2b). The prediction landscape appeared very similar across all five types of model ensembles, as exemplified by the extensive overlap of their respective distributions. Most genes were highly skewed towards an average probability score of zero, in line with the hypothesis that the majority of human genes are not expected to be directly involved in RNA methylation pathways. Note, however, that GNB models showed an aberrant prediction profile, with a notably higher peak near probability one, compared to that of all other classifiers.
Second, we assessed the degree of overlap among genes predicted with high confidence to be involved in RNA methylation by different machine learning models. Here, we defined as high confidence all genes in the top 1% of the probability distribution for Class 1. For most classifiers this fraction encompassed ~270 genes, except from GNB models that ambiguously assigned the same high probability score to a large number of genes, hence a larger set was considered (> 1,750 genes). When comparing top predictions from different models, their relative concordance was high for both feature sets (Supplementary Fig. 2). We identified over 280 genes that were commonly predicted to be involved in RNA methylation pathways, by at least three out of five types of model ensembles at the selected confidence cut-off (Supplementary Fig. 2a, b).
Finally, to get a high-level understanding of the predictions from different models, we performed exploratory GO enrichment analyses using the same high confidence genes as above. The top 10 enriched terms for each machine learning model are compared in Fig. 3. All model ensembles, independently of the dataset they derived from, yielded predictions enriched in GO terms associated with RNA-binding and RNA catalytic activities. Note that top enrichment results for models trained on the full feature set included additionally terms associated with chromatin and protein methylation processes (Fig. 3). Of models trained on the reduced feature set, only GNB and SVM showed an enrichment in protein methylation. This may indicate a modelling artefact, i.e., predictions erroneously assigned to Class 1, that could be caused by the hierarchical nature of GO terms (e.g., “methylation” being the parent term of both “RNA methylation” and “protein methylation” processes). Yet, all models with the exception of GB predicted genes that were previously associated with a catalytic activity on DNA. Therefore, an alternative interpretation is that our models capture a putative functional link between modification pathways operating on different substrates. Overall, the functional annotation analyses provided a good qualitative control for model performance. The rationale here is that although we did not recover enrichment in the biological term “RNA methylation” per se (given that the models predict “novel” genes), features closely associated with the term should figure among the top GO results. A breakdown of the GO enrichment results by feature set is provided in Supplementary Fig. 3.
In silico validation
Of all classifiers, GB models that were trained on the full feature set showed the best performance based on cross-validation, and have thus been selected to apply on previously unseen test data. Model performance metrics for the stratified test datasets were calculated by averaging the values obtained for each model in the ensemble. The average test set accuracy for the GB ensemble was 0.905, precision 0.897, recall 0.923, and AUCROC 0.973.
It is known, however, that a high feature-to-sample ratio may lead to overfitting and overestimation of model performance. As our machine learning modelling was based on a small number of positive genes (Class 1), we were particularly interested in obtaining an estimate of the false positives in our predictions. To this goal, we pooled together the entire hold-out data (18 positive and 5,368 negative examples) and averaged the predicted probability score of each gene as estimated by each model in the GB ensemble. This allowed us to estimate the false positives, by counting the number of negative genes (Class 0) that were wrongfully predicted by our models as positive. Of the 5368 negative genes, 425 were erroneously classified, resulting to an estimated false positive rate of 0.079, defined as the number of False Positives (FP) divided by the sum of False Positives (FP) and True Negatives (TN).
With regards to novel predictions, we first selected the top hundred genes predicted by the GB models to associate with RNA methylation pathways as candidates for further validation (Supplementary Data 4). To evaluate these predictions with respect to previously known RNA methylation genes, we first performed a hierarchical clustering analysis of predicted plus positive (Class 1) genes based on the machine learning data used here (Supplementary Fig. 4). As anticipated, known and predicted genes were well clustered together, with no evident split between known and predicted RNA methylation genes. Note, however, that an unsupervised clustering approach of all human genes based on the same features used in our supervised modelling analyses was not sufficient on its own for identifying novel genes involved in RNA methylation pathways, as positive genes did not group together in a single cluster (Supplementary Fig. 5).
Second, we interrogated the STRING database21 for independent Protein–Protein Interaction (PPI) information on known RNA methylation genes and other genes of the human genome. We built a PPI network based on interactions with a confidence score of 400 or above, and performed Random Walks starting from proteins known to mediate methylation of RNAs (Class 1). This allowed us to weigh all other proteins in the network and rank them by their importance relative to our positive gene set. To evaluate whether genes predicted by our models were highly ranked among important interactors, we performed Gene Set Enrichment Analysis (GSEA) using the PageRank score as an input. We obtained a strong positive enrichment (NES = 1.605, P = 0.0001) for the model predictions (Supplementary Data 5), corroborating their close functional association with RNA methylation pathways based on independent PPI evidence (Fig. 4).
Insights into the role of new predictions
To gain functional insights into the role of newly predicted genes with regards to previously annotated RNA methyltransferases and associated proteins, we interrogated the STRING database for available PPI data connecting our model predictions to known RNA methylation genes. Our search unravelled a dense network of interactions (Fig. 5a), comprising 2,450 edges (confidence ≥ 400). To further dissect these PPI data and identify subgroups of proteins associated with specific pathways, we employed the Louvain method of community detection22. We identified six communities in total (Fig. 5b), which we annotated using a large collection of functional annotation resources23.
Community 1 (C1, Fig. 5b) groups most RNA methylation genes from the positive set, together with 10 model predictions: CTU2, FARS2, HEMK1, KARS, MOCS3, MTO1, N6AMT1, PUS1, PUS3 and TRNT1. Functional analysis of community members showed that proteins comprising this sub-network are significantly enriched in the functions of tRNA modification (GO:0006400, P = 5.09E−70), tRNA methylation (GO:0030488, P = 6.31E−66), and tRNA processing (Reactome R-HSA-72306, P = 4.10E−45). Indeed, four predictions in the cluster, CTU2, MOCS3, PUS1 and PUS3, are RNA modifying enzymes mediating tRNA modifications. CTU2 and MOCS3 are involved in 2-thiolation of mcm5S2U at wobble positions of tRNAs, whereas PUS1 and PUS3 belong to the tRNA pseudouridine synthase TruA family and mediate the formation of pseudouridine at positions 27/28 and 38/39 of certain tRNAs, respectively13. Among other members of the same community, the gene TRNT1 encodes the mitochondrial CCA tRNA nucleotidyltransferase 1 responsible for the addition of the conserved 3′-CCA sequence to tRNAs. It has been previously reported that the presence of the 3′-CCA tail on tRNA is required for target recognition by the tRNA methyltransferase NSUN624, which could underlie the functional link of TRNT1 with RNA methylation genes in our analyses.
Likewise, two aminoacyl-tRNA synthetases, FARS2 and KARS, were also predicted to be closely associated with RNA methylation pathways and were part of Community 1. FARS2 is a mitochondrial Phenylalanine-tRNA ligase, responsible for the charging of tRNA(Phe) with phenylalanine. KARS encodes a Lysyl-tRNA ligase. Although, we have not found any orthogonal evidence linking FARS2 to RNA methylation, KARS has been previously inferred to physically interact with the RNA methyltransferase TRMT1, based on co-fractionation data (source BioGRID25).
The same sub-network also included two HemK methyltransferases, HEMK1 and N6AMT1. The former is a N5-glutamine methyltransferase responsible for the methylation of the glutamine residue in the GGQ motif of the mitochondrial translation release factor MTRF1L26. N6AMT1 methylates the eukaryotic translation termination factor 1 (eRF1) on Gln-185. Notably, it has been reported that N6AMT1 forms the catalytic subunit of a heterodimer with the RNA methyltransferase TRMT11227, suggestive of a functional interplay between RNA methylation and post-translational modifications of translation factors.
Our models also predicted that MTO1 is a gene functionally associated with RNA methylation pathways. Previous studies have shown that MTO1 encodes for a mitochondrial protein which is indeed involved in the 5-carboxymethylaminomethyl modification (mnm5s2U34) of the wobble uridine base in mitochondrial tRNAs, with a crucial role in translation fidelity28.
Community 2 (C2, Fig. 5b) consists mainly of newly predicted genes, associated with four genes from the positive set: C7orf60, HENMT1, RRNAD1 and RSAD1. The gene C7orf60 or BMT2 encodes a probable S-adenosyl-L-methionine-dependent methyltransferase. Recent studies have suggested that BMT2 (also known as SAMTOR) acts as an inhibitor of mTOR complex 1 (mTORC1) signalling in human, a SAM sensor signalling methionine sufficiency29. In yeast, BMT2 is responsible for the m1A2142 modification of 25S rRNA30. Two other methyltransferase genes in the same cluster were RRNAD1 and HENMT1. The former encodes for ribosomal RNA adenine dimethylase domain containing 1, but little is known about its function. HENMT1 is a small RNA methyltransferase that adds a 2′-O-methyl group at the 3′-end of piRNAs, contributing to the maintenance of Transposable Element (TE) repression in adult germ cells31. Functional annotation of this community indicated an enrichment in peptidyl-lysine methylation function (GO:0018022, P = 1.92E−06), albeit this was based on only four proteins out the 23 forming this cluster (SETD4, VCPKMT, METTL21A, and METTL18). Among members of this community, we identified proteins with a role in methylation of other substrates. For example, FAM86A catalyses the trimethylation of the elongation factor 2 (eEF2) at Lys-52532. METTL13 is also a methyltransferase responsible for the dual post-translational methylation of the elongation factor 1-alpha (eEF1A) at two positions (Gly-2 and Lys-55), modulating mRNA translation in a codon-specific manner33. Both genes are involved in modifying translation elongation factor residues, same as N6AMT1 mentioned above. Our results hence suggest that post-translational modifications of translation factors and epitranscriptomic changes on RNAs could be interconnected in modulating translational efficiency.
Community 3 (C3, Fig. 5b) comprises 48 protein members, of which 10 are part of our positive set and 38 were predicted by the models. Overall, we found a strong enrichment for functional terms linked to ncRNA processing (GO:0034470, P = 6.79E−40) and rRNA processing (R-HSA-72312, P = 1.03E−39). This finding is consistent with previous computational approaches aiming to predict the functional role of m6A modification sites, which have independently shown a strong connection between RNA methylation and RNA processing34. Among Community 3 members for instance, our predictions include five genes encoding for members of the nuclear RNA exosome, DIS3, EXOSC2, EXOSC5, EXOSC8 and EXOSC9. The exosome is known to participate in a wide variety of cellular RNA processing and degradation events preventing nuclear export and/or translation of aberrant RNAs. Exosome function is thus likely to be interlinked with epitranscriptomic marks on RNAs.
We also identified a sub-cluster within the community connecting DIMT1, EMG1, FBL and NOP2 with 15 proteins predicted by our models. All members of the sub-cluster are RNA-binding proteins involved in rRNA modification in the nucleus (R-HSA-6790901, P = 5.44E−36). EMG1 encodes for an RNA methyltransferase that methylates pseudouridine at position 1248 in 18S rRNA35. Pathway annotation data further suggest that EMG1 together with eight new predictions (CIRH1A, DCAF13, HEATR1, NOL11, UTP3, UTP6, UTP20 and WDR3) are required in pre-18S rRNA processing and ribosome biogenesis. Of these, the NOL11 gene encodes a nucleolar protein contributing to pre-rRNA transcription and processing36. Partial evidence furthermore suggests that NOL11 interacts with the rRNA 2′-O-methyltransferase fibrillarin, FBL, which is involved in pre-rRNA processing by catalysing the site-specific 2′-hydroxyl methylation of pre-ribosomal RNAs36. FBL together with RRP9 and NOP56 are part of the box C/D RNP complex catalysing the ribose-2′-O-methylation of target RNAs.
Finally, three novel gene predictions within this community, DPH5, TPMT and RRP8, were previously reported to have SAM-dependent methyltransferase activity. DPH5 is coding for a methyltransferase that catalyses the trimethylation of the eEF2 as part of the diphthamide biosynthesis pathway, whereas TPMT encodes an enzyme that metabolises thiopurine drugs. We cannot rule out that these may be false positives cases, i.e., erroneous predictions that stem from the presence of the SAM-binding domain in the protein. Yet genes mediating post-translational modifications were repeatedly classified as components of RNA methylation pathways by our machine learning models (e.g., FAM86A in Community 2). A noteworthy case is RRP8, which in human is reported to bind to H3K9me2 and to probably act as a methyltransferase, yet studies in yeast have shown that the RRP8 homologue is responsible for installing m1A in the peptidyl transfer centre of the ribosome (m1A645 in 25S)37.
Community 4 (C4, Fig. 5b) constitutes a large cluster of 42 proteins. Functional analysis of the group indicates that most community members are chromatin modifying enzymes (R-HSA-3247509, P = 8.74E−29), or are associated in general with chromatin organisation (R-HSA-4839726, P = 8.74E−29) and histone modification (WP2369, P = 1.08E−23). Previously known RNA methylation genes in this community were mainly involved in RNA-capping pathways, e.g., RNMT, CMTR1, CMTR2, FAM103A1, TGS1 and RNGTT. Recent studies have suggested that there is indeed extensive crosstalk between RNA modifications and epigenetic mechanisms of gene regulation7,38,39.
Community 5 (C5) and Community 6 (C6) encompass fewer members than the other communities. Community 5 consists of 10 proteins creating a small sub-network of RNA methyltransferases and partner proteins involved in RNA methylation (GO:0001510, P = 1.91E−17) and mRNA methylation, in particular (GO:0080009, P = 6.26E−16). Notably, this community captures proteins involved in the m6A pathway, including the m6A writer complex of METTL3-METTL14 with co-factor WTAP, METTL16 and ZC3H13, as well as the m6Am writer METTL440. Community 6 is the smallest of all communities with only four protein members, two previously annotated RNA methylation genes, HSD17B10 and KIAA0391, and two predicted genes POP1 and POP4. Functional analysis suggests that all four proteins contribute to tRNA processing (R-HSA-72306, P = 5.97E−09) and three of them are involved in tRNA 5′-end processing (GO:0099116, P = 5.32E−08). The HSD17B10 gene encodes the 3-hydroxyacyl-CoA dehydrogenase type-2, which is involved in mitochondrial fatty acid beta-oxidation. HSD17B10 is involved in tRNA processing as it also forms a subcomplex of the mitochondrial ribonuclease P together with TRMT10C/MRPP141. This subcomplex, named MRPP1-MRPP2, catalyses the formation of N1-methylguanine and N1-methyladenine at position 9 (m1G9 and m1A9, respectively) in tRNAs. KIAA0391, also known as PRORP, encodes a catalytic ribonuclease component of mitochondrial ribonuclease P. It appears that POP1 and POP2 are also components of ribonuclease P and contribute to tRNA maturation via 5′-end cleavage.
False positive discoveries
Our machine learning models and analyses have provided a wealth of new information on putative gene networks underpinning RNA methylation in human. However, it is worth noting the limitations of our approach. First, it is uncertain whether employing previous knowledge from functional annotations may have biased model predictions. We addressed this caveat to an extent by using a reduced feature set without annotation features, such as GO terms. When looking at predictions based on models trained on this dataset, machine learning models point to a recurrent theme of this study: that RNA methylation is functionally interconnected to a range of other core cellular functions (Fig. 3 and Supplementary Fig. 3). For example, we found genes encoding chromatin modifiers among the top candidates. The key question here is whether these genes represent false positives, spurred by the hierarchical structure of GO terms or the shared SAM-binding domain. These ambiguous predictions should be subject to further validation, although multiple lines of evidence suggest that this could well be a biologically meaningful result echoing the crosstalk between DNA, RNA and post-transcriptional modification processes.
Second, there is no trivial way to control for false positives. Because only a few writer enzymes are to date known to deposit methyl-marks on RNA6, we started from a very limited number of positive (and by consequence negative) samples to use for machine learning. Even though model performance based on test data was good, the small sample sizes may have hampered how well our models generalise. To better illustrate this limitation, we revisit our model predictions on test data, comprised of 18 positive and 5,368 negative genes in total. When considering the top 25 predictions (an equivalent fraction to the top 100 out of all predictions reported above), the False Positive Rate is very low at 0.002 (Supplementary Fig. 6). Nonetheless, because the number of total positives in our test data is also very low, the False Discovery Rate, defined as the false positives (FP) divided by the sum of false positives (FP) and true positives (TP), at this probability window is 0.56. Our machine learning models thus overpredict genes associated with RNA methylation pathways, where only a small fraction of the human genes plays a role in RNA methylation. To address this important caveat, we sought for independent evidence by mining human PPI data to corroborate that newly predicted genes are indeed associated with RNA methylation pathways.
For the afore-mentioned reasons and as a guidance for the interpretation of our results, for each candidate RNA methylation gene, we provide its predicted probability score across all machine learning models -derived with or without using functional annotation data-, as well as its PageRank score from the PPI network analysis (Supplementary Data 5). A gene with a consistently high probability score across multiple models, along with a high rank in the human PPI network, is less likely to represent a modelling artefact.