We introduce a k-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all k-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation lying in an RNA-seq dataset. This variation is subsequently assigned to biological events such as differential lncRNAs, splice and polyadenylation variants, introns, repeats, editing/mutation events, or exogenous RNA. Applying DE-kupl to different human RNA-seq datasets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.
Quantitatively predicting phenotype variables by the expression changes in a set of candidate genes is of great interest in molecular biology but it is also a challenging task for several reasons. First, the collected biological observations might be heterogeneous and correspond to different biological mechanisms. Secondly, the gene expression variables used to predict the phenotype are potentially highly correlated since genes interact though unknown regulatory networks. In this paper, we present a novel approach designed to predict quantitative trait from transcriptomic data, taking into account the heterogeneity in biological samples and the hidden gene regulatory networks underlying different biological mechanisms. The proposed model performs well on prediction but it is also fully parametric, which facilitates the downstream biological interpretation. The model provides clusters of individuals based on the relation between gene expression data and the phenotype, and also leads to infer a gene regulatory network specific for each cluster of individuals. We perform numerical simulations to demonstrate that our model is competitive with other prediction models, and we demonstrate the predictive performance and the interpretability of our model to predict alcohol sensitivity from transcriptomic data on real data from Drosophila Melanogaster Genetic Reference Panel (DGRP).
Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, we propose a nonasymptotic model selection procedure supported by strong theoretical guarantees based on an oracle type inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariancematrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. Supplementary materials for this article are available online.
This thesis gathers methodologicals contributions to the statistical analysis of next-generation high-throughput transcriptome sequencing data (RNA-seq). RNA-seq data are discrete and the number of samples sequenced is usually small due to the cost of the technology. These two points are the main statistical challenges for modelling RNA-seq data. The first part of the thesis is dedicated to the co-expression analysis of RNA-seq data using model-based clustering. A natural model for discrete RNA-seq data is a Poisson mixture model. However, a Gaussian mixture model in conjunction with a simple transformation applied to the data is a reasonable alternative. We propose to compare the two alternatives using a data-driven criterion to select the model that best fits each dataset. In addition, we present a model selection criterion to take into account external gene annotations. This model selection criterion is not specific to RNA-seq data. It is useful in any co-expression analysis using model-based clustering designed to enrich functional annotation databases. The second part of the thesis is dedicated to network inference using graphical models. The aim of network inference is to detect relationships among genes based on their expression. We propose a network inference model based on a Poisson distribution taking into account the discrete nature and high inter sample variability of RNA-seq data. However, network inference methods require a large number of samples. For Gaussian graphical models, we propose a non-asymptotic approach to detect relevant subsets of genes based on a block-diagonale decomposition of the covariance matrix. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.
In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.
Motivation: RNA sequencing is now widely performed to study differential expression among experimental conditions. As tests are performed on a large number of genes, stringent false-discovery rate control is required at the expense of detection power. Ad hoc filtering techniques are regularly used to moderate this correction by removing genes with low signal, with little attention paid to their impact on downstream analyses. Results: We propose a data-driven method based on the Jaccard similarity index to calculate a filtering threshold for replicated RNA sequencing data. In comparisons with alternative data filters regularly used in practice, we demonstrate the effectiveness of our proposed method to correctly filter lowly expressed genes, leading to increased detection power for moderately to highly expressed genes. Interestingly, this data-driven threshold varies among experiments, highlighting the interest of the method proposed here. Availability: The proposed filtering method is implemented in the R package HTSFilter available on Bioconductor.
Gene network inference from transcriptomic data is an important methodological challenge and a key aspect of systems biology. Although several methods have been proposed to infer networks from microarray data, there is a need for inference methods able to model RNA-seq data, which are count-based and highly variable. In this work we propose a hierarchical Poisson log-normal model with a Lasso penalty to infer gene networks from RNA-seq data; this model has the advantage of directly modelling discrete data and accounting for inter-sample variance larger than the sample mean. Using real microRNA-seq data from breast cancer tumors and simulations, we compare this method to a regularized Gaussian graphical model on log-transformed data, and a Poisson log-linear graphical model with a Lasso penalty on power-transformed data. For data simulated with large inter-sample dispersion, the proposed model performs better than the other methods in terms of sensitivity, specificity and area under the ROC curve. These results show the necessity of methods specifically designed for gene network inference from RNA-seq data.