pradipta ray

A wordle of my research
I work on computational aspects of comparative and regulatory genomics, with applications to sensory neurons. My interests lie in Regulatory Genomics, Comparative & Evolutionary Genomics, Genome Annotation, Epigenetics, and Selectional analyses. I use cutting-edge machine learning techniques (supervised and unsupervised, likelihood-based and non-likelihood-based) to tease out meaningful patterns from the data. My broader research interests involve estimation and inference algorithms for machine learning methods in the natural and social sciences.

Current projects

Characterizing the transcriptomic landscape of sensory neurons

We computationally analyze sensory neuronal transcriptomes to identify lineage-specific genes and candidate therapeutic targets, and to identify transcriptional regulatory forces (transcription factors, cofactors, DNA methylation and other epigenetic marks) in peripheral nerves, dorsal root ganglia, trigeminal ganglia and dura mater, in contrast with central nervous system (brain and spinal cord) neurons and other tissues. We identify gene co-expression modules and characterize the broad gene expression patterns that shape the sensory neuronal transcriptome. See project website.

Computationally modelling gene transcript variation in sensory neurons

Computational modelling of alternative start sites, alternative splicing and alternative polyadenylation in sensory neurons to identify gene transcript variants specific to sensory versus central nervous system (brain and spinal cord) neurons. Additionally, we characterize regulatory elements, such as splicing factors, that are correlated with these transcript variant co-expression patterns, seeking to identify the mechanisms of such tissue-specific expression. Finally, we identify putative functions of these tissue-specific transcript variants by analyzing short linear motifs in transcript regions that are unique to particular sensory tissues.

Computational characterization of the active translatome of sensory neurons and its underlying regulatory mechanisms

Computational modelling and analysis of the active translatome in mouse sensory neurons using Translating Ribosome Affinity Purification (TRAP pulldown) and Ribosomal Footprinting assays, and corresponding motif analysis of untranslated regions in gene modules with differential abundance across tissues or conditions. This allows us to explain regulation in two steps: transcriptional and post-transcriptional versus translational. Our analysis models upstream open reading frames (uORFs), and analyzes the regulatory effects such uORFs have on translation.
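As a rough illustration of what identifying a uORF involves, the following is a minimal sketch, not the pipeline used in this project. It assumes the simplest definition: an ATG in the 5' untranslated region followed by an in-frame stop codon that also lies within the UTR. The function name `find_uorfs` and the `min_codons` parameter are hypothetical, and non-canonical start codons and uORFs that overlap the main coding sequence are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5, min_codons=2):
    """Return (start, end) pairs for ORFs fully contained in a 5' UTR.

    start is the 0-based index of the A in ATG; end is one past the stop codon.
    min_codons counts the start codon plus coding codons, excluding the stop.
    """
    # Accept RNA or DNA input by mapping U to T.
    seq = utr5.upper().replace("U", "T")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    uorfs.append((start, pos + 3))
                break  # the first in-frame stop ends this ORF
    return uorfs


if __name__ == "__main__":
    # Toy UTR: ATG AAA TAG embedded in flanking sequence.
    print(find_uorfs("GGATGAAATAGCC"))
```

Each reported interval could then be intersected with ribosome footprint coverage to ask whether the uORF shows evidence of translation; that step is outside this sketch.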

Characterizing sensory neuronal subtypes by analyzing single cell RNA-seq data in sensory tissues

We analyze single-cell RNA-sequencing data to identify sensory neuronal subpopulations and the gene regulatory modules within them, and to examine how each sensory subtype may have evolved between human and model systems like mouse. Such empirical models of the evolution of transcriptional landscapes clarify the transcriptional contributions of different cell types, and involve computationally deconvolving transcriptional signatures from sensory tissues by carefully analyzing single-cell data.

Finished projects

DIRECTION: Discriminative IntegRative whole Epigenome Classification Toolkit at single nucleotide resolutION

LIBRETTO: LIneage Based analysis of the REgulome and TranscripTOme in H1 hESC cells and derived lineages


5-Methylcytosine and 5-Hydroxymethylcytosine in DNA are major epigenetic modifications known to significantly alter mammalian gene expression. High-throughput assays to detect these modifications are expensive, labor-intensive, infeasible in some contexts, and leave a portion of the genome unqueried. Many algorithms for methylation prediction exist, but they often depend largely on context-sensitive motifs that can vary widely from cell type to cell type, making transfer learning impossible, and none exist for hydroxymethylation.

Hence, we devised a novel supervised, integrative learning framework to perform de novo whole-genome methylation and hydroxymethylation predictions in CpG dinucleotides. Our framework can also impute missing or low-quality data in existing sequencing datasets. Additionally, we developed infrastructure to perform in silico, high-throughput hypothesis testing on such predicted methylation or hydroxymethylation maps. We test our approach on H1 human embryonic stem cells and H1-derived neural progenitor cells. Our predictive model is comparable in accuracy to other state-of-the-art DNA methylation prediction algorithms. We are the first to predict hydroxymethylation in silico with high whole-genome accuracy, paving the way for large-scale reconstruction of hydroxymethylation maps in mammalian model systems. We designed a novel, beam-search-driven feature selection algorithm to identify the most discriminative predictor variables, and developed a platform for performing integrative analysis and reconstruction of the epigenome. Our toolkit DIRECTION provides predictions at single nucleotide resolution and identifies relevant features based on resource availability. This offers enhanced biological interpretability of the generated results, potentially leading to a better understanding of epigenetic gene regulation.
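To make the idea of beam-search feature selection concrete, here is a minimal generic sketch. It is not the DIRECTION implementation: the function `beam_search_select`, the feature names, and the toy scoring function with made-up gains are all assumptions for illustration. In practice the score would be something like cross-validated classifier accuracy on a feature subset.

```python
def beam_search_select(features, score, beam_width=2, max_features=4):
    """Beam search over feature subsets.

    score is a callable mapping a frozenset of feature names to a float
    (higher is better). Only the beam_width best subsets are kept per round.
    """
    beam = [frozenset()]
    best_subset, best_score = frozenset(), score(frozenset())
    for _ in range(max_features):
        # Grow every subset in the beam by one unused feature.
        candidates = {s | {f} for s in beam for f in features if f not in s}
        if not candidates:
            break
        # Rank by score, breaking ties alphabetically for determinism.
        ranked = sorted(candidates, key=lambda s: (-score(s), sorted(s)))
        beam = ranked[:beam_width]
        top_score = score(beam[0])
        if top_score > best_score:
            best_subset, best_score = beam[0], top_score
    return best_subset, best_score


# Hypothetical feature names and made-up gains, for illustration only.
GAINS = {"cpg_density": 0.30, "h3k4me3": 0.25, "h3k27ac": 0.20, "gc_content": 0.02}


def toy_score(subset):
    """Toy score: summed gain, a per-feature cost, and a redundancy penalty."""
    total = sum(GAINS[f] for f in subset) - 0.05 * len(subset)
    if {"h3k4me3", "h3k27ac"} <= subset:
        total -= 0.20  # pretend these two features carry overlapping signal
    return total


if __name__ == "__main__":
    subset, value = beam_search_select(list(GAINS), toy_score)
    print(sorted(subset), round(value, 2))
```

Compared with greedy forward selection (a beam of width 1), a wider beam keeps several partial subsets alive, so a feature that looks second-best alone can still end up in the final set.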


The H1 hES cells and derived lineages (neural progenitors, mesoendodermal cells, mesenchymal stem cells and trophoblast-like cells) provide a unique model system for understanding the transcriptional landscape and regulatory forces in reprogrammed ES cells and their derived lineages. The NIH Epigenomics Roadmap project, of which we were a part, undertook this venture: it studied the regulatory role of epigenetic marks in cell differentiation, and also analyzed the regulatory roles of transcription factors, noncoding RNA and splicing.

For the ES cell and each lineage, we identified lineage-specific coding, long non-coding and small RNA genes and splice variants using information-theoretic approaches. Based on the co-expression patterns, we identified epigenetic marks at promoters, enhancers and splice sites that were correlated with those patterns. We additionally identified transcription factors, splicing factors and small RNAs whose predicted targets were also correlated with the expression patterns, in order to characterize the regulatory landscape of the cells.
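One common information-theoretic way to score lineage specificity is to compare the Shannon entropy of a gene's expression profile with the maximum possible entropy. The sketch below illustrates that general idea only; the exact measure used in this project is not stated here, and the function name `specificity_bits` and the expression values are made up.

```python
import math


def specificity_bits(expression):
    """Return log2(N) minus the Shannon entropy of a normalized profile.

    expression maps lineage name to a non-negative expression level.
    0 bits means uniform expression; log2(N) bits means a single lineage.
    """
    total = sum(expression.values())
    if total <= 0:
        return 0.0  # an unexpressed gene carries no specificity signal
    entropy = 0.0
    for value in expression.values():
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return math.log2(len(expression)) - entropy


if __name__ == "__main__":
    # Toy values, not real measurements.
    housekeeping = {"H1": 10, "neural": 10, "mesendo": 10, "mesenchymal": 10}
    restricted = {"H1": 0, "neural": 40, "mesendo": 0, "mesenchymal": 0}
    print(specificity_bits(housekeeping), specificity_bits(restricted))
```

Genes could then be ranked by this score, with the lineage holding the largest share of expression taken as the candidate lineage of specificity.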

The Roadmap Consortium website here

The Cell paper here

Nature paper here

Overarching consortium paper in Nature here

ASD: Admixture of Stochastic Dictionaries for modelling evolution in regulatory regions

DISCOVER: A feature-based discriminative method for motif search in complex genomes


Functional selection acting on regulatory regions such as cis-regulatory modules (CRMs) causes differential enrichment of nucleotide content across evolutionarily related organisms. The exact impact of such selection on gene regulatory mechanisms is not yet clearly known, but one important characteristic of CRMs in higher organisms is that they are often multi-functional: under different conditions and at different times, the same sequence in a CRM can drive different regulatory functions by recruiting different combinations of transcription regulatory proteins. Existing models for transcription factor binding sites (TFBSs), such as PWMs or single dictionaries of oligomers, cannot capture the multi-functionality of CRMs and offer no insight into the evolutionary mechanism of this phenomenon.

We developed a novel Admixture of Stochastic Dictionaries (ASD) model for the CRM and the motifs therein, which succinctly extracts and exposes the sequence-compositional basis of such multi-functionality. We have developed algorithms for learning the ASD within one organism and across multiple evolutionarily related organisms, which allow us to examine the multi-functionality of CRMs and the way it evolves by analyzing the extent of change of every functionality-specific dictionary in the ASD models across organisms. We show that the learned component dictionaries in our model are indeed functionally discriminative and can be used for predicting regulatory regions. We further show that this discriminative power is based on their TF binding affinity scores. We find that the corresponding functionality-specific dictionaries across species have similar (but non-identical) distributions over oligomers, such that regulatory information from one species can be used to predict regulatory regions in other species. We conclude that our model is easy to estimate and interpret, serves as a good platform for modeling functional evolution of the regulatory genome, and is a useful tool for identifying regulatory function based on these properties.


Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to the degeneracy of nucleotide content among binding site instances or motifs, the intricate "grammatical organization" of motifs within cis-regulatory modules, and the noisy position weight matrices that characterize binding sites, extant pattern-matching-based in silico motif search methods often suffer from impractically high false positive rates, especially when analyzing large genomic datasets.

We address this problem with a framework that maximally utilizes the information content of the genomic DNA in the query region, taking cues from the values of various biologically meaningful genetic and epigenetic factors in that region, such as clade-specific evolutionary parameters and the presence or absence of nearby coding regions. We present a new method for TFBS prediction in metazoan genomes which utilizes both the cis-regulatory module (CRM) architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model, the conditional random field, that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1-score.
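For context, the pattern-matching baseline that DISCOVER is contrasted with can be sketched as a log-odds position weight matrix (PWM) scan. This is a minimal illustration of that standard technique, not part of DISCOVER itself; the function names, the uniform background, the pseudocount, and the toy count matrix are assumptions, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, window, score) for windows scoring at or above threshold."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases such as N
        window_score = sum(matrix[j][b] for j, b in enumerate(window))
        if window_score >= threshold:
            hits.append((i, window, window_score))
    return hits


if __name__ == "__main__":
    # Toy counts for a TATA-like motif built from 10 imaginary sites.
    counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
    pwm = log_odds_matrix(counts)
    print(scan("GGTATACC", pwm, threshold=5.0))
    print(len(scan("GGTATACC", pwm, threshold=-1.0)))
```

Lowering the threshold in the toy example admits partial matches alongside the true site, which is the false positive behavior described above: each window is scored in isolation, with no use of neighboring motifs or other features of the region.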

Website, code and result visualization


Podium talk at ISMB 2009 by Pradipta Ray

CSMET: Conditional Shadowing via Multi-resolution Evolutionary Trees

BayCis: A Bayesian hHMM for cis-regulatory module decoding in metazoan genomes


Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, is a common event during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at the nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search of non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent. Existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles.

We propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny conditioning on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take into consideration lineage-specific evolution of TFBSs during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.

Website, code and result visualization


Podium talk at RECOMB Regulatory Genomics 2007 by Eric Xing


The transcriptional regulatory sequences in metazoan genomes often consist of multiple cis-regulatory modules (CRMs). Each CRM contains locally enriched occurrences of binding sites (motifs) for a certain array of regulatory proteins, capable of integrating, amplifying or attenuating multiple regulatory signals via combinatorial interaction with these proteins. The architecture of CRM organizations is reminiscent of the grammatical rules underlying a natural language, and presents a particular challenge to computational motif and CRM identification in metazoan genomes.

We present BayCis, a Bayesian hierarchical HMM that attempts to capture the stochastic syntactic rules of CRM organization. Under the BayCis model, all candidate sites are evaluated based on a posterior probability measure that takes into consideration their similarity to known binding sites, their contrast against the local genomic context, their first-order dependencies on upstream sequence elements, and priors reflecting general knowledge of CRM structure. We compare our approach to five existing methods for the discovery of CRMs, and demonstrate competitive or superior prediction results evaluated against experimentally based annotations on a comprehensive selection of Drosophila regulatory regions.

Website, code and result visualization


Podium talk at RECOMB 2008 by Pradipta Ray