Abstract: Genes are regulated by cis-regulatory elements, which contain transcription factor (TF) binding motifs in specific arrangements. To understand the syntax of these motif arrangements and its influence on cooperative TF binding, we developed a new convolutional neural network called BPNet that models the relationship between regulatory DNA sequence and base-resolution binding profiles from ChIP-exo/nexus experiments targeting four pluripotency TFs (Oct4, Sox2, Nanog, and Klf4) in mouse embryonic stem cells. BPNet predicts base-resolution binding profiles and footprints on sequences not used in training with unprecedented accuracy, on par with replicate experiments. However, the primary appeal of neural networks for this specific application is that they are capable of learning predictive sequence representations from raw DNA sequence with minimal assumptions. Hence, interpreting these purported black-box models could reveal novel insights into the cis-regulatory code. We developed a suite of model interpretation methods to learn novel motif representations, accurately map predictive motif instances in the genome, and identify higher-order rules by which combinatorial motif syntax influences cooperative binding of these TFs. We discovered several novel motifs that are bound by these TFs and supported by distinct footprints. We further found that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding at protein or nucleosome range in a directional manner. Most strikingly, Nanog binding is driven by motifs with a strong preference for ~10.5 bp spacings corresponding to helical periodicity. We then validated our model’s predictions using CRISPR-induced point mutations of motif instances. The sequence representations learned by the binding models can also be seamlessly transferred to accurately predict differential chromatin accessibility after TF depletion and the results of massively parallel reporter experiments.
BPNet easily adapts to other types of profiling experiments (e.g. ChIP-seq, DNase-seq, ATAC-seq, PRO-seq), thus paving the way to decipher the complexity of the cis-regulatory code using deep learning oracle models of functional genomics data.
(Will be held online)