Using Deep Learning to detect DNA-regulatory elements

0. Authors

0.1 Corresponding Authors:


  • Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning


  • Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks

0.2 Important contributors to Deep Learning:

The BIG THREE during the development of deep learning theory:


Geoffrey Hinton:


  • Made major contributions twice


by Ran Bi, NYU

1. A brief introduction to DL

1.1. Data visualization: from PCA to t-SNE

Example from Colah’s Blog Visualizing MNIST: An Exploration of Dimensionality Reduction:

  • PCA performed well overall, but poorly in some detailed regions, e.g. separating the 4, 7 and 9 clusters.
Visualizing MNIST with PCA
  • t-SNE performed much better.
Visualizing MNIST with t-SNE
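As a sketch of what the PCA projection does here, a minimal SVD-based reduction to 2-D on toy data (not MNIST):

```python
import numpy as np

# Minimal PCA sketch via SVD: project high-dimensional points onto
# the two directions of largest variance. Toy data, not MNIST.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 0] *= 10.0              # give one direction much more variance

Xc = X - X.mean(axis=0)      # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T          # project onto the top-2 principal components

print(X2d.shape)             # each sample is now a 2-D point
```

t-SNE replaces this linear projection with a non-linear embedding that preserves local neighborhoods, which is why it separates the digit clusters better.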

1.2. t-SNE was invented during the development of deep learning theory


by Laurens van der Maaten

1.3. Deep Learning includes:

  • Multi-layer networks (the inception of deep learning)
    • More layers.
    • Using GPUs (graphics cards) to run many, many trials in parallel.
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
    • Time-series information (video, audio)
    • Using the LSTM architecture to make training easier.
  • Reinforcement Learning networks
    • Value network: to FEEL the environment.
    • Policy network: to decide on the best action.
    • AlphaGo, self-driving cars

1.4 Deep Learning ABC:

  • Three basic kinds of calculation: multiplication, addition and transformation.


by Deep Learning Udacity Course–ud730/
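The three operations can be sketched in a few lines of NumPy (a generic illustration, not code from the course):

```python
import numpy as np

# One neural-network layer = multiply, add, transform.
rng = np.random.default_rng(1)
x = rng.normal(size=4)        # input vector
W = rng.normal(size=(3, 4))   # weights (multiplication)
b = np.zeros(3)               # bias (addition)

z = W @ x + b                 # multiply and add
y = np.maximum(0.0, z)        # transform (here a ReLU non-linearity)
print(y.shape)                # (3,)
```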

  • For example: Colah’s blog Neural Networks, Manifolds, and Topology

    • The transforming process (animation in the original post).

    • The key point is training the parameters used for multiplication, addition and so on; bad parameters give bad results.

    • More dimensions are required for complex datasets. For example, it is hard to separate these points in a 2D graph; however, after adding one dimension, the two groups can be separated by a plane instead of a line.

1.5 Convolution Kernel


by iOS Developer Guide

1.6 Deep Learning frameworks

| Framework  | Core Programming Language     | Interfaces from Other Languages | Programming Paradigm | Wrappers                 |
| ---------- | ----------------------------- | ------------------------------- | -------------------- | ------------------------ |
| Caffe      | C++/CUDA                      | Python, Matlab                  | Imperative           | -                        |
| TensorFlow | C++/CUDA                      | Python                          | Declarative          | Pretty Tensor, Keras     |
| Theano     | Python (compiled to C++/CUDA) | -                               | Declarative          | Keras, Lasagne, or Blocks |
| Torch7     | LuaJIT (with C/CUDA backend)  | C                               | Imperative           | -                        |

TensorFlow: Biology’s Gateway to Deep Learning?

2. Predicting the sequence specificities of DNA- and RNA-binding proteins

2.1 Using published Data to train the model

2.2. The structure of the neural networks:


Let’s see it in detail:


2.2.1 conv

e.g. using a motif of length 3 to convolve the input sequence ATGG:
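A NumPy sketch of this convolution step, with arbitrary filter weights chosen so that the motif scores ATG highly (an illustration, not the paper's code):

```python
import numpy as np

# Convolve (cross-correlate) a length-3 motif filter over the
# one-hot-encoded sequence ATGG. Filter weights are arbitrary.
BASES = "ACGT"
def one_hot(seq):
    return np.array([[b == base for base in BASES] for b in seq], float)

X = one_hot("ATGG")                  # shape (4, 4): position x base
W = np.zeros((3, 4))                 # a motif filter of length 3
W[0, 0] = W[1, 3] = W[2, 2] = 1.0    # rewards A, then T, then G

scores = np.array([np.sum(X[i:i+3] * W) for i in range(len(X) - 2)])
print(scores)   # one score per valid position; highest where ATG matches
```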


2.2.2 rectify


Vanessa’s blog
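The rectify stage is a ReLU, which simply clips negative values to zero; a one-line sketch:

```python
import numpy as np

# Rectification (ReLU): negative activations become zero,
# positive activations pass through unchanged.
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(0.0, z)
print(relu)
```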

2.2.3 pooling


Deep Learning For Java
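A max-pooling sketch (window 2, stride 2; parameters chosen for illustration):

```python
import numpy as np

# Max pooling: keep the strongest activation in each window,
# halving the length of the signal.
x = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
pooled = x.reshape(-1, 2).max(axis=1)
print(pooled)   # [3. 2. 5.]
```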

2.2.4 neural network

  • Fully connected layer
    • Multiplication + addition + transformation. Scaled to sum to one at the end.


Visual Studio Magazine

  • The process for calculation:
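A sketch of the fully connected layer followed by a softmax, so that the outputs are non-negative and sum to one (layer sizes are illustrative):

```python
import numpy as np

# Fully connected layer + softmax: multiply, add, then
# transform so the outputs can be read as probabilities.
rng = np.random.default_rng(2)
h = rng.normal(size=8)          # pooled features from earlier layers
W = rng.normal(size=(2, 8))     # two output classes, e.g. bound / not bound
b = np.zeros(2)

logits = W @ h + b              # multiply + add
p = np.exp(logits - logits.max())
p /= p.sum()                    # scale to sum to one
print(p)
```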

2.3. Optimizing parameters to get the best performance:


  • Calibrate: use 3-fold cross-validation to evaluate 30 candidate groups of parameters and select the best one.
  • Train: repeat the calibrate process several times.
  • Test: apply the best parameters to held-out data never used in training.
  • Store this group of parameters so new data can be predicted without retraining.
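The calibrate step can be sketched as follows; `train_and_score` is a hypothetical placeholder that stands in for fitting and scoring the actual network:

```python
import numpy as np

# Sample 30 random hyper-parameter settings, score each with
# 3-fold cross-validation, and keep the best-scoring one.
rng = np.random.default_rng(3)
X = rng.normal(size=(90, 5))
y = (X[:, 0] > 0).astype(float)

def train_and_score(X_tr, y_tr, X_va, y_va, lr):
    # placeholder "model": in the papers a network is trained here;
    # this stub just thresholds the first feature
    pred = (X_va[:, 0] * lr > 0).astype(float)
    return (pred == y_va).mean()            # validation accuracy

folds = np.array_split(np.arange(len(X)), 3)
best_score, best_lr = -1.0, None
for lr in 10 ** rng.uniform(-4, 0, size=30):    # 30 candidate settings
    scores = []
    for k in range(3):                          # 3-fold CV
        va = folds[k]
        tr = np.concatenate([folds[j] for j in range(3) if j != k])
        scores.append(train_and_score(X[tr], y[tr], X[va], y[va], lr))
    if np.mean(scores) > best_score:
        best_score, best_lr = np.mean(scores), lr

print(best_score, best_lr)
```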

2.4. Quantitative performance on various types of held-out experimental test data.

2.4.1 DNA binding

In vitro microarray and in vivo ChIP data: better performance.


2.4.2 RNAcompete microarray

Better than previous methods.


Checked at the level of individual TFs.


2.4.3 Using all peaks rather than only the top 500 peaks gives better results


2.5 Potentially disease-causing genomic variants


  • A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia.
  • A gained GATA1 binding site that disrupts the original globin-cluster promoters.

2.6 RNA-binding proteins’ preferences for upstream and downstream information

  • Exons known to be downregulated by Nova had higher Nova scores in their upstream introns, and exons known to be upregulated by Nova had higher Nova scores in their downstream introns.


  • TIA has been shown to upregulate exons when bound to the downstream intron


2.7 What do the motifs in the convolution kernels look like after training?

  • Compared with known databases (DNA: JASPAR; RNA: CISBP-RNA):


3. Learning the regulatory code of the accessible genome with Deep CNN.

  • A very similar deep learning structure compared with the NBT 3300 article.
  • Source code available.

3.1 Data source

  • ENCODE Project Consortium + Roadmap Epigenomics Project: BED files of peaks from 164 samples.
  • hg19 genome sequences.
X: 200 million × (600 bp × 4); Y: 200 million × 164

e.g., Y looks like:

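The encoding can be sketched as follows (toy sequence and hypothetical peak indices, not real data):

```python
import numpy as np

# Each 600 bp sequence becomes a 600 x 4 one-hot matrix X, and Y is
# a binary vector over the 164 samples marking which ones have an
# accessibility peak at that site.
BASES = "ACGT"
def one_hot(seq):
    X = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        X[i, BASES.index(b)] = 1.0
    return X

seq = "ACGT" * 150                 # a toy 600 bp sequence
X = one_hot(seq)
y = np.zeros(164)                  # accessibility across 164 samples
y[[3, 17, 42]] = 1.0               # hypothetical peaks in three samples

print(X.shape, int(y.sum()))       # (600, 4) 3
```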

3.2 The structure of the neural network

  • A similar structure


  • More layers


This architecture is recommended by Spearmint

3.2.1: SGD:

Divide all training samples into many mini-batches, and use one batch at a time to update the parameters, which speeds up training.


Sebastian Ruder’s blog:
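A minimal mini-batch SGD sketch on a toy linear-regression problem (illustrative only; the papers train neural networks this way):

```python
import numpy as np

# Mini-batch SGD: shuffle each epoch, then update the weights
# from one small batch at a time instead of the full dataset.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                      # noiseless toy targets

w = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))   # shuffle the sample order
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient
        w -= lr * grad

print(w)   # converges close to [1.0, -2.0, 0.5]
```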

3.2.2: Batch Normalization(BN):

Definition — input: values of $x$ over a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$; parameters to be learned: $\gamma, \beta$; output: $y_i = \mathrm{BN}_{\gamma,\beta}(x_i)$, computed as

$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^{2} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^{2}, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta$$

Yes, BN is essentially a z-score: it rescales values toward the well-conditioned center of the optimizer's range, which helps speed up optimization and also reach higher accuracy.
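This z-score view can be sketched directly in NumPy (a generic illustration with gamma and beta at their initial values, not the paper's code):

```python
import numpy as np

# Batch normalization over one mini-batch: normalize to zero mean
# and unit variance, then rescale with learned gamma and beta
# (initialized here to 1 and 0).
rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=64)   # one feature over a batch

eps = 1e-5
mu = x.mean()
var = x.var()
x_hat = (x - mu) / np.sqrt(var + eps)          # the z-score step
gamma, beta = 1.0, 0.0                         # learned parameters
y = gamma * x_hat + beta

print(round(y.mean(), 6), round(y.std(), 3))   # ~0 and ~1
```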

3.2.3: Drop-out

Randomly choose a subset of nodes to train at each step, in order to improve the robustness of the network.


Journal of Machine Learning Research 15(2014) 1929-1958:
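A minimal sketch of "inverted" dropout, a common implementation choice (not taken from the paper's source code):

```python
import numpy as np

# Dropout at training time: randomly zero each activation with
# probability p, and rescale the survivors by 1/(1-p) so the
# expected activation is unchanged.
rng = np.random.default_rng(6)
h = np.ones(1000)             # activations from some layer
p = 0.5                       # drop probability

mask = rng.random(h.shape) >= p
h_train = h * mask / (1 - p)  # survivors scaled up by 1/(1-p)

print(mask.mean())            # roughly half the units survive
```

At test time all units are kept and no rescaling is needed, because the scaling was already folded in during training.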

3.3 Basset accurately predicts cell-specific DNA accessibility


  • Better than previous methods.
  • Differences for AUC between cell types.

3.4 The convolution kernels


For A:

  • x axis: information content of each kernel’s motif (for a DNA motif, IC = Σ_j (2 + Σ_b p_jb · log2 p_jb)).
  • y axis: influence, reflecting how the accessibility prediction changes across all cells.

  • High-influence but unannotated kernels include CpG and TATA boxes.
  • 45% of the kernels could be annotated.

For C:

  • cell-specific patterns.

3.5 Accessibility and Binding-Sites


  • AP-1 complex members include JUN and JUND.
  • The open region includes JUN/JUND peaks.
  • The Basset result showed that a mutation in the FOS motif leads to loss of accessibility.


  • Conservation also correlates with the signal.

3.6 Using GWAS data to validate.


  • Basset scores for general GWAS SNPs vs. causal GWAS SNPs.


  • Basset reports an 85% causal probability for the T>C variant rs4409785 in vitiligo.
  • The DNA becomes open and CTCF can bind.


  • ENCODE CTCF raw-read data at rs4409785.
  • 21/88 samples were sequenced here, 11 of which have peaks, and almost all sequenced samples carry the T>C mutation.

4. Deep Learning needs GPU

| Type               | Tesla K20m GPU | Mac Intel i7 CPU |
| ------------------ | -------------- | ---------------- |
| Seeded single-task | 18m            | 6h37m            |
| Full multi-task    | 85h            | -                |

Author: Boqiang Hu. Comments and discussion are welcome.
When reposting, please cite the source: [20160808 Journal Club] Using Deep Learning to study the gene-regulatory elements.