Using Deep Learning to detect DNA-regulatory elements
0.1 Corresponding Authors:
- Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
- Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks
0.2 Important contributors to Deep Learning:
The "big three" of Deep Learning theory, each of whom made major contributions more than once:
1. A brief introduction to DL
1.1. Data visualization: from PCA to t-SNE
Example from Colah’s Blog Visualizing MNIST: An Exploration of Dimensionality Reduction:
- PCA performs well overall, but poorly in some fine-grained regions, e.g., separating the 4, 7, and 9 clusters.
- t-SNE performs much better.
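As a minimal sketch of the comparison above, both embeddings can be computed with scikit-learn on the built-in digits dataset (a small stand-in for MNIST; the blog post's figures use MNIST itself):

```python
# Hedged sketch: embed the scikit-learn digits dataset in 2D with both
# PCA and t-SNE. Only the shapes are checked here; the visual cluster
# separation is what the referenced figures illustrate.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(pca_2d.shape, tsne_2d.shape)    # both (1797, 2)
```

Plotting `pca_2d` and `tsne_2d` colored by `y` reproduces the qualitative difference: PCA mixes the 4/7/9 clusters, t-SNE separates them.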
1.2. t-SNE was invented during the development of Deep Learning theory
by Laurens van der Maaten
1.3. Deep Learning includes:
- Multi-layer networks (the inception of Deep Learning)
- More layers.
- Using GPUs (graphics cards) for massively parallel computation.
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Time-series information (video, audio)
- Using the LSTM architecture to make training easier.
- Reinforcement Learning
- Value network: to perceive the environment.
- Policy network: to decide the best action.
- AlphaGo, self-driving cars
1.4 Deep Learning ABC:
- Three basic kinds of calculation: multiplication, addition, and transformation.
by Deep Learning Udacity Course
For example: Colah’s blog Neural Networks, Manifolds, and Topology
The transforming process:
The key point is training the parameters that are multiplied, added, and so on. Bad parameters give bad results:
More dimensions are required for complex datasets. For example, it is hard to separate these points in a 2D graph; however, after adding one dimension, the two groups can be separated by a plane instead of a line.
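The three basic operations can be sketched as a single tiny layer; all weight values below are illustrative, not from either paper:

```python
# Minimal sketch of one neural-network layer built from the three basic
# operations: multiplication (by weights W), addition (of a bias b),
# and a nonlinear transformation (here tanh).
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)      # multiply, add, transform

x = np.array([1.0, -2.0])          # a 2-D input point
W = np.array([[0.5, -0.3],         # 3x2 weights lift the 2-D input into
              [0.1,  0.8],         # 3-D, where groups that overlap in 2D
              [-0.7, 0.2]])        # can become separable by a plane
b = np.array([0.0, 0.1, -0.1])

h = layer(x, W, b)
print(h.shape)   # (3,)
```

The 3x2 weight matrix is exactly the "adding one dimension" trick: the layer maps 2D points into 3D, where a plane can separate groups that no line could.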
1.5 Convolution Kernel
1.6 Deep Learning frameworks
| Framework | Core Programming Language | Interfaces from Other Languages | Programming Paradigm | Wrappers |
|---|---|---|---|---|
| TensorFlow | C++/CUDA | Python | Declarative | Pretty Tensor, Keras |
| Theano | Python (compiled to C++/CUDA) | – | Declarative | Keras, Lasagne, or Blocks |
| Torch7 | LuaJIT (with C/CUDA backend) | C | Imperative | – |
TensorFlow: Biology’s Gateway to Deep Learning?
2. Predicting the sequence specificities of DNA- and RNA-binding proteins
2.1 Using published data to train the model
- 12 TB of sequence data (protein binding microarray, RNAcompete, ChIP-seq, and HT-SELEX)
- PBM (protein binding microarray): DREAM5 competition.
- RBP (RNA-binding protein): RNAcompete.
- ENCODE ChIP-seq: Table S4, from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/
- Input: X, the DNA sequence bound by the TF;
- Label: Y, whether there is a peak in this region.
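A minimal sketch of how such an (X, Y) pair can be encoded; the sequence and label here are made up for illustration:

```python
# Illustrative encoding of one training example: X is the one-hot matrix
# of a DNA sequence, Y is 1 if the region overlaps a ChIP-seq peak.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (length, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, BASES.index(base)] = 1.0
    return m

X = one_hot("ATGG")   # 4 positions x 4 bases
Y = 1                 # this region contains a peak
print(X)
```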
2.2. The structure of the neural networks:
Let’s look at it in detail:
e.g., using a motif of length 3 to convolve the input sequence ATGG:
Deep Learning For Java
2.2.4 Neural network
- Fully connected Layer
- Multiplication + addition + transformation, finally scaled to sum to one.
- The process for calculation:
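The fully connected layer and the final "scale to sum of one" step can be sketched as follows; input values and weights are illustrative:

```python
# A fully connected layer (multiplication + addition) followed by a
# softmax, which rescales the outputs so they sum to one.
import numpy as np

def fully_connected(x, W, b):
    return W @ x + b                 # multiplication + addition

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()               # scale to sum of one

x = np.array([0.2, -1.0, 0.5])
W = np.random.default_rng(0).normal(size=(2, 3))
b = np.zeros(2)

probs = softmax(fully_connected(x, W, b))
print(probs, probs.sum())            # two probabilities summing to 1
```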
2.3. Optimizing parameters to get the best performance:
- Calibrate: use 3-fold cross-validation (3xCV) to evaluate 30 groups of hyperparameters and select the best one.
- Train: repeat the calibration process several times.
- Test: evaluate the best parameters on unused, held-out data.
- Store this group of parameters to predict new data without retraining.
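The calibrate/train/test scheme above can be sketched with scikit-learn's `RandomizedSearchCV` as a stand-in (a toy classifier, not the paper's network; dataset and hyperparameter grid are made up):

```python
# Hedged sketch of calibrate/train/test: 3-fold CV over 30 randomly
# sampled hyperparameter settings, then a final score on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": list(np.logspace(-3, 3, 100))},  # candidate hyperparameters
    n_iter=30,                             # 30 settings ("calibrate")
    cv=3,                                  # 3-fold cross-validation
    random_state=0,
).fit(X_train, y_train)                    # "train"

print(search.best_params_)                 # parameters to store and reuse
print(search.score(X_test, y_test))        # "test" on unused data
```

The fitted `search` object plays the role of the stored parameter set: it can predict new data without retraining.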
2.4. Quantitative performance on various types of held-out experimental test data.
2.4.1 DNA binding
In vitro microarray and in vivo ChIP data; better performance.
2.4.2 RNAcompete microarray
Better than previous methods.
Checked at the TF level.
2.4.3 Using all peaks rather than the top 500 peaks gives better results
2.5 Potentially disease-causing genomic variants
- A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia
- A gained GATA1 binding site that disrupts the original globin cluster promoters.
2.6 RNA-binding proteins' preference for upstream and downstream information
- Exons known to be downregulated by Nova had higher Nova scores in their upstream introns, and exons known to be upregulated by Nova had higher Nova scores in their downstream introns.
- TIA has been shown to upregulate exons when bound to the downstream intron.
2.7 What do the motifs in the convolution kernels look like after training?
Comparing with known databases (DNA: JASPAR; RNA: CIS-BP-RNA):
3. Learning the regulatory code of the accessible genome with Deep CNN.
- A Deep Learning architecture very similar to the one in the NBT 3300 article.
- Source code available.
3.1 Data source
- ENCODE Project Consortium + Roadmap Epigenomics Project: BED peak files from 164 samples.
- hg19 genome sequence.
| X (input) | Y (labels) |
|---|---|
| 200 million × (600 bp × 4) | 200 million × 164 |
e.g., Y looks like:
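A small illustrative stand-in for Y (only 5 sites shown; the values are randomly generated, not real data):

```python
# Each 600-bp site gets a binary vector of length 164, one entry per
# sample, marking whether that site has an accessibility peak there.
import numpy as np

n_sites, n_samples = 5, 164
rng = np.random.default_rng(0)
Y = (rng.random((n_sites, n_samples)) < 0.1).astype(int)  # sparse peaks

print(Y.shape)     # (5, 164): one row per site, one column per sample
print(Y[0, :10])   # accessibility of site 0 in the first 10 samples
```

This multi-task label matrix is what lets a single network predict accessibility in all 164 cell types at once.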
3.2 The structure of the neural network
- A similar structure
- More layers
This architecture was recommended by Spearmint (a hyperparameter-optimization tool).
3.2.1: Mini-batch training:
Divide all training samples into many subsets and use one subset per parameter update in order to speed up training.
Sebastian Ruder’s blog:
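The mini-batch idea can be sketched with a hand-written SGD loop on a toy linear-regression problem (all data and learning-rate choices are illustrative):

```python
# Mini-batch SGD sketch: instead of computing the gradient over all
# samples per step, shuffle the data each epoch, split it into small
# batches, and update the parameters once per batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # noisy linear data

w, lr, batch = np.zeros(3), 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]        # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                          # one cheap update

print(w)   # close to [1.0, -2.0, 0.5]
```

Each update touches only 32 samples, so parameters improve many times per pass over the data, which is the speed-up described above.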
3.2.2: Batch Normalization (BN):
Definition: Input: values of x over a mini-batch B = {x_1, …, x_m}; parameters to be learned: γ, β. Output: y_i = BN_{γ,β}(x_i), computed as:
- μ_B = (1/m) Σ_i x_i (mini-batch mean)
- σ²_B = (1/m) Σ_i (x_i − μ_B)² (mini-batch variance)
- x̂_i = (x_i − μ_B) / √(σ²_B + ε) (normalize)
- y_i = γ x̂_i + β (scale and shift)
Yes, BN is essentially a z-score: it re-centers and re-scales activations into a well-conditioned range for the optimizer, which speeds up optimization and can also improve accuracy.
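The training-time normalization step can be sketched in a few lines (γ and β are shown fixed at 1 and 0 here; in a real network they are learned, and running statistics are kept for inference):

```python
# Batch normalization sketch: z-score each feature over the mini-batch,
# then rescale with learnable gamma/beta parameters.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # the z-score step
    return gamma * x_hat + beta               # learnable scale and shift

batch = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(6))   # ~0 per feature
print(out.std(axis=0).round(2))    # ~1 per feature
```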
3.2.3: Dropout:
Randomly drop a subset of nodes during each training step in order to improve the robustness of the network.
Journal of Machine Learning Research 15 (2014) 1929-1958:
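A minimal sketch of (inverted) dropout, the variant most libraries use; the drop probability and activations below are illustrative:

```python
# Inverted dropout: randomly zero a fraction of activations during
# training and rescale the survivors, so the expected activation is
# unchanged; at test time the layer passes values through untouched.
import numpy as np

def dropout(x, p_drop, rng, training=True):
    if not training:
        return x
    mask = rng.random(x.shape) >= p_drop      # keep with prob 1 - p_drop
    return x * mask / (1.0 - p_drop)          # rescale survivors

rng = np.random.default_rng(0)
acts = np.ones((4, 8))
dropped = dropout(acts, p_drop=0.5, rng=rng)
print(dropped)    # roughly half the entries zeroed, the rest scaled to 2
```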
3.3 Basset accurately predicts cell-specific DNA accessibility
- Better than previous methods.
- AUC differences between cell types.
3.4 The convolution kernels
- The x axis, information content, measures how specific each kernel's motif is.
- The y axis, influence, reflects how the accessibility prediction changes across all cells when the kernel is perturbed.
- High-influence but unannotated kernels include CpG and TATA boxes.
- 45% of kernels could be annotated.
- Cell-specific patterns.
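The information content on the x axis can be computed from a position weight matrix; for DNA, each column contributes 2 + Σ_b p_b·log2(p_b) bits. The example PWMs below are made up:

```python
# Information content of a PWM: 0 bits per position for a uniform
# (uninformative) column, up to 2 bits per position for a certain one.
import numpy as np

def information_content(pwm, eps=1e-9):
    """pwm: (positions, 4) matrix of base probabilities per column."""
    per_pos = 2.0 + (pwm * np.log2(pwm + eps)).sum(axis=1)
    return per_pos.sum()

uniform = np.full((3, 4), 0.25)       # completely uninformative motif
certain = np.eye(4)[[0, 3, 2]]        # A, T, G each with probability 1

print(information_content(uniform))   # ~0 bits
print(information_content(certain))   # ~6 bits (2 per position)
```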
3.5 Accessibility and Binding-Sites
- AP-1 complex members include JUN and JUND.
- The open regions include JUN/JUND peaks.
- Basset results showed that a mutation in the FOS motif leads to loss of accessibility.
- Conservation also correlates with the signal.
3.6 Validation with GWAS data.
- Basset scores for general GWAS SNPs vs. causal GWAS SNPs.
- Basset reports 85% causality for the T>C variant at rs4409785 for vitiligo.
- The DNA is opened and CTCF can bind.
- ENCODE CTCF raw-read data at rs4409785.
- 21 of 88 samples were sequenced here; 11 have peaks, and almost all sequenced samples carry the T>C mutation.
4. Deep Learning needs a GPU
| Type | Tesla K20m GPU | Mac Intel i7 CPU |
|---|---|---|
Author: Boqiang Hu. Comments and discussion are welcome.
When reposting, please credit the source: [20160808 Journal Club] Using Deep Learning to study the gene-regulatory elements.