Dissertation Defense: "Finding DNA Motifs: A Probabilistic Suffix Tree Approach"
Abhishek Majumdar
Committee: Dr. Stephen Scott and Dr. Jitender Deogun (Co-Advisors), Dr. Lisong Xu, Dr. Steven Harris, and Dr. Etsuko Moriyama
Thursday, December 15, 2016, 11:30 a.m.
112 Schorr Center
Abstract:
We address the problem of de novo motif identification: given a set of DNA sequences, we try to identify motifs in the dataset without any prior knowledge of whether motifs exist in it. We propose a method based on Probabilistic Suffix Trees (PSTs) to identify fixed-length motifs in a given set of DNA sequences. Our experiments show that the approach successfully discovers true motifs. On synthetic data, the motifs found by our method distinguish their sequence clusters from other clusters almost perfectly (area under the ROC curve ≈ 0.987). Compared with the popular MEME algorithm, our method detects a larger number of correct and statistically significant motifs, and it is considerably more efficient than MEME at finding motifs in datasets of 1000 or more sequences. We applied our method to sequences of mutant strains of Exophiala dermatitidis and successfully identified motifs that revealed several transcription factor binding sites. This information is important to biologists performing experiments to understand the role of these sites in the different regulatory pathways affected by cdc42. We also show that our PST approach to de novo motif discovery can be used successfully to identify motifs in ChIP-Seq datasets. These motifs in turn identify protein binding sites in the sequences.