Abstract: The rapid development of next-generation sequencing technology has led to unprecedented growth in protein sequence data repositories over the last decade. The majority of these proteins lack structural and functional characterization. This necessitates the design and development of fast, efficient, and sensitive computational tools and algorithms that can classify these proteins into functionally coherent groups.
Domains are fundamental units of protein structure and function. Multi-domain proteins are considerably more complex than proteins with a single domain or no domain. Their evolution involves network-like events such as domain shuffling, domain loss, and domain gain. These events cannot be represented by conventional protein clustering approaches such as phylogenetic reconstruction and Markov chain clustering. In this thesis, a multi-domain protein classification system is developed based primarily on the domain composition of protein sequences. Using the principle of co-clustering (biclustering), proteins and domains are clustered simultaneously, so that each bicluster contains a subset of proteins and a subset of domains that together form a complete bipartite graph. The biclusters are then connected into a network based on the domains they share, thereby classifying the proteins into similar protein families.
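The idea of biclusters as complete bipartite graphs, linked by shared domains, can be illustrated with a minimal sketch. It is not the algorithm developed in the thesis: it assumes the simplest possible rule, in which proteins with identical domain sets form one bicluster, and two biclusters are linked when they share at least one domain. The protein identifiers, domain names, and function names (`build_biclusters`, `build_network`) are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: protein -> set of domains (illustrative names only).
proteins = {
    "P1": {"RGS", "DEP"},
    "P2": {"RGS", "DEP"},
    "P3": {"RGS", "PDZ"},
    "P4": {"Kinase"},
}

def build_biclusters(protein_domains):
    """Group proteins that have identical domain sets.

    Every protein in a group contains every domain of that group, so the
    group and its domain set form a complete bipartite graph.
    Assumption: exact-match grouping stands in for a real biclustering method.
    """
    groups = defaultdict(set)
    for protein, domains in protein_domains.items():
        groups[frozenset(domains)].add(protein)
    # Each bicluster is a (proteins, domains) pair.
    return [(frozenset(members), domains) for domains, members in groups.items()]

def build_network(biclusters):
    """Link every pair of biclusters that shares at least one domain."""
    edges = {}
    for (i, (_, d1)), (j, (_, d2)) in combinations(enumerate(biclusters), 2):
        shared = d1 & d2
        if shared:
            edges[(i, j)] = shared  # edge labelled with the shared domains
    return edges

biclusters = build_biclusters(proteins)
network = build_network(biclusters)

for index, (members, domains) in enumerate(biclusters):
    print(index, sorted(members), sorted(domains))
for (i, j), shared in network.items():
    print(f"bicluster {i} -- bicluster {j} share {sorted(shared)}")
```

In this toy input, the two proteins carrying both RGS and DEP form one bicluster, which is linked through the shared RGS domain to the bicluster containing the RGS and PDZ protein; the kinase-only protein stays unconnected.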
We applied our biclustering network approach to a multi-domain protein family, the Regulator of G-protein Signalling (RGS) proteins, in which domain composition is heterogeneous among subfamilies. The resulting clusters were consistent with the existing RGS subfamilies. The average Jaccard index scores of the clusters obtained by Markov Chain Clustering (MCL) and by phylogenetic clustering, each compared against the biclusters, were 0.64 and 0.60, respectively. Bicluster networks built on nine complete proteomes showed that the proportion of multi-domain proteins included in connected biclusters increased rapidly with genome complexity, from 48.5% in bacteria to 80% in eukaryotes. Because our approach uses auxiliary domain information for each protein, it generates more functionally coherent protein clusters than other existing methods.
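For reference, the Jaccard index of two clusters A and B is |A ∩ B| / |A ∪ B|. The sketch below shows one way an average score between two clusterings could be computed. The averaging scheme (the best-matching cluster for each bicluster, then the mean) is an assumption, since the abstract does not state how the averages were obtained, and the function names and protein identifiers are hypothetical.

```python
def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of protein IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

def mean_best_jaccard(reference, candidates):
    """Average, over reference clusters, of the best Jaccard match.

    Assumption: each reference cluster (e.g. a bicluster) is scored against
    its most similar candidate cluster (e.g. an MCL cluster).
    """
    scores = [max(jaccard_index(ref, cand) for cand in candidates)
              for ref in reference]
    return sum(scores) / len(scores)

# Hypothetical toy clusterings.
reference_clusters = [{"P1", "P2"}, {"P3"}]
candidate_clusters = [{"P1", "P2", "P3"}]

print(jaccard_index({"P1", "P2", "P3"}, {"P2", "P3", "P4"}))
print(mean_best_jaccard(reference_clusters, candidate_clusters))
```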
Protein classification that incorporates this wealth of additional domain information into protein networks has wide applications and would benefit the functional analysis and characterization of novel proteins.
Committee Members: Dr. Stephen Scott (Advisor), Etsuko N. Moriyama, and Dr. Ashok Samal