MASTER'S THESIS DEFENSE: "Application of Cosine Similarity in Bioinformatics"
Srikanth Maturu
Committee Members: Dr. Jitender Deogun (Advisor)
Dr. Ashok Samal and Dr. Etsuko Moriyama
Wednesday, December 6, 2017, 1:30 p.m.
347 Avery Hall
Abstract: Finding sequences similar to an input query sequence in a database of sequences is an important problem in bioinformatics. Knowing whether genes, proteins, or RNA structures are related is useful, and finding such similarities gives researchers an intuition of what could be related or reduces the search space for a more complex downstream task. An exact near-neighbor algorithm such as brute-force search has complexity O(m * n), where n is the database size and m is the query size. Such an algorithm faces time-complexity issues as the database and query sizes increase. In addition, alignment-based similarity measures such as minimum edit distance add further complexity to the exact algorithm.
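To illustrate the exact baseline described above, here is a minimal sketch of a brute-force search that compares a query against every database entry using edit distance. The function names and the toy sequences are assumptions made for illustration only; the thesis implementation may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program.

    Runs in O(len(a) * len(b)) time, which is the per-pair cost that
    an alignment-based measure adds on top of the database scan.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def brute_force_nearest(query: str, database: list[str]) -> tuple[int, int]:
    """Scan every database entry and return (index, distance) of the closest one."""
    distances = [edit_distance(query, sequence) for sequence in database]
    best = min(range(len(database)), key=distances.__getitem__)
    return best, distances[best]


# Toy amino acid strings, invented for this example.
database = ["MKTLA", "GGGGG", "MKVLA"]
print(brute_force_nearest("MKVLA", database))
```

Every query touches all database entries, so the total work grows with both the database and the query workload, and each comparison itself pays the quadratic edit-distance cost.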
In this thesis, we use alignment-free similarity measures, such as cosine similarity and squared Euclidean distance, by representing sequences as vectors. A cosine-similarity-based locality-sensitive hashing technique is used to reduce the number of pairwise comparisons when finding sequences similar to an input query.
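The idea of hashing vectors so that only likely neighbors are compared can be sketched as follows. This is a hedged illustration, not the thesis implementation. It assumes sequences are represented as k-mer count vectors (the abstract does not say how the vectors are built) and uses random-hyperplane hashing, a common locality-sensitive hashing scheme for cosine similarity. The class `CosineLSH`, its parameters, and the toy sequences are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def kmer_vector(sequence, k=2):
    """Sparse k-mer count vector (an assumed representation)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[kmer] for kmer, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class CosineLSH:
    """Random-hyperplane LSH: one signature bit per hyperplane."""

    def __init__(self, num_bits=8, k=2, seed=0):
        self.num_bits = num_bits
        self.k = k
        self._rng = random.Random(seed)
        self._planes = {}                  # k-mer -> one Gaussian coefficient per hyperplane
        self._buckets = defaultdict(list)  # signature -> indices of stored sequences
        self._entries = []                 # (sequence, vector) pairs

    def _coefficients(self, kmer):
        # Hyperplane coordinates are drawn lazily, the first time a k-mer is seen.
        if kmer not in self._planes:
            self._planes[kmer] = [self._rng.gauss(0.0, 1.0) for _ in range(self.num_bits)]
        return self._planes[kmer]

    def signature(self, vector):
        """Sign of the dot product with each random hyperplane."""
        sums = [0.0] * self.num_bits
        for kmer, count in vector.items():
            for bit, coeff in enumerate(self._coefficients(kmer)):
                sums[bit] += count * coeff
        return tuple(s >= 0.0 for s in sums)

    def add(self, sequence):
        vector = kmer_vector(sequence, self.k)
        self._entries.append((sequence, vector))
        self._buckets[self.signature(vector)].append(len(self._entries) - 1)

    def query(self, sequence):
        """Score only the sequences that share the query's bucket."""
        vector = kmer_vector(sequence, self.k)
        candidates = self._buckets.get(self.signature(vector), [])
        scored = [(cosine_similarity(vector, self._entries[i][1]), self._entries[i][0])
                  for i in candidates]
        return sorted(scored, reverse=True)


# Toy amino acid strings, invented for this example.
index = CosineLSH(num_bits=8, k=2, seed=0)
for seq in ["MKVLAAGIVG", "MKVLAAGIVA", "WWWWPPPPWW"]:
    index.add(seq)
print(index.query("MKVLAAGIVG"))
```

Vectors separated by a small angle agree on most signature bits, so they tend to land in the same bucket, and a query is scored only against that bucket rather than the whole database. The trade-off is that a true neighbor can occasionally fall into a different bucket, which is why such methods are approximate; practical systems often use several hash tables to reduce misses.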
We evaluated our algorithm on amino acid sequence datasets of size 120,000 and found that our cosine-similarity-based algorithm is 5-10 times faster than the exact algorithm and has 95% accuracy.