April 25 Master's Thesis Defense: Ziling Huang

Abstract: The Hadoop Distributed File System (HDFS) is the distributed storage infrastructure for the Hadoop big-data analytics ecosystem. The NameNode of HDFS stores the metadata of the entire filesystem and coordinates the file content placement and retrieval actions of the data storage subsystems, called DataNodes. The NameNode architecture has long been viewed as the Achilles' heel of the Hadoop filesystem, as it not only represents a single point of failure but also limits the scalability of the storage tier. As Hadoop is deployed at increasing scale, this concern has become more prominent. Various solutions have been proposed to address this issue, but current solutions focus primarily on improving availability and leave scalability unaddressed. In this paper, we first present a brief study of the state-of-the-art solutions to the problem, assessing proposals from both industry and academia. We then propose a novel distributed NameNode architecture that improves both the availability and scalability of HDFS. Finally, we evaluate the enhanced architecture on a Hadoop cluster, applying both a micro metadata benchmark and the standard Hadoop macro benchmark.

Committee members: Dr. Hong Jiang (Advisor), Dr. Ying Lu and Dr. David Swanson