Hadoop – Getting Started

Big Data has been one of the buzzwords of the last few years. It became a “thing” because of the enormous amounts of data now being generated, primarily by the internet. Hadoop is one of the core technologies in the Big Data space, and it is the starting point of almost any Big Data conversation. Hadoop’s core components are HDFS (the distributed file system used for storage), MapReduce (the processing framework), and YARN (the resource manager that schedules work on the cluster).
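To make the MapReduce part concrete, here is a minimal word-count job written against Hadoop’s Java MapReduce API. It is only a sketch, and the names (WordCount, TokenizerMapper, IntSumReducer) are illustrative: the mapper emits a (word, 1) pair for every token it reads, and the reducer sums the counts that Hadoop has grouped under each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: for every input line, emit a (word, 1) pair per token.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: Hadoop groups the pairs by word, so we just sum the 1s.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

HDFS supplies the input blocks, YARN decides where the map and reduce tasks run, and MapReduce is the programming model the two classes above implement.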

The Hadoop ecosystem includes Hive, HBase, Pig, Flume/Sqoop, Spark, and Oozie. Hive provides a SQL-like interface to Hadoop, offering a bridge to developers who don’t have Java experience. HBase is a NoSQL database platform. Pig is a data manipulation language that helps transform unstructured data into a structured format, which we can then query using something like Hive. Spark is a distributed computing engine used alongside Hadoop; it is fast, intuitive, and comes with a large set of libraries (for SQL, streaming, machine learning, and more). Oozie is a tool for scheduling workflows across the Hadoop ecosystem components. Flume and Sqoop are tools that transfer data between other systems and Hadoop: Sqoop moves bulk data to and from relational databases, while Flume ingests streaming data such as log files.
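To show why Spark is often described as more concise than hand-written MapReduce, here is the same word count using Spark’s Java API (Spark 2.x style). This too is just a sketch: the class name SparkWordCount is made up, the local[*] master is only convenient for local experiments, and in a real deployment the input path would typically point at HDFS.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // "local[*]" runs Spark on the local machine, handy for experiments.
            SparkConf conf = new SparkConf()
                    .setAppName("spark word count")
                    .setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]);  // could also be an HDFS path
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);
                counts.saveAsTextFile(args[1]);
            }
        }
    }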

Hadoop has three install modes. Standalone is the default: it runs on a single node, in a single JVM process, and uses the local file system for storage. HDFS and YARN do not run, as they are not required. Standalone mode is used to test MapReduce programs before deploying them to a cluster.
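Standalone mode needs no configuration at all, which is what makes it convenient for testing. The driver below, which would run the word-count sketch from earlier, is again illustrative (the class name LocalWordCountDriver and the use of command-line arguments for the paths are assumptions): with the default configuration the job reads and writes ordinary local files and runs inside a single JVM.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalWordCountDriver {
        public static void main(String[] args) throws Exception {
            // In standalone mode the default Configuration already points at the
            // local file system and runs the whole job in this one JVM.
            Configuration conf = new Configuration();

            Job job = Job.getInstance(conf, "word count (standalone test)");
            job.setJarByClass(LocalWordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Plain local paths; no hdfs:// prefix is needed here.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same code can later be packaged into a jar and submitted to a real cluster; only the configuration it picks up changes.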

Pseudo-distributed is the second install mode, and it also runs on a single node. However, the Hadoop daemons run as separate JVM processes, so the single machine simulates a cluster of separate nodes. HDFS is used for storage, and YARN is used to manage tasks. It is used as a fully fledged test environment.
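The difference between standalone and pseudo-distributed mode comes down to configuration, normally placed in core-site.xml, hdfs-site.xml and mapred-site.xml rather than in code. The snippet below sets the equivalent properties programmatically purely to show which settings matter; the hdfs://localhost:9000 address is a common illustrative default, not a requirement, and on a real install the daemons read these values from the XML files at start-up.

    import org.apache.hadoop.conf.Configuration;

    public class PseudoDistributedSettings {

        // Illustrative only: these properties normally live in the XML config
        // files that the Hadoop daemons read when they start.
        public static Configuration pseudoDistributedConf() {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // store data in HDFS instead of the local FS
            conf.set("dfs.replication", "1");                  // one node, so keep a single copy of each block
            conf.set("mapreduce.framework.name", "yarn");      // hand task scheduling to YARN
            return conf;
        }
    }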

Fully distributed mode runs on a cluster of machines. Manual configuration of a cluster is complicated, so enterprise distributions of Hadoop are typically used to simplify setup and management.