MapReduce Framework
In this article, I am going to discuss the MapReduce Framework. Please read our previous article, where we discussed What is Hadoop. At the end of this article, you will understand everything about the MapReduce Framework.
Hadoop MapReduce is a software framework that makes it simple to write programs that process enormous volumes of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant way.
A MapReduce job splits the input data set into independent chunks, which are processed by the map tasks in parallel. The framework sorts the outputs of the maps, which are then fed as input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any tasks that fail.
The MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) typically run on the same set of nodes, so the compute nodes and the storage nodes are the same. This configuration allows the framework to schedule tasks effectively on the nodes where the data already resides, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs’ component tasks on the slaves, monitoring them, and re-executing any failed tasks. The slaves execute the tasks as directed by the master.
Hadoop can run MapReduce programs written in a variety of languages, including Java, Ruby, Python, and C++. MapReduce programs are inherently parallel, which makes them ideal for large-scale data analysis across many servers in a cluster.
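To give a feel for what such a program looks like in Java, below is a minimal sketch of the classic word-count example written against Hadoop’s org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are illustrative, and each class would normally live in its own source file.

```java
// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: turns each input line into (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}
```

```java
// WordCountReducer.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums the counts emitted for each word.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}
```

The mapper emits a (word, 1) pair for every word it sees, and the reducer receives all the counts collected for a given word and adds them up.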
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
- Map stage − The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
- Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper: the framework first shuffles and sorts the mapper outputs by key, and the reducer then processes each group of values. After processing, it produces a new set of output, which is stored in HDFS. (A worked example of the stages follows this list.)
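To see how the three stages fit together, consider running the word-count example above on a toy input consisting of the two lines “deer bear river” and “car car river” (a made-up input, purely for illustration). The map stage emits (deer, 1), (bear, 1), (river, 1), (car, 1), (car, 1), (river, 1); the shuffle stage sorts and groups these pairs by key into (bear, [1]), (car, [1, 1]), (deer, [1]), (river, [1, 1]); and the reduce stage sums each list to produce the final output (bear, 1), (car, 2), (deer, 1), (river, 2), which is written back to HDFS.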
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data between the nodes of the cluster. Most of the computing takes place on nodes that hold the data on their local disks, which reduces network traffic.
After the tasks complete, the cluster collects and reduces the intermediate data to form the final result and writes it back to HDFS.
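Putting the pieces together, a small driver class configures the job and submits it to the cluster; the framework takes care of everything described above. The following is a minimal sketch under the same assumptions as the earlier word-count code (the class name WordCountDriver and the use of command-line arguments for the HDFS input and output paths are illustrative):

```java
// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map stage
        job.setCombinerClass(WordCountReducer.class);   // optional map-side aggregation
        job.setReducerClass(WordCountReducer.class);    // reduce stage
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS and the final result is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish; the framework schedules,
        // monitors, and re-runs failed tasks across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Using the reducer as a combiner is an optional optimization: because adding counts is associative and commutative, partial sums can be computed on the map side, which cuts down the amount of data shuffled across the network.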
In the next article, I am going to discuss the Hadoop Ecosystem. Here, in this article, I tried to explain the MapReduce Framework, and I hope you enjoy this MapReduce Framework article.