Hadoop Ecosystem – MapReduce Internal Process

Hadoop’s MapReduce programming model was introduced by Google. Data processing in MapReduce is a two-step process.

MapReduce internal process

  • Map: an ingestion and transformation step. All input records are processed in parallel.
  • Reduce: an aggregation and summarization step. All records associated with a given key are processed together by a single reducer.
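The two steps above can be sketched as a minimal single-process simulation in Python. This is an illustrative word-count example, not Hadoop’s actual API; the input lines, function names, and the in-memory shuffle stand in for HDFS splits and the framework’s distributed shuffle/sort.

```python
from collections import defaultdict

# Hypothetical input records standing in for HDFS input splits.
lines = ["big data big insight", "big data pipeline"]

# Map step: each input record is transformed into (key, value) pairs
# independently, which is why all records can be processed in parallel.
def map_phase(record):
    for word in record.split():
        yield (word, 1)

mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: the framework groups every value that shares a key,
# so a single reducer sees all values for that key together.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: aggregate and summarize (here, sum the counts per word).
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'insight': 1, 'pipeline': 1}
```

In real Hadoop the map tasks run on many nodes, the grouping is done by a distributed shuffle, and reducers write their output back to HDFS; the data flow, however, is exactly the three stages shown.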

Hadoop ecosystem

  • Developed by the Apache Software Foundation; it is open source.
  • Hadoop Core provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing paradigm.
  • HBase builds on Hadoop Core to provide a scalable, distributed database.
  • Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates to critical shared state.
  • Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad-hoc querying, and analysis of datasets.
  • HDFS: the Hadoop Distributed File System.

Author: Srini

Experienced software developer with skills in development, coding, testing, and debugging. Good data analytics skills (data warehousing and BI), plus mainframe experience.

One thought

  1. I’ve seen your blog post “Mainframe – How to Modernize Batch Process”. I’m contributing to an open-source project whose goal is to reproduce a batch execution environment (like on the mainframe) on open systems, in the cloud. It’s called “JEM, the BBE”, and you can find it here: http://www.pepstock.org.
    Hadoop integration is planned as well!
    Let’s hope that could be interesting!
