Hadoop’s MapReduce model was introduced by Google. Processing data in MapReduce is a two-step process.

MapReduce internal process

  • Map: the ingestion and transformation step. All input records are processed in parallel.
  • Reduce: the aggregation and summarization step. All records associated with a given key are processed together by a single entity.
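The two steps above can be sketched in plain Java (no Hadoop dependencies) with the classic word-count example: the map step turns each input line into (word, 1) pairs, and the reduce step sums all pairs that share a key. The class and method names here are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // Map step: transform one input record (a line of text) into (word, 1) pairs.
    // In real MapReduce, many mappers run this in parallel over input splits.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce step: all pairs sharing the same key are aggregated by a single reducer.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : List.of("big data", "big compute")) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // {big=2, compute=1, data=1}
    }
}
```

In a real Hadoop job the framework handles the shuffle between the two steps, grouping all values for a key and delivering them to one reducer; this sketch collapses that into a single in-memory list.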

Hadoop ecosystem

  • Developed by the Apache Software Foundation. It is open source.
  • Hadoop Core provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
  • HBase builds on Hadoop Core to provide a scalable, distributed database.
  • Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates to critical shared state.
  • Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad-hoc querying, and analysis of datasets.
  • HDFS: the Hadoop Distributed File System.

One response

  1. I’ve seen your blog post “Mainframe-How to Modernize Batch Process”. I’m contributing to an open-source project whose goal is to reproduce a batch execution environment (like on the mainframe) on open systems, in the cloud. It’s called “JEM, the BBE” and you can find it here: http://www.pepstock.org.
    Hadoop integration is planned as well!
    I hope you find it interesting!