You have made it to this point because you already know about Hadoop. To solve your big data problems, you need the right platform, most popularly called the Hadoop ecosystem.
See the Hadoop ecosystem image below; I have collected it from Edureka for your quick reference. The ecosystem comprises 14 technologies, and my intention here is to cover what each of them is used for. The diagram explains the overall architecture well, and it is a good image for beginners to refer to and pick up the ideas quickly.
The following is the list of technologies involved:
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory Data Processing
- PIG, HIVE -> Data Processing Services using Query (SQL-like)
- HBase -> NoSQL Database
- Mahout, Spark MLlib -> Machine Learning
- Apache Drill -> SQL on Hadoop
- Zookeeper -> Managing Cluster
- Oozie -> Job Scheduling
- Flume, Sqoop -> Data Ingesting Services
- Solr & Lucene -> Searching & Indexing
- Ambari -> Provision, Monitor and Maintain cluster
Let us see in detail…
- HDFS is what makes it possible to store different types of large data sets (structured, semi-structured and unstructured); it splits each file into blocks and replicates those blocks across the cluster for fault tolerance.
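Under the hood, HDFS splits every file into fixed-size blocks (128 MB by default) and stores several replicas of each block on different DataNodes. A toy sketch of that idea in plain Python (not HDFS itself; the sizes are shrunk so it is easy to run):

```python
# Toy illustration of HDFS-style block splitting and replica placement.
# Real HDFS defaults: 128 MB blocks, replication factor 3.
BLOCK_SIZE = 10          # bytes here; HDFS default is 128 * 1024 * 1024
REPLICATION_FACTOR = 3   # HDFS default
DATANODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, like HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int):
    """Assign each block to REPLICATION_FACTOR distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [DATANODES[(b + r) % len(DATANODES)]
                        for r in range(REPLICATION_FACTOR)]
    return placement

data = b"a" * 25                      # a 25-byte "file"
blocks = split_into_blocks(data)      # three blocks: 10 + 10 + 5 bytes
placement = place_replicas(len(blocks))
```

The last block is allowed to be smaller than the block size, exactly as in HDFS.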
- YARN has two basic components: the ResourceManager, which allocates cluster resources among applications, and the NodeManagers, which launch and monitor containers on each worker node.
- MAP REDUCE
- MapReduce is a software framework that helps in writing applications which process large data sets using distributed and parallel algorithms inside the Hadoop environment.
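The map -> shuffle -> reduce flow can be made concrete with the classic word count, sketched here in plain Python (toy code, not Hadoop's Java API; in real Hadoop the same three phases run distributed across the cluster):

```python
# A MapReduce-style word count: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums each group's counts.
from collections import defaultdict

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "spark and hadoop process data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts["hadoop"] == 2, counts["data"] == 2
```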
- SPARK
- Apache Spark is a framework for real-time, in-memory data analytics in a distributed computing environment.
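Spark's speed comes from keeping intermediate data in memory and evaluating transformations lazily: nothing runs until an action asks for a result. A toy illustration in plain Python (this mimics the behaviour, it is not Spark's actual RDD API):

```python
# Toy "RDD": transformations are recorded but not executed until an
# action (collect) runs the whole pipeline in memory.
class ToyRDD:
    def __init__(self, data):
        self._data = list(data)
        self._ops = []            # pending transformations, not yet run

    def map(self, fn):            # transformation: recorded, not executed
        self._ops.append(("map", fn))
        return self

    def filter(self, fn):         # transformation: recorded, not executed
        self._ops.append(("filter", fn))
        return self

    def collect(self):            # action: triggers the in-memory computation
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; collect() runs the pipeline:
evens = rdd.collect()   # [0, 4, 16, 36, 64]
```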
- Facebook created HIVE for people who are fluent in SQL. Hive makes them feel at home in the Hadoop ecosystem: they write SQL-like HiveQL queries, and Hive turns those queries into distributed jobs.
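To show the style of query a Hive user writes, here is a GROUP BY aggregation run through Python's built-in sqlite3. This is not Hive, and the table name is made up for illustration; the point is only that the SQL itself is the familiar part, while Hive handles turning such statements into cluster jobs:

```python
# Plain SQL via sqlite3 (NOT Hive) to illustrate the SQL-like style
# of HiveQL. Table and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("alice", "home"), ("bob", "home"), ("alice", "about")])

# The kind of aggregation HiveQL makes easy over data in HDFS:
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
# rows == [("about", 1), ("home", 2)]
```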
- PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of the pair as analogous to Java and the JVM.
- Pig Latin has an SQL-like command structure.
- HBASE
- HBase is an open-source, non-relational (NoSQL) distributed database that runs on top of HDFS and supports fast random reads and writes on very large tables.
- Mahout provides an environment for creating scalable machine learning applications.
- Spark MLlib is Spark's machine learning library; because it runs in memory, it iterates over data much faster than disk-based approaches.
- Apache Drill is an open-source SQL-on-Hadoop engine inspired by Google's Dremel.
- A powerful feature of Drill is that it supports many different NoSQL databases and file systems, for example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.
- Apache ZooKeeper is the coordinator of any Hadoop job, which typically involves a combination of various services in the Hadoop ecosystem; it handles configuration, naming and synchronization between them.
- Consider Apache Oozie the clock and alarm service of the Hadoop ecosystem: it schedules Hadoop jobs and binds them together into workflows.
- Flume is a service that ingests unstructured and semi-structured data (streaming logs, for example) into HDFS.
- Sqoop, in contrast, can both import and export structured data between RDBMSs or enterprise data warehouses and HDFS.
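Sqoop imports a table as delimited text files (part-m-00000, part-m-00001, ...), one per map task. A toy sketch of that idea, using sqlite3 as the "RDBMS" and a dict as the "HDFS directory" (not Sqoop itself; table and data are invented):

```python
# Toy Sqoop-style import: split a table's rows across "mappers" and
# render each split as comma-delimited lines, like Sqoop's part files.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])

def toy_import(conn, table, num_mappers=2):
    """Read all rows, split them across num_mappers, and emit one
    comma-delimited 'part file' per mapper."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()
    chunk = -(-len(rows) // num_mappers)   # ceiling division
    files = {}
    for m in range(num_mappers):
        part = rows[m * chunk:(m + 1) * chunk]
        files[f"part-m-{m:05d}"] = "\n".join(
            ",".join(str(col) for col in row) for row in part)
    return files

files = toy_import(conn, "employees")
# files["part-m-00000"] == "1,alice\n2,bob"
```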
- SOLR & LUCENE
- Apache Solr and Apache Lucene are the two services used for searching and indexing in the Hadoop ecosystem; Lucene is the underlying indexing and search library, and Solr is a search server built on top of it.
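Lucene's core data structure is the inverted index: a map from each term to the documents containing it. A minimal sketch of the idea (toy code, not Lucene's actual implementation):

```python
# Build a tiny inverted index and query it, the way a search engine
# resolves a term to matching documents.
from collections import defaultdict

def build_inverted_index(docs: dict):
    """Map every term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, term):
    """Look a term up, case-insensitively; unknown terms match nothing."""
    return index.get(term.lower(), [])

docs = {1: "Hadoop stores big data",
        2: "Solr searches data fast",
        3: "Lucene indexes data"}
index = build_inverted_index(docs)
# search(index, "data") -> [1, 2, 3]; search(index, "Solr") -> [2]
```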
- Ambari is an Apache Software Foundation project that aims to make the Hadoop ecosystem more manageable; it provisions, monitors and maintains Hadoop clusters.
- That covers the overall architecture of the Hadoop ecosystem.
Top 3 Hadoop platforms
Another seven platforms are also popular; you can read about them here.