Hadoop workflow process to understand easily

Capacity and Fair are well known tools for Scheduling jobs in Hadoop. We all know Hadoop is a Batch process ecosystem, and it does not have features to analyze the data on-the-fly. The job scheduling basics given in my slide-share presentation. It is a good document to start of it.

The point is up to you to decide which scheduler is useful. A better way is to go through all Hadoop documentation.

Hadoop Scheduler

Handling of tasks in Hadoop

  1. It follows FIFO – First in First Out process
  2. The first-in-first-out basis the tasks are handled in Hadoop.

You can refer the good book Hadoop work-flow scheduler using Oozie to get fair idea on scheduling the jobs in Hadoop ecosystem.

Capacity Scheduler

  • Capacity handles large clusters that are shared among multiple organizations or groups.
  • According to Hadoop documentation the capacity scheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.

Components in Capacity

In this multi-tenancy environment, a cluster can have many job types and multiple job priorities.

Organization

As it is designed for situations in which clusters need to support multi-tenancy, its resource sharing is more stringent so as to meet capacity, security, and resource guarantees.

Capacity

Resources are allocated to queues and are shared among the jobs on that queue. It is possible to set soft and hard limits on queue-based resources.

Security

In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to manage queue-based job access. It also permits per-queue administration, so that you can have different settings on the queues.

Elasticity

Free resources from under-utilized queues can be assigned to queues that have reached their capacities. When needed elsewhere, these resources can then be reassigned, thereby maximizing utilization.

Multi-tenancy

In a multi-tenancy environment, a single user’s rogue job could possibly soak up multiple tenants’ resources, which would have a serious impact on job-based service-level agreements (SLAs). The Capacity scheduler provides a range of limits for these multiple jobs, users, and queues so as to avoid this problem.

Resource-based Scheduling

Capacity uses an algorithm that supports memory-based resource scheduling for jobs that are resource intensive.

Hierarchical Queues

When used with Hadoop V2, Capacity supports a hierarchy of queues, so that underutilized resources are first shared among subqueues before they are then allocated to other cluster tenant queues.

Job Priorities

In Hadoop V1, the scheduler supports scheduling by job priority.

Operability

Capacity enables you to change the configuration of a queue at run-time via a console that permits viewing of the queues. In Hadoop V2, you can also stop a queue to let it drain.

Workflow in Hadoop

Fair Scheduler

  • Fair aims to do what its name implies: share resources fairly among all jobs within a cluster that is owned and used by a single organization. Over time, it aims to share resources evenly to job pools.
  • According to Hadoop documentation the Fair is – Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time

Components in Fair

Organization

This scheduler organizes jobs into pools, with resources shared among the pools. Attributes, like priorities, act as weights when the resources are shared.

Resource Sharing

You can specify a minimum level of resources to a pool. If a pool is empty, then Fair shares the resources of other pools.

Resource Limits

With Fair, you can specify concurrent job limits by user and pool so as to limit the load on the cluster.

The role of work-flow manager

Capacity, Fair, and similar plug-in schedulers deal with resources allocated to individual jobs over a period of time. However, what about the relationships between jobs and the dependencies between them?

That’s where workflow managers, like Apache’s Oozie, come in.

The critical functionality of Oozie is – Oozie offer the ability to schedule jobs that are organized into workflows by time and event. The deisgn of work-flow in Oozie is done by using XML. Refer more of design of work-flows.

Advertisements

Author: Srini

Experienced software developer. Skills in Development, Coding, Testing and Debugging. Good Data analytic skills (Data Warehousing and BI). Also skills in Mainframe.