Hadoop workflow process to understand easily

The scheduling of Jobs in Hadoop is done by two popular tools. Those are ‘Capacity’ and the other one is ‘Fair’. Since Hadoop is a batch process and it is not designed initially for on-the fly data analysis job. The job scheduling very basics given in my slideshare presentation. It is a good to start it.

The point is up to you to decide which scheduler is useful. A better way is go through all Hadoop documentation.

The next step is How Hadoop process its tasks…

  1. The crystal clear point is it follows FIFO – First in First Out process
  2. That means, after completing one task only it allows another task

The real good book on Hadoop work-flow scheduler using Oozie you can refer to get idea and knowledge on how scheduling can be done in Hadoop environment.

The ‘Capacity’ Scheduler

Capacity handles large clusters that are shared among multiple organizations or groups.

According to Hadoop documentation the capacity scheduler is – The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.

In this multi-tenancy environment, a cluster can have many job types and multiple job priorities.

Organization: As it is designed for situations in which clusters need to support multi-tenancy, its resource sharing is more stringent so as to meet capacity, security, and resource guarantees.

Capacity: Resources are allocated to queues and are shared among the jobs on that queue. It is possible to set soft and hard limits on queue-based resources.

Security: In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to manage queue-based job access. It also permits per-queue administration, so that you can have different settings on the queues.

Elasticity: Free resources from under-utilized queues can be assigned to queues that have reached their capacities. When needed elsewhere, these resources can then be reassigned, thereby maximizing utilization.

Multi-tenancy: In a multi-tenancy environment, a single user’s rogue job could possibly soak up multiple tenants’ resources, which would have a serious impact on job-based service-level agreements (SLAs). The Capacity scheduler provides a range of limits for these multiple jobs, users, and queues so as to avoid this problem.

Resource-based Scheduling: Capacity uses an algorithm that supports memory-based resource scheduling for jobs that are resource intensive.

Hierarchical Queues: When used with Hadoop V2, Capacity supports a hierarchy of queues, so that underutilized resources are first shared among subqueues before they are then allocated to other cluster tenant queues.

Job Priorities: In Hadoop V1, the scheduler supports scheduling by job priority.

Operability: Capacity enables you to change the configuration of a queue at runtime via a console that permits viewing of the queues. In Hadoop V2, you can also stop a queue to let it drain.

The other popular Scheduler is ‘Fair’

Fair aims to do what its name implies: share resources fairly among all jobs within a cluster that is owned and used by a single organization. Over time, it aims to share resources evenly to job pools.

According to Hadoop documentation the Fair is – Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time

Organization: This scheduler organizes jobs into pools, with resources shared among the pools. Attributes, like priorities, act as weights when the resources are shared.

Resource Sharing: You can specify a minimum level of resources to a pool. If a pool is empty, then Fair shares the resources of other pools.

Resource Limits: With Fair, you can specify concurrent job limits by user and pool so as to limit the load on the cluster.

The role of work-flow manager

Capacity, Fair, and similar plug-in schedulers deal with resources allocated to individual jobs over a period of time. However, what about the relationships between jobs and the dependencies between them?

That’s where workflow managers, like Apache’s Oozie, come in.

The critical functionality of Oozie is – Oozie offer the ability to schedule jobs that are organized into workflows by time and event. The deisgn of work-flow in Oozie is done by using XML. Refer more of design of work-flows.


Author: Srini

Experienced software developer. Skills in Development, Coding, Testing and Debugging. Good Data analytic skills (Data Warehousing and BI). Also skills in Mainframe.

Start Discussion

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s