The two top tools to schedule a job in Hadoop are Capacity and Fair. Before diving into the scheduling tools, here is a handy guide to refer to on job scheduling basics
Job scheduling process in Hadoop
Moving on to tasks, Hadoop schedules tasks in FIFO order (Firs-in-First-out). Here is the best book Hadoop workflow scheduler using Oozie to get a fair idea on scheduling the jobs.
1. Capacity Scheduler
- Capacity handles large clusters that are shared among multiple organizations or groups.
- According to Hadoop documentation the capacity scheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
Components in Capacity
In this multi-tenancy environment, a cluster can have many job types and multiple job priorities.
As it is designed for situations in which clusters need to support multi-tenancy, its resource sharing is more stringent so as to meet capacity, security, and resource guarantees.
Resources are allocated to queues and are shared among the jobs on that queue. It is possible to set soft and hard limits on queue-based resources.
In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to manage queue-based job access.
It also permits per-queue administration, so that you can have different settings on the queues.
Free resources from under-utilized queues can be assigned to queues that have reached their capacities. When needed elsewhere, these resources can then be reassigned, thereby maximizing utilization.
In a multi-tenancy environment, a single user’s rogue job could possibly soak up multiple tenants’ resources, which would have a serious impact on job-based service-level agreements (SLAs).
The Capacity scheduler provides a range of limits for these multiple jobs, users, and queues so as to avoid this problem.
Capacity uses an algorithm that supports memory-based resource scheduling for jobs that are resource intensive.
When used with Hadoop V2, Capacity supports a hierarchy of queues, so that underutilized resources are first shared among subqueues before they are then allocated to other cluster tenant queues.
In Hadoop V1, the scheduler supports scheduling by job priority.
Capacity enables you to change the configuration of a queue at run-time via a console that permits viewing of the queues.
In Hadoop V2, you can also stop a queue to let it drain.
2. Fair Scheduler
- Fair aims to do what its name implies: share resources fairly among all jobs within a cluster that is owned and used by a single organization. Over time, it aims to share resources evenly to job pools.
- According to Hadoop documentation the Fair is – Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time
Components in Fair
This scheduler organizes jobs into pools, with resources shared among the pools. Attributes, like priorities, act as weights when the resources are shared.
You can specify a minimum level of resources to a pool. If a pool is empty, then Fair shares the resources of other pools.
With Fair, you can specify concurrent job limits by user and pool so as to limit the load on the cluster.
The role of work-flow manager
Capacity, Fair, and similar plug-in schedulers deal with resources allocated to individual jobs over a period of time. However, what about the relationships between jobs and the dependencies between them?
That’s where workflow managers, like Apache’s Oozie, come in.
The critical functionality of Oozie is – Oozie offer the ability to schedule jobs that are organized into workflows by time and event. The deisgn of work-flow in Oozie is done by using XML. Refer more of design of work-flows.
You must be logged in to post a comment.