Hadoop Job Scheduling Tools: Capacity Vs. Fair

The two top tools to schedule a job in Hadoop are Capacity and Fair. Before diving into the scheduling tools, here is a handy guide to refer to on job scheduling basics

Hadoop is a batch processing ecosystem. So it cannot analyze data on-the-fly.
Hadoop Scheduler

Job scheduling process in Hadoop

Moving on to tasks, Hadoop schedules tasks in FIFO order (Firs-in-First-out). Here is the best book Hadoop workflow scheduler using Oozie to get a fair idea on scheduling the jobs.

1. Capacity Scheduler

Capacity handles large clusters that are shared among multiple organizations or groups.
According to Hadoop documentation the capacity scheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.

Components in Capacity

In this multi-tenancy environment, a cluster can have many job types and multiple job priorities.

Organization

As it is designed for situations in which clusters need to support multi-tenancy, its resource sharing is more stringent so as to meet capacity, security, and resource guarantees.

Capacity

Resources are allocated to queues and are shared among the jobs on that queue. It is possible to set soft and hard limits on queue-based resources.

Security

In a multi-tenancy cluster, security is a major concern. Capacity uses access control lists (ACLs) to manage queue-based job access.

It also permits per-queue administration, so that you can have different settings on the queues.

Elasticity

Free resources from under-utilized queues can be assigned to queues that have reached their capacities. When needed elsewhere, these resources can then be reassigned, thereby maximizing utilization.

Multi-tenancy

In a multi-tenancy environment, a single user’s rogue job could possibly soak up multiple tenants’ resources, which would have a serious impact on job-based service-level agreements (SLAs).

The Capacity scheduler provides a range of limits for these multiple jobs, users, and queues so as to avoid this problem.

Resource-based Scheduling

Capacity uses an algorithm that supports memory-based resource scheduling for jobs that are resource intensive.

Hierarchical Queues

When used with Hadoop V2, Capacity supports a hierarchy of queues, so that underutilized resources are first shared among subqueues before they are then allocated to other cluster tenant queues.

Job Priorities

In Hadoop V1, the scheduler supports scheduling by job priority.

Operability

Capacity enables you to change the configuration of a queue at run-time via a console that permits viewing of the queues.

In Hadoop V2, you can also stop a queue to let it drain.

Workflow in Hadoop

2. Fair Scheduler

Fair aims to do what its name implies: share resources fairly among all jobs within a cluster that is owned and used by a single organization. Over time, it aims to share resources evenly to job pools.
According to Hadoop documentation the Fair is – Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time

Components in Fair

Organization

This scheduler organizes jobs into pools, with resources shared among the pools. Attributes, like priorities, act as weights when the resources are shared.

Resource Sharing

You can specify a minimum level of resources to a pool. If a pool is empty, then Fair shares the resources of other pools.

Resource Limits

With Fair, you can specify concurrent job limits by user and pool so as to limit the load on the cluster.

The role of work-flow manager

Capacity, Fair, and similar plug-in schedulers deal with resources allocated to individual jobs over a period of time. However, what about the relationships between jobs and the dependencies between them?

That’s where workflow managers, like Apache’s Oozie, come in.

The critical functionality of Oozie is – Oozie offer the ability to schedule jobs that are organized into workflows by time and event. The deisgn of work-flow in Oozie is done by using XML. Refer more of design of work-flows.

Related Posts

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.

Fediverse reactions

Latest Posts

2-Minute Random Topics to Improve English Speaking Skills

June 14, 2026
Spark Structured Streaming vs Spark Declarative Pipelines (SDP): What Every Data Engineer Should Know

June 7, 2026
Linking Words Practice: Improve Your English Speaking Fluency Naturally

May 31, 2026
Databricks DLT with @dp: A Complete Guide to Streaming and Batch Processing

May 31, 2026

Job scheduling process in Hadoop