IBM Research has developed a concept called Adaptive MapReduce, which extends Hadoop by making individual mappers self-aware and aware of other mappers. This approach enables individual map tasks to adapt to their environment and make efficient decisions.
When a MapReduce job is about to begin, Hadoop divides the data into many pieces, called splits. Each split is assigned a single mapper.
To ensure a balanced workload, these mappers are deployed in waves, and new mappers start once old mappers finish processing their splits.
In this model, a small split size means more mappers, which helps ensure balanced workloads and minimizes failure costs.
However, smaller splits also result in increased cluster overhead due to the higher volumes of startup costs for each map task.
For workloads with high startup costs for map tasks, larger split sizes tend to be more efficient. An adaptive approach to running map tasks gives BigInsights the best of both worlds.
One implementation of Adaptive MapReduce is the concept of an adaptive mapper. Adaptive Mappers extend the capabilities of conventional Hadoop mappers by tracking the state of file splits in a central repository.
Each time an Adaptive Mapper finishes processing a split, it consults this central repository and locks another split for processing until the job is completed. This means that for Adaptive Mappers, only a single wave of mappers is deployed, since the individual mappers remain open to consume additional splits.
The performance cost of locking a new split is far less than the startup cost for a new mapper, which accounts for a significant increase in performance.
The Adaptive Mappers result (see the AM bar) was based on a low split size of 32 MB. Only a single wave of mappers was used, so there were significant performance savings based on avoiding the startup costs for additional mappers.