Hadoop MapReduce assumes your map and reduce tasks are idempotent. This means the map and reduce tasks can be run any number of times with the same input and produce the same output state. This allows MapReduce to provide fault tolerance in job execution and also take maximum advantage of cluster processing power. You must take care, then, when performing stateful operations. HBase's Increment command is an example of such a stateful operation.
For example, suppose you implement a row-counting MapReduce job that maps over
every key in the table and increments a cell value. When the job is run, the JobTracker
spawns 100 mappers, each responsible for 1,000 rows. While the job is running, a
disk drive fails on one of the TaskTracker nodes. This causes the map task to fail,
and Hadoop assigns that task to another node. Before failure, 750 of the keys were
counted and incremented. When the new instance takes up that task, it starts again
at the beginning of the key range. Those 750 rows are counted twice.
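The double-counting can be sketched with a toy simulation. This is plain Python, not actual Hadoop or HBase API calls; the `counter` dictionary stands in for the incremented cell, and the simulated failure point (750 rows) mirrors the scenario above:

```python
# Toy simulation (not Hadoop code) of why a stateful, non-idempotent
# map task double-counts when it is retried after a partial failure.

counter = {"count": 0}  # stands in for the HBase cell updated via Increment

def run_map_task(rows, fail_after=None):
    """Increment the shared counter once per row; optionally fail partway."""
    for i, _ in enumerate(rows):
        counter["count"] += 1          # stateful side effect per row
        if fail_after is not None and i + 1 == fail_after:
            raise RuntimeError("disk failure")  # simulated node loss

rows = list(range(1000))               # one task's share of the key range

try:
    run_map_task(rows, fail_after=750) # first attempt dies after 750 rows
except RuntimeError:
    pass
run_map_task(rows)                     # retry restarts at the beginning

print(counter["count"])                # 1750, not 1000: 750 rows counted twice
```

Because the increments from the failed attempt have already been applied to shared state, Hadoop has no way to roll them back when it reschedules the task.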
Instead of incrementing the counter in the mapper, a better approach is to emit ["count",1] pairs from each mapper. Failed tasks are recovered, and their output isn't double-counted. Sum the pairs in a reducer, and write out a single value from there. This also avoids placing an undue load on the single machine hosting the incremented cell.
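A minimal sketch of this pattern, again as a toy Python simulation rather than real Hadoop code (the `map_task` and `reduce_task` functions are illustrative stand-ins for the mapper and reducer):

```python
# Toy simulation (not Hadoop code) of the idempotent pattern: mappers emit
# ("count", 1) pairs and a reducer sums them. A failed attempt's output is
# discarded by the framework, so re-running a task cannot inflate the total.

def map_task(rows):
    """Emit one ("count", 1) pair per row; no shared state is mutated."""
    return [("count", 1) for _ in rows]

def reduce_task(pairs):
    """Sum the values for the single "count" key."""
    return sum(value for _, value in pairs)

rows = list(range(1000))

failed_output = map_task(rows)[:750]   # attempt dies partway; this output
                                       # is simply thrown away, not applied
retry_output = map_task(rows)          # only the successful attempt counts

print(reduce_task(retry_output))       # 1000: the retry yields the right total
```

Because the mapper only produces output rather than mutating shared state, retrying it any number of times is harmless, which is exactly the idempotence property the framework relies on.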
Another thing to note is a feature called speculative execution. When certain tasks are running more slowly than others and resources are available on the cluster, Hadoop schedules extra copies of the task and lets them compete. The moment any one of the copies finishes, it kills the remaining ones. This feature can be enabled or disabled through the Hadoop configuration and should be disabled if the MapReduce jobs are designed to interact with HBase, because the duplicate task attempts would repeat any stateful writes, such as Increments, against the table.
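As a sketch, disabling speculative execution might look like the following configuration fragment; the exact property names depend on your Hadoop version, so check the defaults file that ships with your distribution:

```xml
<!-- mapred-site.xml: disable speculative execution for both phases.
     These are the Hadoop 2.x property names; Hadoop 1.x used
     mapred.map.tasks.speculative.execution and
     mapred.reduce.tasks.speculative.execution instead. -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```

The same settings can also be applied per job through the job's Configuration object, which is often preferable to disabling the feature cluster-wide.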