JCL Ideas: How to Restart a Step Using Checkpoint

RD in the JCL EXEC statement:

There are two methods of restarting a job: restarting from a step (step restart) and restarting from a checkpoint (checkpoint restart).

Step restart is simpler and doesn’t require the system to take a checkpoint. You simply code a RESTART parameter on the JOB statement to name the step from which to restart and resubmit the job.
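
As a rough sketch of a deferred step restart, the job below is resubmitted so that execution begins at STEP2. The job name, accounting field, and program names are placeholders invented for illustration, not taken from a real system.

```jcl
//PAYJOB   JOB (ACCT),'STEP RESTART',CLASS=A,MSGCLASS=X,
//             RESTART=STEP2
//*  Hypothetical job: names and accounting values are placeholders.
//*  STEP1 is skipped on this run; execution begins at STEP2.
//STEP1    EXEC PGM=PROGA
//STEP2    EXEC PGM=PROGB
//STEP3    EXEC PGM=PROGC
```

To restart at a step inside a cataloged procedure, the general form is RESTART=stepname.procstepname, where stepname is the step that calls the procedure.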

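A restart can be automatic (the system restarts the job immediately) or deferred (permitting you to examine your output and make the appropriate changes before resubmitting the job). The RD (restart definition) parameter, coded on the EXEC statement for a single step or on the JOB statement for every step, controls whether automatic restart is allowed and whether checkpoints can be taken. A deferred restart is requested with the RESTART parameter when you resubmit the job. Read my other post on restarting the proc step.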
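
The four RD values can be summarized in a small sketch. The step and program names are hypothetical; only one RD value is coded on a given statement, and the comment lines list the alternatives.

```jcl
//*  Hypothetical step; PROGB is a placeholder program name.
//*  RD=R   : automatic restart allowed; checkpoints allowed
//*  RD=RNC : automatic step restart allowed; checkpoints suppressed
//*  RD=NR  : no automatic restart; checkpoints still allowed
//*  RD=NC  : no automatic restart; checkpoints suppressed
//STEP2    EXEC PGM=PROGB,RD=R
```

RD=NR is the usual choice when you want checkpoints written for a later deferred restart but do not want the system to restart the job on its own.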

Automatic restart can occur only if the completion code of the failing step is one of the eligible completion codes specified by the installation, and only if the operator consents. See also the example of resubmitting a job for a checkpoint restart.
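
A minimal sketch of resubmitting a job for a deferred checkpoint restart might look like the following. It assumes the checkpoint data set is cataloged under the made-up name PAY.CHKPT.DATA, that the program writes its checkpoints through a DD named CHKPT (also an assumption), and that the checkpoint to restart from has the ID C0000003. Substitute the checkpoint ID actually reported in the job log of the failed run.

```jcl
//PAYJOB   JOB (ACCT),'CKPT RESTART',CLASS=A,MSGCLASS=X,
//             RESTART=(STEP2,C0000003)
//*  SYSCHK identifies the checkpoint data set for the restart.
//*  It follows the JOB statement and precedes the first EXEC.
//SYSCHK   DD  DSN=PAY.CHKPT.DATA,DISP=(OLD,KEEP)
//STEP1    EXEC PGM=PROGA
//STEP2    EXEC PGM=PROGB,RD=NR
//CHKPT    DD  DSN=PAY.CHKPT.DATA,DISP=(MOD,KEEP)
//STEP3    EXEC PGM=PROGC
```

The first RESTART subparameter names the step and the second names the checkpoint within that step. If the checkpoint data set were not cataloged, the SYSCHK DD statement would also need unit and volume information.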

Checkpoint in JCL

Checkpoints consist of a snapshot of a program’s status at selected points during execution so that if the program terminates for some reason, the run can be restarted from the last checkpoint rather than the beginning of the run.
Checkpointing is worthwhile only when rerunning a large job from the beginning would cost too much money or time. The checkpoints themselves are expensive, complex, and require careful planning.
You may not always be able to successfully restart the run—the problem may be caused by a program error that occurred before the checkpoint or by incorrect data.
You use checkpoints more as a protection against hardware, operating system, and operator errors than as protection against application program errors.
When the system takes a checkpoint, it notes the current position within each open data set but does not copy the data set's contents. This can make restarting from a checkpoint difficult.
If you update a data set after the system takes the checkpoint, the system will not return the data set to its original status for the restart. If you subsequently delete a temporary data set, it will not be present for the restart.
The system can take several checkpoints during the execution of the job step. You must include a DD statement in the step to specify the data set to contain the checkpoints.
For sequential data sets, DISP=OLD rewrites each new checkpoint over the previous one. This is dangerous because if the job terminates while the system is writing a checkpoint, there is no usable checkpoint left. Coding DISP=MOD writes each new checkpoint beyond the end of the previous one and is safer.
Alternatively, you can make the DD statement point to a partitioned data set on a direct-access volume. The system then adds each checkpoint as a member.
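
The checkpoint DD statement described above could be sketched as follows. The DD name CHKPT, the data set names, and the space figures are illustrative assumptions; the DD name must match whatever name the program itself uses when it requests checkpoints.

```jcl
//STEP2    EXEC PGM=PROGB,RD=NR
//*  Sequential checkpoint data set: MOD appends each new checkpoint
//*  after the previous one, so an earlier checkpoint stays usable.
//CHKPT    DD  DSN=PAY.CHKPT.DATA,DISP=(MOD,CATLG,CATLG),
//             UNIT=SYSDA,SPACE=(CYL,(5,5))
//*  Alternative, shown as comments: a partitioned data set on a
//*  direct-access volume, where each checkpoint becomes a member.
//*CHKPT   DD  DSN=PAY.CHKPT.LIB,DISP=(NEW,CATLG,CATLG),
//*            UNIT=SYSDA,SPACE=(CYL,(5,5,10))
```

Cataloging the data set on both normal and abnormal termination is a design choice in this sketch: the checkpoints are needed precisely when the step fails, so the data set must survive an abend.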