UA Research Computing

HPC Examples

Recover and Delete /tmp Files

Overview

Sometimes you may want to use a compute node’s local storage, accessed as /tmp, as scratch space for your workflows. However, if your job runs past its allocated walltime and is killed, you may not be able to retrieve your files before the job ends. This is a problem for you, since you may lose important data, and for the system, since large abandoned files can force nodes to be taken offline for manual cleanup.

One solution is to request a signal from SLURM n seconds before the job’s walltime is reached. The signal can trigger a trap function that copies your files to your own storage and removes them from the scratch space before the job terminates.

Submission Script

Below is an example submission script that uses the trap method to recover the file it generates before the job is killed.

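First, we request that a signal, USR1, be sent n seconds before the job reaches its walltime. The B: prefix tells SLURM to deliver the signal to the batch shell only, which is where our trap will be defined. Additional SLURM documentation on SBATCH directives can be found here: https://slurm.schedmd.com/sbatch.html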

#SBATCH --signal=B:USR1@n

In this case, we’ll request four minutes of runtime and ask for our signal to be sent 30 seconds before that limit is reached:

#SBATCH --time=00:04:00
# asks SLURM to send the USR1 signal 30 seconds before end of the time limit
#SBATCH --signal=B:USR1@30
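
Thirty seconds is plenty for a tiny test file, but for real output the lead time has to cover the copy itself. As a rough sketch, the estimate below divides the data size by an assumed copy rate and adds a safety margin. The function name lead_seconds and every number passed to it are hypothetical placeholders, not measured values for any particular filesystem. The SLURM sbatch documentation also notes that the signal may be sent up to 60 seconds earlier than requested, so the job may receive it somewhat sooner than the value you choose.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many seconds of lead time a copy needs.
#   $1 = data size in MB, $2 = assumed copy rate in MB/s, $3 = safety margin in seconds
lead_seconds()
{
    local size_mb=$1 rate_mb_s=$2 margin_s=$3
    # Integer ceiling of size/rate, plus the margin
    echo $(( (size_mb + rate_mb_s - 1) / rate_mb_s + margin_s ))
}

# Example with made-up numbers: 2000 MB at 100 MB/s with a 30 second margin
n=$(lead_seconds 2000 100 30)
echo "#SBATCH --signal=B:USR1@$n"
```

With these placeholder inputs the copy is estimated at 20 seconds, so the suggested directive asks for the signal 50 seconds before the time limit.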

Next, we’ll define a function called cleanup, run by trap, that retrieves our files and removes them from scratch. We want to retrieve our files either when we get the USR1 signal or when the script exits (bash exposes this to trap as the EXIT pseudo-signal), so we’ll use a variable to record whether the cleanup has already run and avoid executing it twice. Specifically, if the script cleans up after receiving USR1 and then exits, the EXIT trap fires and would otherwise rerun the cleanup. We’ll use CLEANUP to catch this in the following function:

cleanup()
{
    # CLEANUP is empty the first time through; setting it makes a second call a no-op
    if [ -z "$CLEANUP" ] ; then
        export CLEANUP=true
        echo "function 'cleanup' called at $(date)"; echo
        echo "Moving $TMP_PATH/$FILENAME to $SAVE_PATH"
        check_filepath "$SAVE_PATH"
        [ $? -eq 1 ] && echo "Save filepath doesn't exist. Creating" && mkdir -p "$SAVE_PATH"
        check_filepath "$SAVE_PATH"
        cp "$TMP_PATH/$FILENAME" "$SAVE_PATH/$FILENAME"; echo
        echo "Deleting $TMP_PATH"
        rm -rf "$TMP_PATH" ; echo
        echo "Checking"
        check_filepath "$TMP_PATH"
        # exit fires the EXIT trap, which calls cleanup again and takes the else branch
        exit
    else
        echo "'cleanup' already performed. Exiting"
    fi
}

Using trap 'cleanup' USR1 EXIT tells our script to run cleanup when it receives the USR1 signal or when the script exits.
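
To see the guard at work without submitting a job, the following sketch runs a stripped-down version of the same pattern in a child bash process and sends USR1 to itself with kill, standing in for SLURM. The messages and the output variable are illustrative only, and no files are touched.

```shell
#!/bin/bash
# Run the demo in a child bash so its traps and exit don't affect the calling shell
output=$(bash <<'EOF'
cleanup()
{
    if [ -z "$CLEANUP" ] ; then
        CLEANUP=true
        echo "cleanup ran"
        exit    # triggers the EXIT trap, which calls cleanup a second time
    else
        echo "cleanup skipped"
    fi
}
trap 'cleanup' USR1 EXIT
kill -USR1 $$       # stand-in for the signal SLURM would send
echo "not reached"  # cleanup exits before this line runs
EOF
)
echo "$output"
```

The first call comes from USR1 and does the work; the second comes from the EXIT trap and is skipped, so the output is "cleanup ran" followed by "cleanup skipped".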

The full script is included below:

#!/bin/bash
#SBATCH --job-name=save_files_example
#SBATCH --account=hpcteam
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=00:04:00
# asks SLURM to send the USR1 signal 30 seconds before end of the time limit
#SBATCH --signal=B:USR1@30


# File/directory names relevant to this job. The path and file defined with
# TMP_PATH and FILENAME will be created as part of the workflow. To run this
# script, 
SAVE_PATH=$SLURM_SUBMIT_DIR/OUTPUT
TMP_PATH=/tmp/$USER
FILENAME=test.out

# Function to elegantly test the existence of a directory
check_filepath()
{
    [ -d "$1" ] && echo "Directory $1 exists." && return 0
    [ ! -d "$1" ] && echo "Directory $1 does not exist." && return 1
}

# We define a function used to collect our output files and remove anything
# from /tmp that we've generated. Since trap will call it both when USR1 is
# received and when the script exits, CLEANUP ensures it only runs once
cleanup()
{
    if [ -z "$CLEANUP" ] ; then
        export CLEANUP=true
        echo "function 'cleanup' called at $(date)"; echo
        echo "Moving $TMP_PATH/$FILENAME to $SAVE_PATH"
        check_filepath "$SAVE_PATH"
        [ $? -eq 1 ] && echo "Save filepath doesn't exist. Creating" && mkdir -p "$SAVE_PATH"
        check_filepath "$SAVE_PATH"
        cp "$TMP_PATH/$FILENAME" "$SAVE_PATH/$FILENAME"; echo
        echo "Deleting $TMP_PATH"
        rm -rf "$TMP_PATH" ; echo
        echo "Checking"
        check_filepath "$TMP_PATH"
        exit
    else
        echo "'cleanup' already performed. Exiting"
    fi
}

# "trap" will activate our cleanup function either when the USR1 signal is sent
# when we're 30 seconds out from the job hitting its walltime limit or when 
# the job exits
trap 'cleanup' USR1 EXIT

# We'll create a sample workflow that will eat up more time than we've allocated
# to the job. At some point, SLURM will send the USR1 signal, the workflow
# will be interrupted, and the cleanup will activate
echo "starting calculation at $(date)"

echo "Making $TMP_PATH"
mkdir -p "$TMP_PATH"
echo "checking" ; check_filepath "$TMP_PATH"
echo "Performing calculations" >> "$TMP_PATH/$FILENAME"
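for i in $(seq 1 10)
do
    echo "Iteration $i: $(date)" >> "$TMP_PATH/$FILENAME"
    # Run the work in the background and wait on it: bash defers a trap until
    # a foreground command finishes, but wait returns as soon as USR1 arrives
    sleep 30 &
    wait
done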

wait
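
How the workload is launched affects when the trap can run: bash postpones a trap until the current foreground command finishes, whereas the wait builtin returns as soon as a trapped signal arrives. The sketch below, which is an illustration rather than part of the job script, times both cases in child bash processes. A one-second timer stands in for SLURM and a three-second sleep stands in for the calculation; both durations and the variable names are arbitrary choices.

```shell
#!/bin/bash
# Case 1: workload in the foreground. The trap is deferred until sleep finishes.
foreground=$(bash <<'EOF'
trap 'echo "$SECONDS"; exit' USR1
( sleep 1; kill -USR1 $$ ) &    # stand-in for SLURM sending USR1 after 1 second
sleep 3
EOF
)

# Case 2: workload in the background with wait. The trap runs when USR1 arrives.
background=$(bash <<'EOF'
trap 'echo "$SECONDS"; exit' USR1
( sleep 1; kill -USR1 $$ ) &
sleep 3 &
wait $!
EOF
)

echo "foreground workload: trap ran after about ${foreground}s"
echo "background workload: trap ran after about ${background}s"
```

On an otherwise idle machine the first number should be close to 3 and the second close to 1. For a real calculation that runs for hours, the foreground case would mean the trap never gets a chance to run before the job is killed.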

Submitting the Job

[netid@junonia ~]$ sbatch save_files.slurm 
Submitted batch job 271237

Output Files

There will be two output files generated by this job. The first is the standard SLURM output that will show up in your submission directory as soon as your job starts running. The contents should show something like:

[netid@junonia ~]$ cat slurm-271237.out 
starting calculation at Wed Feb  9 15:44:40 MST 2022
Making /tmp/netid
checking
Directory /tmp/netid exists.
function 'cleanup' called at Wed Feb  9 15:48:10 MST 2022

Moving /tmp/netid/test.out to /xdisk/netid/Recover-tmp-Files/OUTPUT
Directory /xdisk/netid/Recover-tmp-Files/OUTPUT does not exist.
Save filepath doesn't exist. Creating
Directory /xdisk/netid/Recover-tmp-Files/OUTPUT exists.

Deleting /tmp/netid

Checking
Directory /tmp/netid does not exist.
'cleanup' already performed. Exiting
Detailed performance metrics for this job will be available at https://metrics.hpc.arizona.edu/#job_viewer?action=show&realm=SUPREMM&resource_id=4&local_job_id=271237 by 8am on 2022/02/10.

The file recovered from the node’s /tmp will be in a directory called OUTPUT inside your submission directory:

[netid@junonia ~]$ cat OUTPUT/test.out 
Performing calculations
Iteration 1: Wed Feb  9 15:44:40 MST 2022
Iteration 2: Wed Feb  9 15:45:10 MST 2022
Iteration 3: Wed Feb  9 15:45:40 MST 2022
Iteration 4: Wed Feb  9 15:46:10 MST 2022
Iteration 5: Wed Feb  9 15:46:40 MST 2022
Iteration 6: Wed Feb  9 15:47:10 MST 2022
Iteration 7: Wed Feb  9 15:47:40 MST 2022
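
The seven iterations line up with the timings requested in the script. As a quick check, the arithmetic below reproduces that count from the walltime, the signal lead time, and the loop interval, assuming the signal arrives exactly when requested (SLURM may send it somewhat earlier). The variable names are illustrative.

```shell
#!/bin/bash
walltime=240    # --time=00:04:00 in seconds
lead=30         # --signal=B:USR1@30
interval=30     # sleep 30 in the loop

# Seconds into the job at which USR1 is expected
signal_at=$(( walltime - lead ))

# Iterations are logged at 0, 30, 60, ... seconds; count those logged before the signal
iterations=$(( (signal_at + interval - 1) / interval ))

echo "signal expected at ${signal_at}s, iterations logged: $iterations"
```

This gives 210 seconds and 7 iterations, which agrees with the cleanup timestamp of 15:48:10 (three and a half minutes after the 15:44:40 start) and with the last entry in test.out.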