C4
Developing some notes here.
https://ucsf-cbi.github.io/c4/index.html
Basic Usage
Many TIPCC job submission and queue management commands work on C4, but C4 uses Slurm, whose native commands are sbatch, squeue, and sacct.
Submit jobs
sbatch --job-name HelloWorld --output hello_world.out --account=francislab --partition=common /c4/home/gwendt/tests/hello_world
sbatch --job-name=$(basename $base) --time=480 --ntasks=8 --mem=8G --output=${outdir}/stdout COMMAND_AND_OPTIONS
The account and partition options are likely unnecessary, as the environment variable SBATCH_PARTITION will be set and used by default.
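The same options can also go in the script itself as #SBATCH directives. A minimal sketch, assuming a bash script and made-up resource values:
#!/usr/bin/env bash
#SBATCH --job-name=HelloWorld
#SBATCH --time=480
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --output=hello_world.out
echo "Hello World"
Submit it with sbatch hello_world.sh; any option given on the command line overrides the corresponding directive in the script.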
View Queue
qstat works on both TIPCC and C4.
squeue --long -u $USER
Note: Jobs that are dependent on another job have a status of Q rather than H as on TIPCC.
Note: Jobs that are dependent on a failed job are not automatically canceled as they were on TIPCC.
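A minimal sketch of chaining two jobs with a dependency and cleaning up by hand if the first one fails (the script names are made up; --dependency=afterok and --parsable are standard sbatch options):
first=$(sbatch --parsable first_step.sh)
second=$(sbatch --parsable --dependency=afterok:${first} second_step.sh)
# if first_step.sh fails, the second job is not canceled automatically; cancel it yourself
scancel ${second}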
Hold and Release Jobs
TIPCC used qhold and qrls. C4 uses scontrol hold and scontrol release (see the Control section below).
Cancel / Kill / Delete jobs
TIPCC used qdel or qdel all. C4 uses scancel; qdel also works, but qdel all does not.
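A few scancel forms that should cover the common cases; these are standard scancel options rather than anything C4-specific:
scancel 12345                       # cancel a single job by id
scancel -u $USER                    # cancel all of your jobs, the closest thing to "qdel all"
scancel --state=PENDING -u $USER    # only cancel your jobs that have not started yet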
Advanced Usage
Array Jobs
Still getting a handle on this.
It seems that all tasks in an array job use the same resource request, so if the needs vary between tasks you would have to request the largest amount any one task might need. That may not be true; still investigating.
It appears to work like this: the exact same script is run for every task, with only the array index changing, and the script must use that index to determine what to do.
Again, still investigating and have yet to use an array job.
For example, the following would run 10 tasks, numbered 1 through 10, but only 2 at any given time. Throttling like this is a way to play nice, even with yourself; I have found that I/O can be overwhelmed when many jobs that need little CPU or memory run at the same time, and a throttle would be useful there.
sbatch --output="slurm-%A_%a.out" --array=1-10%2 --wrap="date; echo \$SLURM_ARRAY_TASK_ID; sleep 5"
A script can select a line from a file with something like ...
LINE=$(sed -n "$SLURM_ARRAY_TASK_ID"p File.txt)
There is a maximum task id of 1000 so
sbatch --output="slurm-%A_%a.out" --array=1-1001%2 --wrap="date; echo \$SLURM_ARRAY_TASK_ID; sleep 5"
will fail.
The array script will need to take something like a "page" or "offset" option to handle larger sets, as sketched below.
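A minimal sketch of such a task script, using the sed trick above plus a hypothetical offset argument (File.txt and the echo are placeholders for real work):
#!/usr/bin/env bash
# usage: sbatch --array=1-1000%2 array_task.sh [offset]
offset=${1:-0}
line_number=$(( offset + SLURM_ARRAY_TASK_ID ))
# pick this task's input line from the file
LINE=$(sed -n "${line_number}p" File.txt)
echo "task ${SLURM_ARRAY_TASK_ID} processing: ${LINE}"
To cover lines 1001 through 2000 you would submit the same script again with an offset of 1000.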
The status of all array jobs can be viewed with squeue. Detailed information for every job can be seen with scontrol show job <SLURM_ARRAY_JOB_ID>.
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
103_1 sixhour myJobarr r557e636 R 4:23 1 h001
103_2 sixhour myJobarr r557e636 R 4:23 1 h001
103_3 sixhour myJobarr r557e636 R 4:23 1 h001
103_4 sixhour myJobarr r557e636 R 4:23 1 h001
103_5 sixhour myJobarr r557e636 R 4:23 1 h001
$ scontrol show job 103
JobId=103 ArrayJobId=103 ArrayTaskId=5 JobName=myJobarrayTest
...
JobId=107 ArrayJobId=103 ArrayTaskId=4 JobName=myJobarrayTest
...
JobId=106 ArrayJobId=103 ArrayTaskId=3 JobName=myJobarrayTest
...
JobId=105 ArrayJobId=103 ArrayTaskId=2 JobName=myJobarrayTest
...
JobId=104 ArrayJobId=103 ArrayTaskId=1 JobName=myJobarrayTest
Container Jobs
I've run jobs using containers simply by specifying them in the script or in --wrap.
sbatch="sbatch --mail-user=$(tail -1 ~/.forward) --mail-type=FAIL "
export SINGULARITY_BINDPATH=/francislab
${sbatch} --export=SINGULARITY_BINDPATH --parsable --job-name=T${k}iMOKApreprocess --time=1440 --ntasks=${threads} --mem=${sbatch_mem} --output=${kdir}/iMOKA.preprocess.${date}.txt --wrap="singularity exec ${img} preprocess.sh --input-file ${kdir}/source.tsv --kmer-length ${k} --ram $((threads*mem)) --threads ${threads} --keep-files"
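The same thing can be done from inside a submitted script instead of --wrap; a rough sketch, where the image path and some_command are placeholders and SLURM_NTASKS is set by Slurm to match --ntasks:
#!/usr/bin/env bash
#SBATCH --time=1440
#SBATCH --ntasks=8
#SBATCH --mem=8G
export SINGULARITY_BINDPATH=/francislab
# run the workload inside the container image
singularity exec /path/to/image.img some_command --threads ${SLURM_NTASKS}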
Scratch
Using "scratch" is recommended and can be beneficial.
Scratch is disk space local to the computing node.
It is currently the TMPDIR of each job.
Reading and writing from it will not travel across the network and can therefore speed up processing.
This is most notable when you are creating intermediate files.
Have your script copy your input data and references to TMPDIR.
Write the output to TMPDIR and then move it somewhere more permanent when done.
Slurm will delete the directory when your job ends.
There is a limit of about 2TB available on each node.
Request it by adding something like --gres=scratch:200G to your sbatch call.
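A rough sketch of the copy-in, compute, copy-out pattern; the paths and some_command are made up:
#!/usr/bin/env bash
#SBATCH --gres=scratch:200G
# copy input data and references to the node-local scratch directory
cp /francislab/data/input.dat "${TMPDIR}/"
cd "${TMPDIR}"
# intermediate and output files stay off the network while the job runs
some_command input.dat > output.txt
# move results somewhere permanent; Slurm deletes TMPDIR when the job ends
mv output.txt /francislab/results/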
Control
The scontrol command can be quite useful.
For throttling the number of jobs an array job runs ...
scontrol update JobId=25692 ArrayTaskThrottle=16
For pausing and restarting an array job ...
scontrol hold JobId=25253
scontrol release JobId=25253
For controlling execution order, change the Nice value. The default value is 0. Increase the value to lower the priority; decrease the value to increase it.
scontrol update JobId=25253 Nice=100
Does niceness apply across users? I think it does, but it isn't completely clear.
How large a difference in niceness do jobs need for a job requiring fewer resources to not run ahead of a job requiring more resources that is queued behind it? (A value of 10 didn't seem to work, but 100 did.)
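Niceness can also be set at submission time rather than adjusted afterwards; sbatch accepts a --nice option (the value and script name here are arbitrary):
sbatch --nice=100 low_priority_job.sh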
To control what squeue shows, modify SQUEUE_FORMAT and SQUEUE_SORT.
export SQUEUE_FORMAT="%.24i %.16P %.24j %.8u %.2t %.8M %.4D %.4C %.7m %.6y %R"
export SQUEUE_SORT="M,-i"
Miscellaneous
Migrating commands from TIPCC to C4
qsub to sbatch
C4 doesn't seem to accept a job script on STDIN as TIPCC did.
echo "sleep 60" | qsub
does not work on C4.
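The --wrap option used elsewhere in these notes covers the same one-liner use case:
sbatch --wrap="sleep 60"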
C4 submit-job command line options still to sort out: --mem (does qstat show the memory request?), scratch, and the TIPCC -l feature=nocommunal option.
sbatch --parsable --ntasks=8 --mem=16G --gres=scratch:200G hello_world
11782
qstat to squeue or sacct
[gwendt@cclc01 ~]$ qstat
cclc01.som.ucsf.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2006780.cclc01.som.ucs gwendt batch STDIN -- 1 1 2gb 99:23:59 Q --
C4 has qstat, squeue, and sacct for viewing the queue.
sacct --format="JobID,State,ReqMem,ReqGRES"
sacct --format="JobID,State,AllocNodes,AllocGRES,AllocNodes,ReqNodes,ReqMem,ReqGRES"
squeue
squeue --long -u $USER