TIPCC
Basic Usage
# put jobs on hold ( qhold )
qhold 1442460
qhold $( qstat | grep gwendt | awk -F. '{print $1}' | paste -s )
qhold $( qstat | grep gwendt | grep .sr | awk -F. '{print $1}' | paste -s )
qhold $( qstat | grep gwendt | grep hrna_1 | awk -F. '{print $1}' | paste -s )
qhold $( qstat | grep gwendt | grep hjd | awk -F. '{print $1}' | paste -s )
qhold $( qstat | grep gwendt | grep .mjf.17 | awk '( $10 == "Q" )' | awk -F. '{print $1}' | paste -s )
# take jobs off hold ( qrls )
qrls $( qstat | grep gwendt | awk -F. '{print $1}' | paste -s )
qrls $( qstat | grep gwendt | grep hrna_11 | awk -F. '{print $1}' | paste -s )
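The qstat | grep | awk | paste pipeline above gets repetitive; a small helper function is one way to factor it out. A minimal sketch (the function name is my own, not a TIPCC convention):
# helper: my job IDs whose qstat line matches a pattern
myjobs () { qstat | grep gwendt | grep "$1" | awk -F. '{print $1}' | paste -s ; }
qhold $( myjobs hrna_1 )
qrls $( myjobs hrna_1 )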
# Just Running
qstat | awk '($10 == "R")'
# Queued or Running
qstat | awk '($10 ~ /R|Q/)'
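The same filters can feed wc -l for a quick count (assuming the same state column):
# count running jobs
qstat | awk '($10 == "R")' | wc -l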
checkjob ##### # substitute a job id for #####
tipcc status --scheduler
echo "echo testing" | qsub -N testing
tipcc node n2
tipcc node n6
tipcc node n38
Modules
Modules that I load
module load CBC
module load htop
module load openssl/1.1.1a
module load zlib
module load python/2.7.10 # 2.7.10 for gdc-client
module load gcc/4.9.2
module load r/3.6.1
module load jdk/8
module load gatk/4.0.2.1
module load coreutils/8.6
module load sqlite
module load cufflinks
module load cmake
module load gawk
module load bedtools2
module load git git-lfs
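To avoid rerunning all of these by hand each session, the same loads can go in a shell startup file. A minimal sketch, assuming the module command is already available in login shells here:
# in ~/.bashrc: load the usual set automatically
module load CBC
module load python/2.7.10 # 2.7.10 for gdc-client
module load r/3.6.1
module list # confirm what actually loaded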
Best Practices
TIPCC has been described as a "fragile flower". It has many limitations, and I've managed to overuse it.
- Don't "stuff" the queue.
- Prefer fewer, longer jobs over many short ones (see the sketch after this list).
- Use local scratch as much as possible.
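For example, per-sample work can be batched into one submission rather than one job per sample. A sketch, assuming a hypothetical samples.txt (one sample per line) and a hypothetical process.bash:
# one job that loops over samples, instead of one job per sample
qsub -N batch -j oe -o batch.out <<'EOF'
while read sample ; do
  process.bash "$sample"
done < samples.txt
EOF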
Old Local Scratch
Use
SCRATCH_JOB=/scratch/$USER/job/$PBS_JOBID
mkdir -p $SCRATCH_JOB
in your scripts.
For some of my jobs this seems like a rather extreme response: copying input files and reference data to local scratch can mean hundreds of gigabytes of transfer. It would almost seem better to maintain a single copy of the references on each local machine.
Clean up when you're done by adding something like the following to your scripts.
trap "{ cd /scratch/; chmod -R +w $SCRATCH_JOB/; \rm -rf $SCRATCH_JOB/ ; }" EXIT
I've noticed many times that this doesn't actually work. My scripts usually include ...
set -e # exit if any command fails
set -u # Error on usage of unset variables
set -o pipefail
so I'm not sure exactly what circumstances result in not cleaning up.
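Putting the pieces above together, a job script skeleton might look like the following (a sketch; the input and output paths are placeholders):
#!/usr/bin/env bash
set -e # exit if any command fails
set -u # Error on usage of unset variables
set -o pipefail

SCRATCH_JOB=/scratch/$USER/job/$PBS_JOBID
mkdir -p $SCRATCH_JOB
trap "{ cd /scratch/; chmod -R +w $SCRATCH_JOB/; \rm -rf $SCRATCH_JOB/ ; }" EXIT

cp ~/input/sample.bam $SCRATCH_JOB/ # placeholder input
cd $SCRATCH_JOB
# ... actual work here ...
cp results.txt ~/output/ # placeholder output; copy off node before exit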
New Local Scratch
https://ucsf-ti.github.io/tipcc-web/good-practices/using-local-scratch.html
The new scratch setup does not need explicit cleanup; it should happen automatically.
Existing scripts could simply have their current SCRATCH_JOB setup lines
SCRATCH_JOB=/scratch/$USER/job/$PBS_JOBID
mkdir -p $SCRATCH_JOB
trap "{ cd /scratch/; chmod -R +w $SCRATCH_JOB/; \rm -rf $SCRATCH_JOB/ ; }" EXIT
replaced with
SCRATCH_JOB=$TMPDIR
Still testing though.
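So the $TMPDIR version of the skeleton above reduces to something like this (a sketch; results still need to be copied off the node before the job exits):
SCRATCH_JOB=$TMPDIR # scheduler-provided, cleaned up automatically
cd $SCRATCH_JOB
# ... actual work here ...
cp results.txt ~/output/ # placeholder output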
Array jobs
Array jobs call the same script some number of times, and the script must decide what to do based on the array ID ($PBS_ARRAYID).
This is most commonly done by parsing a file listing or some file's contents.
Adding a %NUMBER to the -t range limits how many of the array's jobs can run at the same time.
Perhaps this minimizes the load on the scheduler by keeping the remaining jobs on hold?
In general, do jobs on hold have less impact on the scheduler?
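A common way to do that parsing is to treat the array ID as a line number into a list. A minimal sketch, assuming a hypothetical samples.txt with one sample per line:
# pick line $PBS_ARRAYID out of samples.txt
sample=$( sed -n "${PBS_ARRAYID}p" samples.txt )
echo "processing :${sample}:"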
$ jobid=$( echo 'echo ":$PBS_ARRAYID:"; echo ":$PBS_JOBID:"' | qsub -N array -t 1-10%2 -j oe -o array )
$ echo $jobid
1816232[].cclc01.som.ucsf.edu
$ qstat -t ${jobid}
cclc01.som.ucsf.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1816232[1].cclc01.som. gwendt batch array-1 -- 1 1 2gb 99:23:59 Q --
1816232[2].cclc01.som. gwendt batch array-2 -- 1 1 2gb 99:23:59 Q --
1816232[3].cclc01.som. gwendt batch array-3 -- 1 1 2gb 99:23:59 H --
1816232[4].cclc01.som. gwendt batch array-4 -- 1 1 2gb 99:23:59 H --
1816232[5].cclc01.som. gwendt batch array-5 -- 1 1 2gb 99:23:59 H --
1816232[6].cclc01.som. gwendt batch array-6 -- 1 1 2gb 99:23:59 H --
1816232[7].cclc01.som. gwendt batch array-7 -- 1 1 2gb 99:23:59 H --
1816232[8].cclc01.som. gwendt batch array-8 -- 1 1 2gb 99:23:59 H --
1816232[9].cclc01.som. gwendt batch array-9 -- 1 1 2gb 99:23:59 H --
1816232[10].cclc01.som gwendt batch array-10 -- 1 1 2gb 99:23:59 H --
$ cat array-9
:9:
:1816232[9].cclc01.som.ucsf.edu:
Missing
Big items missing from the cluster, mostly due to the very old OS:
Some issues with Jupyter Notebook command line execution and conversion
Python3
RepeatMasker
Docker
MySQL / MariaDB
Megan and Megan Tools
Spark
MetaGO requires dsk 1.6066 (and parse_results), which won't run:
./dsk
./dsk: /opt/gcc/gcc-4.9.2/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./dsk)
./dsk: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsk)
./parse_results
./parse_results: /opt/gcc/gcc-4.9.2/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./parse_results)
./parse_results: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./parse_results)
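One way to confirm this kind of mismatch is to list the GLIBC symbol versions the system libc actually provides (a standard glibc trick, not TIPCC-specific):
# show the GLIBC versions available in the system libc
strings /lib64/libc.so.6 | grep ^GLIBC_ | sort -u
ldd --version | head -1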
From gotcloud:
ftp://share.sph.umich.edu/gotcloud/1.16/gotcloud-bin_1.16.tar.gz
[gwendt@cclc01 ~/gotcloud/bin]$ ./vcfCooker
./vcfCooker: /lib64/libc.so.6: version `GLIBC_2.15' not found (required by ./vcfCooker)
./vcfCooker: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./vcfCooker)