C4 Sample Array script

Essentially, an array script in slurm is a script that is written to use the environment variable SLURM_ARRAY_TASK_ID. This variable is just an integer. I use it to select a line from a text file or ls output. Nothing really special other than that.

DO NOT qdel 733558_26. I think that the script parses off the first numeric part and cancels the entire array job!

qdel 733795_2
Argument "733795_2" isn't numeric in subroutine entry at /usr/bin/qdel line 96, <DATA> line 602.

I think that scancel 733558_26 is OK.

#!/usr/bin/env bash
#SBATCH --export=NONE   # required when using 'module'

hostname
echo "Slurm job id:${SLURM_JOBID}:"
date

#set -e  #       exit if any command fails
set -u  #       Error on usage of unset variables
set -o pipefail
if [ -n "$( declare -F module )" ] ; then
	echo "Loading required modules"
	module load CBI samtools/1.13 bowtie2/2.4.4 picard
	#bedtools2/2.30.0
fi
#set -x  #       print expanded command before executing it


OUT="/francislab/data1/working/20220610-EV/20220624-preprocessing_with_custom_umi_sequence_extraction/out"

#while [ $# -gt 0 ] ; do
#	case $1 in
#		-d|--dir)
#			shift; OUT=$1; shift;;
#		*)
#			echo "Unknown params :${1}:"; exit ;;
#	esac
#done

mkdir -p ${OUT}


line=${SLURM_ARRAY_TASK_ID:-1}
echo "Running line :${line}:"


#	Use a 1 based index since there is no line 0.

#r1=$( ls -1 /francislab/data1/raw/20220610-EV/SF*R1_001.fastq.gz | sed -n "$line"p )
sample=$( sed -n "$line"p /francislab/data1/working/20220610-EV/20220624-preprocessing_with_custom_umi_sequence_extraction/metadata.csv | awk -F, '{print $1}' )
r1=$( ls /francislab/data1/raw/20220610-EV/${sample}_*R1_001.fastq.gz )

#	Make sure that r1 is unique. NEEDS the UNDERSCORE AFTER SAMPLE!


echo $r1

if [ -z "${r1}" ] ; then
	echo "No line at :${line}:"
	exit
fi

It is called like the following command, where the --array option is used to set the range of values to set the environment variable to. It can be a range like 10-100, a list like 5,8,12,99, or just a single number 13. It is followed by the number of jobs you'd like it to try to run at the same time like %1 or %8. Setting it to 0 means it will try to run as many as it can, I think.

sbatch --mail-user=$(tail -1 ~/.forward)  --mail-type=FAIL --array=1-86%1 --job-name="preproc" --output="/francislab/data1/working/20220610-EV/20220624-preprocessing_with_custom_umi_sequence_extraction/logs/preprocess.${date}-%A_%a.out" --time=2880 --nodes=1 --ntasks=8 --mem=60G --gres=scratch:250G /francislab/data1/working/20220610-EV/20220624-preprocessing_with_custom_umi_sequence_extraction/array_wrapper.bash

You can modify the array job handler, but some things are restricted. If you'd like to throttle up or down the number of jobs running you can use a command like

scontrol update ArrayTaskThrottle=6 JobId=352083

Single self-calling example script

#!/usr/bin/env bash
#SBATCH --export=NONE   # required when using 'module'

hostname
echo "Slurm job id:${SLURM_JOBID}:"
date

#		Will be "slurm_script" at runtime
if [ -n "${SLURM_JOB_NAME}" ] ; then
	script=${SLURM_JOB_NAME}
else
	script=$( basename $0 )
fi

#	PWD preserved by slurm for where job is run? I guess so.
arguments_file=${PWD}/${script}.arguments

if [ -n "${SLURM_ARRAY_TASK_ID}" ] ; then

	set -e	#	exit if any command fails
	set -u	#	Error on usage of unset variables
	set -o pipefail
	if [ -n "$( declare -F module )" ] ; then
		echo "Loading required modules"
		module load CBI bcftools/1.15.1
	fi
	set -x	#	print expanded command before executing it

	line=${SLURM_ARRAY_TASK_ID:-1}
	echo "Running line :${line}:"

	#	Use a 1 based index since there is no line 0.

	args=$( sed -n "$line"p ${arguments_file} )
	echo $args

	if [ -z "${args}" ] ; then
		echo "No line at :${line}:"
		exit
	fi

	#	Do what you gotta do

	echo "Complete"

else

	date=$( date "+%Y%m%d%H%M%S" )

	cat <<- EOF > ${arguments_file}
SFHH011AC
SFHH011BB
SFHH011BZ
SFHH011CH
SFHH011I
SFHH011S
SFHH011Z
SFHH011BO
	EOF

	#	or some file specific list
	#	ls -1 in/*bam | xargs -I% basename % > ${arguments_file}

	max=$( cat ${arguments_file} | wc -l )

	mkdir -p ${PWD}/logs/
	date=$( date "+%Y%m%d%H%M%S%N" )
	sbatch --mail-user=$(tail -1 ~/.forward)  --mail-type=FAIL \
		--array=1-${max}%16 --job-name=${script} \
		--output="${PWD}/logs/${script}.${date}.%A_%a.out" \
		--time=1440 --nodes=1 --ntasks=4 --mem=30G \
		$( realpath ${0} )

fi

There are many ways to do this and I have expanded on this since its writing.

Too large

There are limits on how many jobs can be in your array job. Not sure why, but there is. Perhaps there is an additional limit on the number of total jobs that can be submitted. If you cross that line, you'll see something like ...

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

If this is expected to be common, one could create a single job with ample resources and run all of their "jobs" with parallel. This clearly won't be as flexible once started as it is effectively just 1 slurm job.

Something like ...

echo Command1 > commands.txt
echo Command2 >> commands.txt
echo Command3 >> commands.txt
#etc.

parallel -j ${SLURM_NTASKS:-32} < commands.txt

These are 2 completely different job submission types so you'd need to write 2 different scripts.