
KMC

https://github.com/refresh-bio/KMC

Initially used through iMOKA.

KMC appears to be much faster and less resource-intensive than Jellyfish.

Now trying to use it on its own, outside of iMOKA.

Run KMC to count all kmers into a database

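kmer_len=9                  # placeholder; overridden by the loops below
threads=8                   # -t: number of threads
maxRam=32                   # -m: maximum RAM in GB
minCounts="0"               # -ci: exclude kmers occurring fewer than this many times
file_type="fq"              # input format flag (-fq = FASTQ)
kmcCounterVal="4294967295"  # -cs: maximal counter value (2^32 - 1)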

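mkdir -p kmc_dir

# TMPDIR is passed to KMC as its working directory, so it must be set and exist.
for kmer_len in 11 21 31 ; do
  echo $kmer_len
  for id in SFHH005aa SFHH005ab SFHH005ac SFHH005ad SFHH005ae ; do
    echo $id
    # List this sample's FASTQ files; KMC reads the list through the @file argument.
    ls -1 in/${id}_*.fastq.gz > kmc_dir/kmc_input
    # -b turns off conversion of kmers to canonical form.
    ./bin/kmc -k${kmer_len} -t${threads} -m${maxRam} -cs${kmcCounterVal} -ci${minCounts} -b -${file_type} @kmc_dir/kmc_input kmc_dir/${id}.${kmer_len} "${TMPDIR}" 2>> kmc_dir/kmc.err >> kmc_dir/kmc.out
  done
done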
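
Because the loop sends both output streams to log files, a failed run is easy to miss. KMC writes each database as a pair of files, `<prefix>.kmc_pre` and `<prefix>.kmc_suf`, so one cheap sanity check is to confirm both exist and are non-empty. The helper name `check_kmc_db` is hypothetical and this is only a sketch of such a check:

```shell
# Hypothetical helper: succeed only if both KMC database files exist and are non-empty.
check_kmc_db() {
  # $1 = database prefix, e.g. kmc_dir/SFHH005aa.21
  local prefix=$1 ext status=0
  for ext in kmc_pre kmc_suf ; do
    if [ ! -s "${prefix}.${ext}" ] ; then
      echo "missing or empty: ${prefix}.${ext}" >&2
      status=1
    fi
  done
  return $status
}
```

It could be called as `check_kmc_db kmc_dir/${id}.${kmer_len}` right after each `kmc` command.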

Normalize by dividing each count by the sample's total number of kmers, scaled to counts per billion

for kmer_len in 11 21 31 ; do
  echo $kmer_len
  for id in SFHH005aa SFHH005ab SFHH005ac SFHH005ad SFHH005ae ; do
    echo $id
    # Database prefix written by the counting step above.
    db=kmc_dir/${id}.${kmer_len}
    # Dump the database as sorted, tab-separated text: kmer, count.
    ./bin/kmc_tools transform ${db} dump -s ${db}.raw.tsv
    # Total kmer count is the sum of column 2.
    datamash sum 2 < ${db}.raw.tsv > ${db}.total_kmer_count
    total_kmer_count=$( cat ${db}.total_kmer_count )
    awk -v sample=${id} -v total_kmer_count=${total_kmer_count} 'BEGIN{FS=OFS="\t";print "kmer",sample}{$2=$2*1000000000/total_kmer_count;print}' ${db}.raw.tsv > ${db}.normalized.tsv
  done
done
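
To make the arithmetic concrete, here is a small self-contained sketch that applies the same rescaling to a made-up three-kmer dump. The function name `normalize_counts` and the toy data are assumptions for illustration, and `awk` stands in for `datamash sum 2` so the example has no extra dependency:

```shell
# Hypothetical helper: rescale a two-column dump (kmer <TAB> count) to counts per billion.
normalize_counts() {
  # $1 = sample name for the header, $2 = raw TSV file
  local sample=$1 raw=$2 total
  # Sum of column 2 (awk used here in place of datamash).
  total=$(awk -F'\t' '{s+=$2} END{print s+0}' "$raw")
  awk -v sample="$sample" -v total="$total" \
    'BEGIN{FS=OFS="\t"; print "kmer", sample} {$2 = $2 * 1000000000 / total; print}' "$raw"
}

# Toy input: counts 2, 3 and 5, so the total is 10.
demo=$(mktemp)
printf 'AAA\t2\nAAC\t3\nACG\t5\n' > "$demo"
normalize_counts demo_sample "$demo"
rm -f "$demo"
```

With a total of 10, the three rows come out as 200000000, 300000000 and 500000000. Values that are not whole numbers are printed with awk's default precision of six significant digits.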

Merge all samples into a single giant matrix.

join needs both inputs sorted on the kmer column, which the sorted dump (dump -s) provides. Each join can take between 5 and 45 minutes at k=18, so while this approach needs very little memory, it is slow.

for kmer_len in 11 21 31 ; do
  echo $kmer_len
  i=0
  for tsv in kmc_dir/*.${kmer_len}.normalized.tsv ; do
    if [ $i -eq 0 ] ; then
      # The first file seeds the matrix.
      base_tsv=${tsv}
    else
      echo "Joining ${base_tsv} ${tsv}"
      # Full outer join on the kmer column; missing values are filled with 0.
      join --header -a1 -a2 -e0 -oauto ${base_tsv} ${tsv} > kmc_dir/joined.${kmer_len}.normalized.tmp.${i}.tsv
      base_tsv=kmc_dir/joined.${kmer_len}.normalized.tmp.${i}.tsv
    fi
    i=$((i+1))
  done
  # join writes space-separated output; convert the final matrix back to tabs.
  sed -i -e 's/ /\t/g' ${base_tsv}
done
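
A toy run of the same `join` invocation shows what the outer join does to kmers that are missing from one sample. The two input files and the wrapper name `join_pair` are made up for illustration, and GNU `join` is assumed (for `--header` and `-o auto`):

```shell
# Hypothetical wrapper around the join call used above, followed by the tab conversion.
join_pair() {
  # $1, $2 = TSV files with a header line, sorted by kmer
  join --header -a1 -a2 -e0 -oauto "$1" "$2" | sed -e 's/ /\t/g'
}

demo_dir=$(mktemp -d)
printf 'kmer\tsampleA\nAAA\t1\nAAC\t2\n' > "$demo_dir/a.tsv"
printf 'kmer\tsampleB\nAAC\t5\nACG\t7\n' > "$demo_dir/b.tsv"
join_pair "$demo_dir/a.tsv" "$demo_dir/b.tsv"
rm -r "$demo_dir"
```

AAA is only in the first file and ACG only in the second, so each gets a 0 in the other sample's column, while AAC keeps both values.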

May want to try implementing something in C explicitly designed for this task.

Do one thing and do it well.

./data/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/merge.c
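
As a stopgap in shell, the repeated pairwise joins could in principle be replaced by one streaming pass: tag each row with its sample's column index, merge the already-sorted files with `sort -m`, and let `awk` assemble one matrix row per kmer. This is only a sketch under assumptions (bash, inputs sorted in byte order, which holds for uppercase ACGT kmers), it has not been timed on real data, and it says nothing about how merge.c works. The name `merge_matrix` is hypothetical:

```shell
# Hypothetical single-pass merge of N sorted, two-column TSVs (header + kmer <TAB> value).
merge_matrix() {
  local n=$# i=0 f header="kmer" tmpdir
  local tagged=()
  tmpdir=$(mktemp -d)
  for f in "$@" ; do
    i=$((i+1))
    # Collect the sample name from each header line.
    header="${header}"$'\t'"$(head -n1 "$f" | cut -f2)"
    # Drop the header and append the column index as a third field.
    tail -n +2 "$f" | awk -v col="$i" 'BEGIN{FS=OFS="\t"} {print $1, $2, col}' > "$tmpdir/$i.tsv"
    tagged+=("$tmpdir/$i.tsv")
  done
  printf '%s\n' "$header"
  # Merge the sorted streams, then emit one row each time the kmer changes.
  LC_ALL=C sort -m -t $'\t' -k1,1 "${tagged[@]}" | awk -v n="$n" '
    BEGIN { FS = OFS = "\t" }
    function emit(   j, line) {
      line = key
      for (j = 1; j <= n; j++) line = line OFS ((j in row) ? row[j] : 0)
      print line
      split("", row)   # clear the per-kmer buffer
    }
    NR > 1 && $1 != key { emit() }
    { key = $1; row[$3] = $2 }
    END { if (NR > 0) emit() }'
  rm -r "$tmpdir"
}
```

Usage would look like `merge_matrix kmc_dir/*.21.normalized.tsv > kmc_dir/joined.21.normalized.tsv`. Only one kmer's values are held in memory at a time, at the cost of a temporary tagged copy of each input.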