Skip to the content.

Reference Masking

Using Repeat Masker to block non-specific regions from a reference to minimize false identification.

Aligning raw reads to a viral reference can result in small regions with extremely deep coverage and no coverage anywhere else.

We were find a BeAn_58058 virus quite often.

Many times simply running RepeatMasker on the viral fasta before creating the reference does a good job.

However, sometimes even that isn't good enough.

We've gone so far as to chop up the viral fasta and align it to a human reference and then mask those regions that aligned as well.

Setup

for f in NC_00????.?.masked.fasta ; do
for s in 50 25 ; do

	echo $f

	b=$( basename $f .fasta )

Split the fasta file into pieces. We chose 50bp with an extra 50bp overlap into a single file. And renamed the output file.

	faSplit -oneFile -extra=${s} size ${f} ${s} split
	mv split.fa ${b}.split.${s}.fa

In this particular case, we needed a human reference that did not include EBV. Aligning locally performed best. We attempted a number of settings but stuck with --very-sensitive-local

	bowtie2 -x hg38-noEBV -f -U ${b}.split.${s}.fa --very-sensitive-local --no-unal -S ${b}.split.${s}.loc.sam

Above and below NEED the complete reference (with its version number as it is in the fasta)

The sequence names are sequential so we can calculate the positions based on this number, the size and the overlap. We do this in the first awk command.

The second awk command attempts to expand and merge these regions.


	echo make bed
	samtools view ${b}.split.${s}.loc.sam | awk -v s=${s} -v ref=${b%.masked} '{
		sub(/^split/,"",$1);
		a=1+s*$1
		b=a+(2*s-1)
		print ref"\t"a"\t"b
	}' | awk 'BEGIN{FS=OFS="\t"}
	(NR==1){
		r=$1
		if($2>50){
			s=$2-50
		}else{
			s=1
		}
		e=$3+50
	}
	(NR!=1){
		if( $2 <= (e+50+1) ){
			e=$3+50
		}else{
			print $1,s,e
			s=$2-50
			e=$3+50
		}
	}
	END{
		print r,s,e
	}' > ${b}.${s}.loc.mask.bed

Finally, masking these regions and creating a new fasta using a bedtool

	echo maskFastaFromBed
	maskFastaFromBed -fi ${f} -fo ${b}.${s}.loc-masked.fa -bed ${b}.${s}.loc.mask.bed -fullHeader

done
done

Sadly, I still find a few regions of extreme depth. This requires further investigation. Align to a better, newer, human reference?

For example,

$ RepeatMasker NC_001716.2.fasta
$ mv NC_001716.2.fasta.masked NC_001716.2.masked.fasta
$ head NC_001716.2.masked.fasta
>NC_001716.2 Human herpesvirus 7, complete genome
CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAA
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTA
GCTCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGGTC
TAACCCTAACCCTAACCCTAAGTCTAACCCTAACCCTAAGTCTAACCCTA
ACCCTAACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAGCTCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

$ head NC_001716.2.masked.split.50.fa
>split0000
CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAA
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
>split0001
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
>split0002
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTA
>split0003

$ samtools view NC_001716.2.masked.split.50.loc.sam | head
split0000	0	chr10	8468447	9	59S41M	*	0	0	CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAAGGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:82	XS:i:66	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:41	YT:Z:UU
split0001	0	chr10	8468447	14	9S87M4S	*	0	0	GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:134	XS:i:106	XN:i:0	XM:i:5	XO:i:0	XG:i:0	NM:i:5MD:Z:53T4T5C0C8A12	YT:Z:UU
split0002	16	chr18	80263159	11	16M2D83M1S	*	0	0	TAGGGTTAGGGTTAGAGCTAGGGTTAGGGTTAGGGTTAGGGTCAGGGTTAGACTTAGGGTTAGGGTCAGGGTTAGACTTAGGGTTAGGGTTAGGGTTAGA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:131	XS:i:118	XN:i:0	XM:i:7	XO:i:1XG:i:2	NM:i:9	MD:Z:16^GG1T24T8G0G13T8G0G22	YT:Z:UU
split0003	0	chr10	8468465	11	11S87M2S	*	0	0	TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAAC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:134	XS:i:132	XN:i:0	XM:i:5	XO:i:0	XG:i:0NM:i:5	MD:Z:35T3A0T0C29T15	YT:Z:UU
split0004	16	chr18	80263081	11	4S95M1S	*	0	0	GACCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGAGC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:182	XS:i:178	XN:i:0	XM:i:1	XO:i:0	XG:i:0NM:i:1	MD:Z:65T29	YT:Z:UU
split0005	0	chr5	11199	1	100M	*	0	0	CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGGTCTAACCCTAACCCTAACCCTAAGTCTAACCCTAACCCTAAGTCTAACCCTA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:144	XS:i:144	XN:i:0	XM:i:7	XO:i:0	XG:i:0	NM:i:7MD:Z:46A0C0C22C0C16C0C9	YT:Z:UU
split0006	16	chr22	50808031	11	8S92M	*	0	0	GTTAGAGCTAGGGTTAGGGTCAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGGGTTA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:152	XS:i:142	XN:i:0	XM:i:4	XO:i:0	XG:i:0NM:i:4	MD:Z:12T38G0G16G22	YT:Z:UU
split0007	16	chr12	133265045	1	100M	*	0	0	GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGCTAGGGTTAGGGTCAGGGTTAGGGTTAGGGTTAGGGTTAGGGT	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:176	XS:i:176	XN:i:0	XM:i:3	XO:i:0	XG:i:0NM:i:3	MD:Z:55G1T12T29	YT:Z:UU
split0008	16	chr18	80263033	1	100M	*	0	0	TAGGGTTAGACTTAGAGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:144	XS:i:144	XN:i:0	XM:i:7	XO:i:0	XG:i:0NM:i:7	MD:Z:9G0G4G11G0G16G0G53	YT:Z:UU
split0009	16	chr12	133265167	1	95M5S	*	0	0	GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGACTTAGAGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	AS:i:150	XS:i:150	XN:i:0	XM:i:5	XO:i:0	XG:i:0NM:i:5	MD:Z:59G0G4G11G0G16	YT:Z:UU

$ cat NC_001716.2.masked.50.loc.mask.bed
NC_001716.2	1	4200
NC_001716.2	7851	8050
NC_001716.2	45801	46000
NC_001716.2	50651	50900
NC_001716.2	108051	108250
NC_001716.2	122051	122300
NC_001716.2	143001	147250
NC_001716.2	150901	151100

DustMasker

I was recently reminded of another product, dustmasker, which comes with blast. I have yet to use it though.

It is simple to use, but I don't quite understand the output. It soft-masks the output by downcasing some sequences. At a glance, it appears to mask A LOT of sequence, some of which doesn't look to be "low complexity".

I'm not sure how all aligners deal with soft-masked sequences.

$ dustmasker -in viral.genomic.fa -outfmt fasta -out viral.genomic.dust.default.fa

$ sdiff viral.genomic.fa viral.genomic.dust.default.fa | head

>NC_001798.2 Human herpesvirus 2 strain HG52, complete genome	>NC_001798.2 Human herpesvirus 2 strain HG52, complete genome
AGTCCCCGTCTTGCCGCGCGGGGGCGGGCGCGGGAAAAAAGCCGCGCGGGGGCGCCCGCGG |	AGTCCCCGTCTTGCcgcgcgggggcgggcgcgggaaaaaagccgcgcgggggcgcccgcg
GCGGGGGGAGGGGCGGCGCCCGCGGGGGAGCGGCCGGCTCCGGGGGAGGGACGGGGAAGGG |	ggaaggcagccccgcggcgcgcggggggaggggcggcgcccgcgggggagcggccggctc
CCGCCCGCCCGCCGCCGCCGCCCGCCTTCGCGCCCCCCCCCAAAAAACACCCCCCCCGGGG |	cgggggagggacggggaagggggcgcgcggggctgccctgccgcccgcccgccgccgccg
AGAGGCGGGGCGGGAGTCCCCGTCCTGCCGCCGCCCCTTAAGAGGGCCCGCAACACGGCCC |	cccgccttcgcgcccccccccaaaaaacaccccccccgGGGGTTGACTCCCCGGGGGAAA
GGACGGGTGAGTTCGCTAGGCAAGCACGGACTGGCGGTTACACGTGCATGCGTGCCGAGTG |	AGAGGCGGGGCGGGAGTCCCCGTCCTGCCGCCGCCCCTTAAGAGGGCCCGCAACACGGCC
GCTCCGGCTCCGGGCCTACGCCGAGCCCAGCCGCCCGCCATGTCCCGCCGCCGGGGTCCCC |	CGGGCTGCGCACGCCAGCCGGGACGGGTGAGTTCGCTAGGCAAGCACGGACTGGCGGTTA
CCGGCCGCGCCCCGGCGCTCCAGCCGTGCCGCGCCCCGGCGCTCCAGCCGTGCCGCGCCCC |	CACGTGCATGCGTGCCGAGTGAACTCTCCCGCCCCGACGCGCtccggctccgggcctacg
ACTCCCAAATGGTCCCTGCGTACGACTCGGGAACCGCGGTCGAGAGCGCGCCGGCCGCGTC |	ccgagcccagccgcccgccatgtcccgccgccggggtccccgccgccggggtccccggcg
CTGCTGGTGCCCCAGGCGGACGACAGCGACGACGCGGACTACGCCGGCAACGACGACGCAG |	ccggccgcgccccggcgctccagccgtgccgcgccccggcgctccagccgtgccgcgccc