Reference Masking
Using Repeat Masker to block non-specific regions from a reference to minimize false identification.
Aligning raw reads to a viral reference can result in small regions with extremely deep coverage and no coverage anywhere else.
We were find a BeAn_58058 virus quite often.
Many times simply running RepeatMasker on the viral fasta before creating the reference does a good job.
However, sometimes even that isn't good enough.
We've gone so far as to chop up the viral fasta and align it to a human reference and then mask those regions that aligned as well.
Setup
for f in NC_00????.?.masked.fasta ; do
for s in 50 25 ; do
echo $f
b=$( basename $f .fasta )
Split the fasta file into pieces. We chose 50bp with an extra 50bp overlap into a single file. And renamed the output file.
faSplit -oneFile -extra=${s} size ${f} ${s} split
mv split.fa ${b}.split.${s}.fa
In this particular case, we needed a human reference that did not include EBV. Aligning locally performed best. We attempted a number of settings but stuck with --very-sensitive-local
bowtie2 -x hg38-noEBV -f -U ${b}.split.${s}.fa --very-sensitive-local --no-unal -S ${b}.split.${s}.loc.sam
Above and below NEED the complete reference (with its version number as it is in the fasta)
The sequence names are sequential so we can calculate the positions based on this number, the size and the overlap. We do this in the first awk command.
The second awk command attempts to expand and merge these regions.
echo make bed
samtools view ${b}.split.${s}.loc.sam | awk -v s=${s} -v ref=${b%.masked} '{
sub(/^split/,"",$1);
a=1+s*$1
b=a+(2*s-1)
print ref"\t"a"\t"b
}' | awk 'BEGIN{FS=OFS="\t"}
(NR==1){
r=$1
if($2>50){
s=$2-50
}else{
s=1
}
e=$3+50
}
(NR!=1){
if( $2 <= (e+50+1) ){
e=$3+50
}else{
print $1,s,e
s=$2-50
e=$3+50
}
}
END{
print r,s,e
}' > ${b}.${s}.loc.mask.bed
Finally, masking these regions and creating a new fasta using a bedtool
echo maskFastaFromBed
maskFastaFromBed -fi ${f} -fo ${b}.${s}.loc-masked.fa -bed ${b}.${s}.loc.mask.bed -fullHeader
done
done
Sadly, I still find a few regions of extreme depth. This requires further investigation. Align to a better, newer, human reference?
For example,
$ RepeatMasker NC_001716.2.fasta
$ mv NC_001716.2.fasta.masked NC_001716.2.masked.fasta
$ head NC_001716.2.masked.fasta
>NC_001716.2 Human herpesvirus 7, complete genome
CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAA
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTA
GCTCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGGTC
TAACCCTAACCCTAACCCTAAGTCTAACCCTAACCCTAAGTCTAACCCTA
ACCCTAACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAGCTCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
$ head NC_001716.2.masked.split.50.fa
>split0000
CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAA
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
>split0001
GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
>split0002
TCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC
TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTA
>split0003
$ samtools view NC_001716.2.masked.split.50.loc.sam | head
split0000 0 chr10 8468447 9 59S41M * 0 0 CCCCCCGTTTCGTATTTCAAATCCTAAATAACCCCCGGGGGGTAAAAAAAGGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:82 XS:i:66 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU
split0001 0 chr10 8468447 14 9S87M4S * 0 0 GGGGGGGAGCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAAGTCTAACCCTGACCCTAACCCTAAGTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:134 XS:i:106 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5MD:Z:53T4T5C0C8A12 YT:Z:UU
split0002 16 chr18 80263159 11 16M2D83M1S * 0 0 TAGGGTTAGGGTTAGAGCTAGGGTTAGGGTTAGGGTTAGGGTCAGGGTTAGACTTAGGGTTAGGGTCAGGGTTAGACTTAGGGTTAGGGTTAGGGTTAGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:131 XS:i:118 XN:i:0 XM:i:7 XO:i:1XG:i:2 NM:i:9 MD:Z:16^GG1T24T8G0G13T8G0G22 YT:Z:UU
split0003 0 chr10 8468465 11 11S87M2S * 0 0 TAACCCTGACCCTAACCCTAACCCTAACCCTAGCTCTAACCCTAACCCTAGCTCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:134 XS:i:132 XN:i:0 XM:i:5 XO:i:0 XG:i:0NM:i:5 MD:Z:35T3A0T0C29T15 YT:Z:UU
split0004 16 chr18 80263081 11 4S95M1S * 0 0 GACCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGAGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:182 XS:i:178 XN:i:0 XM:i:1 XO:i:0 XG:i:0NM:i:1 MD:Z:65T29 YT:Z:UU
split0005 0 chr5 11199 1 100M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGGTCTAACCCTAACCCTAACCCTAAGTCTAACCCTAACCCTAAGTCTAACCCTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:144 XS:i:144 XN:i:0 XM:i:7 XO:i:0 XG:i:0 NM:i:7MD:Z:46A0C0C22C0C16C0C9 YT:Z:UU
split0006 16 chr22 50808031 11 8S92M * 0 0 GTTAGAGCTAGGGTTAGGGTCAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGGGTTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:152 XS:i:142 XN:i:0 XM:i:4 XO:i:0 XG:i:0NM:i:4 MD:Z:12T38G0G16G22 YT:Z:UU
split0007 16 chr12 133265045 1 100M * 0 0 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGCTAGGGTTAGGGTCAGGGTTAGGGTTAGGGTTAGGGTTAGGGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:176 XS:i:176 XN:i:0 XM:i:3 XO:i:0 XG:i:0NM:i:3 MD:Z:55G1T12T29 YT:Z:UU
split0008 16 chr18 80263033 1 100M * 0 0 TAGGGTTAGACTTAGAGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:144 XS:i:144 XN:i:0 XM:i:7 XO:i:0 XG:i:0NM:i:7 MD:Z:9G0G4G11G0G16G0G53 YT:Z:UU
split0009 16 chr12 133265167 1 95M5S * 0 0 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGACTTAGAGTTAGGGTTAGACTTAGGGTTAGGGTTAGACTTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:150 XS:i:150 XN:i:0 XM:i:5 XO:i:0 XG:i:0NM:i:5 MD:Z:59G0G4G11G0G16 YT:Z:UU
$ cat NC_001716.2.masked.50.loc.mask.bed
NC_001716.2 1 4200
NC_001716.2 7851 8050
NC_001716.2 45801 46000
NC_001716.2 50651 50900
NC_001716.2 108051 108250
NC_001716.2 122051 122300
NC_001716.2 143001 147250
NC_001716.2 150901 151100
DustMasker
I was recently reminded of another product, dustmasker
, which comes with blast
. I have yet to use it though.
It is simple to use, but I don't quite understand the output. It soft-masks the output by downcasing some sequences. At a glance, it appears to mask A LOT of sequence, some of which doesn't look to be "low complexity".
I'm not sure how all aligners deal with soft-masked sequences.
$ dustmasker -in viral.genomic.fa -outfmt fasta -out viral.genomic.dust.default.fa
$ sdiff viral.genomic.fa viral.genomic.dust.default.fa | head
>NC_001798.2 Human herpesvirus 2 strain HG52, complete genome >NC_001798.2 Human herpesvirus 2 strain HG52, complete genome
AGTCCCCGTCTTGCCGCGCGGGGGCGGGCGCGGGAAAAAAGCCGCGCGGGGGCGCCCGCGG | AGTCCCCGTCTTGCcgcgcgggggcgggcgcgggaaaaaagccgcgcgggggcgcccgcg
GCGGGGGGAGGGGCGGCGCCCGCGGGGGAGCGGCCGGCTCCGGGGGAGGGACGGGGAAGGG | ggaaggcagccccgcggcgcgcggggggaggggcggcgcccgcgggggagcggccggctc
CCGCCCGCCCGCCGCCGCCGCCCGCCTTCGCGCCCCCCCCCAAAAAACACCCCCCCCGGGG | cgggggagggacggggaagggggcgcgcggggctgccctgccgcccgcccgccgccgccg
AGAGGCGGGGCGGGAGTCCCCGTCCTGCCGCCGCCCCTTAAGAGGGCCCGCAACACGGCCC | cccgccttcgcgcccccccccaaaaaacaccccccccgGGGGTTGACTCCCCGGGGGAAA
GGACGGGTGAGTTCGCTAGGCAAGCACGGACTGGCGGTTACACGTGCATGCGTGCCGAGTG | AGAGGCGGGGCGGGAGTCCCCGTCCTGCCGCCGCCCCTTAAGAGGGCCCGCAACACGGCC
GCTCCGGCTCCGGGCCTACGCCGAGCCCAGCCGCCCGCCATGTCCCGCCGCCGGGGTCCCC | CGGGCTGCGCACGCCAGCCGGGACGGGTGAGTTCGCTAGGCAAGCACGGACTGGCGGTTA
CCGGCCGCGCCCCGGCGCTCCAGCCGTGCCGCGCCCCGGCGCTCCAGCCGTGCCGCGCCCC | CACGTGCATGCGTGCCGAGTGAACTCTCCCGCCCCGACGCGCtccggctccgggcctacg
ACTCCCAAATGGTCCCTGCGTACGACTCGGGAACCGCGGTCGAGAGCGCGCCGGCCGCGTC | ccgagcccagccgcccgccatgtcccgccgccggggtccccgccgccggggtccccggcg
CTGCTGGTGCCCCAGGCGGACGACAGCGACGACGCGGACTACGCCGGCAACGACGACGCAG | ccggccgcgccccggcgctccagccgtgccgcgccccggcgctccagccgtgccgcgccc