UMI (Unique Molecular Identifier)
Our reads are structured as follows.
P5 - 12bp_UMI - 4bp_TSM - INSERT - PolyA - P7'
TSM = Template-Switch Motif (useless for us?)
The P5 index is removed prior to our analysis.
If the INSERT is short enough, the PolyA and P7 index will be sequenced and included in our data.
cutadapt or bbduk can be used to remove them, along with the UMI and TSM.
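As a rough illustration of the layout above (not a replacement for cutadapt or bbduk), the fixed-length prefix can be sliced off in Python. The name `split_read` and the two constants are made up for this sketch. It assumes the P5 index is already gone, so the read starts with the 12bp UMI followed by the 4bp TSM. It does nothing about the PolyA or P7 at the other end.

```python
UMI_LEN = 12  # from the read layout above
TSM_LEN = 4   # from the read layout above


def split_read(seq, qual):
    """Split a read into (umi, tsm, remaining sequence, remaining quality).

    Hypothetical helper: assumes the read begins with UMI + TSM and that
    seq and qual are the same length, as in a FASTQ record.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    start = UMI_LEN + TSM_LEN
    if len(seq) < start:
        raise ValueError("read is shorter than UMI + TSM")
    umi = seq[:UMI_LEN]
    tsm = seq[UMI_LEN:start]
    return umi, tsm, seq[start:], qual[start:]


# Example using the first 30 bases of one of the longer reads listed below.
read = "GAGTCGTCCTGC" + "GCGG" + "AAGATGTAACGGGG"
umi, tsm, rest, rest_qual = split_read(read, "I" * len(read))
print(umi, tsm, rest)
```

The remainder still contains the INSERT plus any PolyA and adapter, so a real trimmer is still needed for the 3' end.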
Before trimming, we use the UMI to identify and remove duplicate reads of the same molecule.
Rather than select just one of these reads, we make a consensus sequence.
This is where things get weird.
I presume that there are a number of methodologies.
The one we use loops over each position in each read and gathers the base call if its quality is above a threshold of 15.
For each position, if the most common base gathered makes up more than 90% of those gathered, it is returned. If it does not, N is returned.
Both 15 and 90% are adjustable.
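The rule described above can be sketched in a few lines of Python. This is my own minimal reading of it, not the actual tool's code: the function name `consensus` and its parameter names are made up, qualities are assumed to be Phred+33 encoded, and reads are compared position by position from the start of the read with no alignment.

```python
from collections import Counter


def consensus(reads, min_qual=15, min_freq=0.9, phred_offset=33):
    """Build a consensus from (sequence, quality_string) pairs sharing a UMI.

    Hypothetical sketch: a base is gathered only if its quality is above
    min_qual, and a position is called only if the most common gathered
    base makes up more than min_freq of the gathered bases.
    """
    if not reads:
        return ""
    length = max(len(seq) for seq, _ in reads)
    out = []
    for i in range(length):
        gathered = [
            seq[i]
            for seq, qual in reads
            if i < len(seq) and ord(qual[i]) - phred_offset > min_qual
        ]
        if not gathered:
            out.append("N")  # nothing passed the quality filter here
            continue
        base, count = Counter(gathered).most_common(1)[0]
        out.append(base if count / len(gathered) > min_freq else "N")
    return "".join(out)


# 19 of 20 reads agree at every position (95%), so the odd base is outvoted.
agree = [("ACGT", "IIII")] * 19 + [("ACTT", "IIII")]
print(consensus(agree))

# 3 of 4 agree at the third position (75%), so it becomes N.
split = [("ACGT", "IIII")] * 3 + [("ACTT", "IIII")]
print(consensus(split))
```

Because positions are compared without alignment, any read that is shifted relative to the others disagrees at most positions, which matters for the example that follows.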
This is sound in theory and is perhaps the best solution.
However, sometimes something is off, and it's not just a stray bad base call.
For example, ...
All UMIs were copied to the read name.
The reads for a given UMI were extracted and the first 100bp were selected for readability.
zgrep --no-group-separator -A3 " GAGTCGTCCTGC$" umi.fastq.gz | paste - - - - | cut -f2 | cut -c1-100 | sort
GAGTCGTCCTGCAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAA
GAGTCGTCCTGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAA
GAGTCGTCCTGCGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAA
GAGTCGTCCTGCGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCAACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAA
GAGTCGTCCTGCGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAA
GAGTCGTCCTGCGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAA
GAGTCGTCCTGCGCGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGA
GAGTCGTCCTGCGCGGAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTT
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAAGATCGGAAGAGCACACGGCTGAACTCCAGTCAAGCCAATATCTCGTATGCCGTCTTCTGCTT
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTT
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTTACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGGAAGAAAAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCT
GAGTCGTCCTGCGCGGAAGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGC
GAGTCGTCCTGCGCGGAAGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGC
GAGTCGTCCTGCGCGGAAGATAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCT
GAGTCGTCCTGCGCGGAAGATGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTC
GAGTCGTCCTGCGCGGAAGATGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTC
GAGTCGTCCTGCGCGGAAGATGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTC
GAGTCGTCCTGCGCGGAAGATGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTC
GAGTCGTCCTGCGCGGAAGATGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGGAAAAAAAAAACAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCT
GAGTCGTCCTGCGCGGAAGATGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGCCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTT
GAGTCGTCCTGCGCGGAAGATGTAACAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGT
GAGTCGTCCTGCGCGGAAGATGTAACAGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCC
GAGTCGTCCTGCGCGGAAGATGTAACGAAAAAAAAAAACAAAAAAAAAAGATCGGAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCG
GAGTCGTCCTGCGCGGAAGATGTAACGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCC
GAGTCGTCCTGCGCGGAAGATGTAACGGGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTAT
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCCAAAAAAAAAAACAAAAAAAAAAGATCGGAAGCGCACACGTCTGAACTCCAGTCACGCCAATCTCTCGC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAAAAAAAAACAAAAAAAAAAGATCGGAAGACCACACGTCTGAACTCCAGTCACCCCAATATCCCGTA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCT
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCCAAAAAAAAACAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCT
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTAAAGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATACAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATACAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATAAAAAAAAAACAAAAAAAAACGTTCGGAAACCCCACCTTCTGACTCCCCGTCACCCCAA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACAGAAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCA
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACG
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACG
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCGAAAAAAAAAACAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCGAAGCAAAAAAAAAACAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAG
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCGAAGCTGCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACCGAAGCTGCGGATGCGTGCCAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACAC
GAGTCGTCCTGCGCGGAAGATGTAACGGGGCTAAGCTATATACGGAAGCTAAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCA
GAGTCGTCCTGCGCGTAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
GAGTCGTCCTGCGCGTAAGAAAAAAAAAACAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGC
We can immediately see that this will not consolidate well.
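The AAAAAAAAAACAAAAAAAAAA sequence drifts through the reads: it starts at a different offset in many of them, so the same position holds a different part of the molecule from read to read.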
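One way to put a number on the drift is to record where that sequence starts in each read and count the distinct offsets. The sketch below is only a diagnostic idea; `marker_offsets` is a made-up name, and the three example reads are synthetic, built from the UMI and TSM seen above plus inserts of different lengths rather than copied from the listing.

```python
from collections import Counter

MARKER = "AAAAAAAAAACAAAAAAAAAA"  # the drifting sequence noted above


def marker_offsets(seqs, marker=MARKER):
    """Count the 0-based start offset of marker in each read (-1 if absent)."""
    return Counter(seq.find(marker) for seq in seqs)


# Synthetic reads: same UMI + TSM, inserts of length 0, 6 and 14.
umi_tsm = "GAGTCGTCCTGC" + "GCGG"
reads = [
    umi_tsm + insert + MARKER + "GATCGG"
    for insert in ("", "AAGATG", "AAGATGTAACGGGG")
]
offsets = marker_offsets(reads)
for offset, n in sorted(offsets.items()):
    print(offset, n)
```

If all reads for a UMI really came from one molecule with one insert length, nearly all of them would share a single offset.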
Deduplication
UMI Tagging / Consolidation https://github.com/ucsffrancislab/umi
umi_tools
Post-alignment deduplication based primarily on position, then UMI.
picard UMIAware....
Post-alignment deduplication based primarily on position, then UMI. Therefore it only works on aligned reads.
aryeelab/umi (our modified version)
Pre-alignment, so if the same UMI tags more than one molecule, this results in a lot of lost data.
my umi_dedup.bash script
Loops over UMIs and keeps the best that aren't similar. Kinda slow. Not working for paired yet.
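The script itself is not shown here, so the following is only one possible reading of "keeps the best that aren't similar", written in Python rather than bash. Everything in it is an assumption: "best" is taken to mean highest mean base quality (Phred+33), "similar" is taken to mean a small number of mismatches over the shared length, and the names `mean_quality`, `mismatches`, `keep_dissimilar` and `max_mismatches` are invented for the sketch.

```python
def mean_quality(qual, phred_offset=33):
    """Mean Phred score of a quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(c) - phred_offset for c in qual) / len(qual)


def mismatches(a, b):
    """Mismatches over the shared length; zip stops at the shorter read."""
    return sum(x != y for x, y in zip(a, b))


def keep_dissimilar(reads, max_mismatches=2):
    """Greedy filter over (sequence, quality) pairs from one UMI.

    Visits reads from best to worst mean quality and keeps a read only if
    it differs from every read already kept by more than max_mismatches.
    """
    kept = []
    ranked = sorted(reads, key=lambda r: mean_quality(r[1]), reverse=True)
    for seq, qual in ranked:
        if all(mismatches(seq, k_seq) > max_mismatches for k_seq, _ in kept):
            kept.append((seq, qual))
    return kept


group = [
    ("ACGTACGT", "IIIIIIII"),  # high quality, kept first
    ("ACGTACGA", "########"),  # 1 mismatch from the first read, dropped
    ("TTTTTTTT", "55555555"),  # 6 mismatches from the first read, kept
]
for seq, _ in keep_dissimilar(group):
    print(seq)
```

Each read is compared against every kept read, so the cost grows with the square of the group size, and a position-by-position mismatch count has the same weakness with shifted reads as the consensus approach.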