Skip to the content.

Cutadapt

Install

python3 -m pip install --user --upgrade cutadapt

Prep

Trying to verbalise all this.

What is the orientation of a proper pair of reads?

If we extract a proper pair aligned reads we get ...

samtools view -f 2 1_10.format.umi.trim.Aligned.sortedByCoord.out.umi_tag.fixmate.deduped.bam | head -2

VH00749:75:AACVVCTM5:1:1105:57068:17225	99	chr1	10001	255	31S101M	=	10001	103	GGCACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC	CCCCCCCCCCCCCCCCCCCCCC-CCCCCCCCCCCCCCCCCCCCCCCCCC;C-CCCC;CCCCCCCCCCCCC-CCC;CC-C--CCCCCC-CCCCCCCCCCCCC;CC-CC;-CCCCC;--CCCCCCCCCCC;CCC	MC:Z:20S103M	RG:Z:1_10	NH:i:1	HI:i:1	nM:i:2	MQ:i:255	AS:i:198	RX:Z:GTAAGTTGGTTGGTTAGG
VH00749:75:AACVVCTM5:1:1105:57068:17225	147	chr1	10001	255	20S103M	=	10001	-103	CCTATCCCTAACCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT	CCC;-CCCCCCCC-CCCCCC-C;;-;CCCC-CCCC--;-C;CCCCC-C--CCC-C;-CCC-;-C-CCCC-C;CC;CCC-CCC;CCCCCCCCC;CCCCCCC;CCCCCCCCCCCCCCCCC-CCC-	MC:Z:31S101M	RG:Z:1_10	NH:i:1	HI:i:1	nM:i:2	MQ:i:255	AS:i:198	RX:Z:GTAAGTTGGTTGGTTAGG

The first read is forward, the second is reverse complemented.

samtools flags 99
0x63	99	PAIRED,PROPER_PAIR,MREVERSE,READ1

samtools flags 147
0x93	147	PAIRED,PROPER_PAIR,REVERSE,READ2

We see here the reads in the bam and the fastq, R1 are the same, and R2 are reverse complemented.

samtools view -f 2 1_10.format.umi.trim.Aligned.sortedByCoord.out.umi_tag.fixmate.deduped.bam | head -2 | awk '{print $10}'
GGCACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
CCTATCCCTAACCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT

zcat 1_10.format.umi.trim.Aligned.sortedByCoord.out.umi_tag.fixmate.deduped.R1.fastq.gz | head -2
@VH00749:75:AACVVCTM5:1:1105:57068:17225/1
GGCACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

zcat 1_10.format.umi.trim.Aligned.sortedByCoord.out.umi_tag.fixmate.deduped.R2.fastq.gz | head -2
@VH00749:75:AACVVCTM5:1:1105:57068:17225/2
AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGATAGG

PREADAPTER - SEQUENCE - POLYA - POSTADAPTER
R1 -    PREADAPTER - SEQUENCE - POLYA   -    POSTADAPTER

R2 - RCPOSTADAPTER -  RCPOLYA - RCSEQUENCE - RCPREADAPTER

I think that the sequencer removes the PREADAPTER during processing. Think that call it an index?

So trim with -a POLYA and -G RCPOLYA

Both of these adapters trim the leftmost first so -a is greedy and -G is NOT greedy.`

Because -G is trimming from the left and not greedy, we need to increase --times

--times 12

or make it longer

-G T{20}

or include varying length of it

-G T{10} -G T{20} -G T{30} -G T{40}

Not sure which is best

Usage

A simple, identical pair of reads with a polyA in the middle

cat cutadapt.R1.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT

cat cutadapt.R2.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT

It is my understanding the R1's are right trimmed (-a) and R2's are left trimmed (-G)

cutadapt --times 1 -a A{10} -o out.R1.fa -p out.R2.fa cutadapt.R1.fa cutadapt.R2.fa

cat out.R?.fa
>1
GTGTAGATCTCGGTGGTCGC
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
cutadapt --times 1 -G A{10} -o out.R1.fa -p out.R2.fa cutadapt.R1.fa cutadapt.R2.fa

cat out.R?.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
>1
AAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT

Both -a and -G choose the first from the left and trim in their direction. They ARE NOT greedy by trimming the largest. This seems odd and perhaps I'm doing something wrong.

In order to force the issue and make -G greedy you need to either make the poly string arbitrarilty longer or increase the --times

cutadapt --times 2 -G A{10} -o out.R1.fa -p out.R2.fa cutadapt.R1.fa cutadapt.R2.fa

cat out.R?.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
>1
AAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
cutadapt --times 3 -G A{10} -o out.R1.fa -p out.R2.fa cutadapt.R1.fa cutadapt.R2.fa

cat out.R?.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
>1
AAAAAAGTCTCGGTGGTCGCCGTATCATT
cutadapt --times 4 -G A{10} -o out.R1.fa -p out.R2.fa cutadapt.R1.fa cutadapt.R2.fa

cat out.R?.fa
>1
GTGTAGATCTCGGTGGTCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCTCGGTGGTCGCCGTATCATT
>1
GTCTCGGTGGTCGCCGTATCATT

Check Check Check

Fragments submitted to sequencer ???

P5 - I5 - R1 Primer - Sequence - poly A - R2 Primer (RC?) - I7 - P7

Sequencer returns what? They left trim at some point and return everything after?

R1 reads the main sequence left to right R2 effectively reads the mate right to left

R1

R2