Features and Feature Counting

Let's evaluate the expression of each sample to LTRs and/or SVAs.

We could find/create a reference and directly align our samples to this reference and count the reads that successfully align.

Or we could align to a human reference and then use featureCounts from subread and a feature file.

While I can't find a gtf/gff file for LTRs and SVAs, there are bed files available through IGV.

IGV
File -> Load from Server ...
Available Datasets -> Annotations -> Repeat Masker 
LTR
Retroposon

Once loaded, hovering over the gene track will show a URL to download the bed files.

Or we could look directly in the IGV github repo

We could download them all from their S3 bucket

aws s3 sync s3://igv.org.genomes/hg38/rmsk/ local_rmsk/

wget https://s3.amazonaws.com/igv.org.genomes/hg38/rmsk/hg38_rmsk_LTR.bed.gz

zcat hg38_rmsk_LTR.bed.gz | head
chr1	21948	22075	MLT1K	254	+
chr1	30693	30848	MLT1A	741	+
chr1	30952	31131	MLT1A	741	+
chr1	34564	34921	MLT1J2	850	-
chr1	38255	39464	MLT1E1A-int	3877	+
chr1	39464	39623	MLT1E1A	750	+
chr1	39924	40294	MLT1E1A	783	+
chr1	40735	40878	LTR16C	260	-
chr1	41393	42273	ERV3-16A3_I-int	765	-
chr1	42369	42504	MamRep1527	341	+

zcat hg38_rmsk_LTR.bed.gz | wc -l
766019

We can see that the 4th column (name) is repeated. How often?

zcat hg38_rmsk_LTR.bed.gz | awk '{print $4}' | sort | uniq -c | wc -l
595

Very often. This number is very close to that referenced in the https://www.nature.com/articles/s41467-019-13035-2 data.

The bedtools that can be used to convert a bed file to a gtf will make an ugly gtf and treats each line separately and I've found no option to merge them.

bedToGenePred hg38_rmsk_LTR.bed.gz stdout | genePredToGtf file stdin stdout | head
chr1	stdin	exon	21949	22075	.	+	.	gene_id "MLT1K"; transcript_id "MLT1K"; exon_number "1"; exon_id "MLT1K.1";
chr1	stdin	CDS	21949	22075	.	+	0	gene_id "MLT1K"; transcript_id "MLT1K"; exon_number "1"; exon_id "MLT1K.1";
chr1	stdin	start_codon	21949	21951	.	+	0	gene_id "MLT1K"; transcript_id "MLT1K"; exon_number "1"; exon_id "MLT1K.1";
chr1	stdin	exon	30694	30848	.	+	.	gene_id "MLT1A"; transcript_id "MLT1A"; exon_number "1"; exon_id "MLT1A.1";
chr1	stdin	CDS	30694	30848	.	+	0	gene_id "MLT1A"; transcript_id "MLT1A"; exon_number "1"; exon_id "MLT1A.1";
chr1	stdin	start_codon	30694	30696	.	+	0	gene_id "MLT1A"; transcript_id "MLT1A"; exon_number "1"; exon_id "MLT1A.1";
chr1	stdin	exon	30953	31131	.	+	.	gene_id "MLT1A_2"; transcript_id "MLT1A_2"; exon_number "1"; exon_id "MLT1A_2.1";
chr1	stdin	CDS	30953	31131	.	+	0	gene_id "MLT1A_2"; transcript_id "MLT1A_2"; exon_number "1"; exon_id "MLT1A_2.1";
chr1	stdin	start_codon	30953	30955	.	+	0	gene_id "MLT1A_2"; transcript_id "MLT1A_2"; exon_number "1"; exon_id "MLT1A_2.1";
chr1	stdin	exon	34565	34921	.	-	.	gene_id "MLT1J2"; transcript_id "MLT1J2"; exon_number "1"; exon_id "MLT1J2.1";

We can see that each name is repeated often.

zcat hg38_rmsk_LTR.bed.gz | awk '{print $4}' | sort | uniq -c | head
    357 ERV24B_Prim-int
    430 ERV24_Prim-int
   6681 ERV3-16A3_I-int
   1115 ERV3-16A3_LTR
   4268 ERVL-B4-int
   8778 ERVL-E-int
    884 ERVL-int
    215 ERVL47-int
      2 EUTREP15
      2 EUTREP16

Not certain why 1 is added to the start position, but bedtools did it so monkey see, monkey do. I'll keep things simple. We don't know all of the other tags and certainly don't need multiple entries for each.

bedToGenePred hg38_rmsk_LTR.bed.gz stdout | \
	awk 'BEGIN{FS=OFS="\t"}{print $2,"source","feature",$4+1,$5,".",$3,".","feature_name \""$1"\""}' \
	> hg38_rmsk_LTR.jake.gtf

head hg38_rmsk_LTR.jake.gtf
chr1	source	feature	21949	22075	.	+	.	feature_name "MLT1K"
chr1	source	feature	30694	30848	.	+	.	feature_name "MLT1A"
chr1	source	feature	30953	31131	.	+	.	feature_name "MLT1A"
chr1	source	feature	34565	34921	.	-	.	feature_name "MLT1J2"
chr1	source	feature	38256	39464	.	+	.	feature_name "MLT1E1A-int"
chr1	source	feature	39465	39623	.	+	.	feature_name "MLT1E1A"
chr1	source	feature	39925	40294	.	+	.	feature_name "MLT1E1A"
chr1	source	feature	40736	40878	.	-	.	feature_name "LTR16C"
chr1	source	feature	41394	42273	.	-	.	feature_name "ERV3-16A3_I-int"
chr1	source	feature	42370	42504	.	+	.	feature_name "MamRep1527"

featureCounts will treat each of these seperately, but in the final csv it will concatenate the chromosome, start, finish and strand fields for each matching label, and it seems to total up the length.

This is irrelevant, as we will remove these fields anyway with a command like ... sed -r 's/\t\S+\t\S+\t\S+\t\S+\t\S+//1' LTR_features.csv > LTR_features.new.csv

Let's do the same for the SVAs.

wget https://s3.amazonaws.com/igv.org.genomes/hg38/rmsk/hg38_rmsk_Retroposon.bed.gz

zcat hg38_rmsk_Retroposon.bed.gz | head
chr1	837098	837180	SVA_D	276	+
chr1	6304393	6306236	SVA_D	9901	-
chr1	6714507	6716330	SVA_D	6435	-
chr1	7600485	7602002	SVA_D	11216	-
chr1	7952097	7953558	SVA_D	5411	+
chr1	8079151	8079209	SVA_F	308	-
chr1	8260046	8260103	SVA_B	251	+
chr1	8260237	8260251	SVA_B	251	+
chr1	8500394	8501159	SVA_D	4299	+
chr1	8501169	8503087	SVA_E	6406	+

zcat hg38_rmsk_Retroposon.bed.gz | wc -l
5882

zcat hg38_rmsk_Retroposon.bed.gz | awk '{print $4}' | sort | uniq -c | wc -l
6

zcat hg38_rmsk_Retroposon.bed.gz | awk '{print $4}' | sort | uniq -c
   1147 SVA_A
    873 SVA_B
    527 SVA_C
   1583 SVA_D
    709 SVA_E
   1043 SVA_F

bedToGenePred hg38_rmsk_Retroposon.bed.gz stdout | \
	awk 'BEGIN{FS=OFS="\t"}{print $2,"source","feature",$4+1,$5,".",$3,".","feature_name \""$1"\""}' \
	> hg38_rmsk_Retroposon.jake.gtf

And concatenate them.

cat hg38_rmsk_LTR.jake.gtf hg38_rmsk_Retroposon.jake.gtf > hg38_rmsk_LTR_Retroposon.jake.gtf

~/.local/bin/featureCounts.bash \
	-T 16 -a hg38_rmsk_LTR_Retroposon.jake.gtf \
	-t feature -g feature_name \
	-o LTR_SVA_features.csv *.STAR.hg38.Aligned.out.bam"

tail -n +2 LTR_SVA_features.csv | sed -r 's/\t\S+\t\S+\t\S+\t\S+\t\S+//1' > LTR_SVA_features.new.csv

tail -n +2 LTR_SVA_features.new.csv | head -3
Geneid	Length	s001	s002	s003	s004	s005	s006	s007	s008	s009	s010	s012	s013	s014	s015	s017	s018	s019	s020	s021	s023	s024	s025	s026	s028	s029	s033	s034	s036	s037	s038	s040	s041	s043	s044	s045	s046	s047	s048	s049	s051	s052	s053	s054	s055	s056	s057	s058	s059	s060	s061	s062	s063	s064	s065	s067	s068	s069	s070	s071	s072	s073	s074	s075	s076	s077	s079	s080	s083	s084	s086	s087	s089	s090	s091	s092	s093	s094	s095	s096	s097	s098	s099	s100	s101	s102	s103	s104	s105	s106	s107	s108	s110	s111	s112	s113	s114	s115	s116	s117	s119	s120	s122	s123	s124	s125	s126	s127	s128	s129	s130	s132	s133	s134	s135	s137	s139	s141	s142	s143	s144	s145	s146	s147	s150	s151	s152	s153	s155	s156	s157	s158	s159	s160	s161	s162	s163	s164	s165	s166	s167	s168	s169	s170	s171	s173	s174	s175	s176	s178	s179	s180	s181	s182	s183	s184	s185	s186	s187	s188	s189	s190	s191	s192	s193	s194	s197	s198	s199	s200	s201	s202	s203	s204	s205	s206	s207	s208	s209	s210	s212	s213	s214	s215	s216	s217	s218	s219	s221	s222	s224	s225	s226	s227	s228	s230	s231	s232	s233	s234	s235
MLT1K	4019453	9640	9010	3696	7187	10878	9041	7289	8477	5774	13994	5816	7426	7658	4594	8486	8255	7251	7668	9505	5314	7398	12797	22960	21627	10011	14769	12159	10297	20500	14047	13519	9024	6693	12407	10455	7892	10178	6636	6701	7779	9792	4663	14697	7834	14427	18964	8924	11214	13611	10872	13581	12352	10580	10295	10975	9656	12285	11883	9449	12038	5887	9650	12081	12676	13176	7519	12732	14170	17	6865	11296	10603	7306	4324	6502	9867	10886	11110	11116	7892	11121	8497	6385	10998	15508	10145	10809	6531	7929	7803	9626	6676	13075	10131	12508	7071	12362	4103	3157	19452	10130	11369	10013	9630	11451	11249	12194	12621	15438	12420	12489	9246	8122	11109	11584	9588	10635	13896	10030	10813	9526	7544	5286	11396	14107	11565	10823	9976	10594	7510	9911	8115	10597	10700	11082	10239	7615	14911	12322	11079	13173	7989	11624	6317	6997	8523	6162	9003	8064	5585	9527	10777	6695	8448	7201	5105	9771	8022	7298	6943	8037	9153	5036	8079	7981	8515	8149	8520	7842	7281	7422	6639	15129	7131	7581	8095	7290	5797	6955	7210	6247	7391	9310	6175	7719	5774	7071	9120	8075	6132	5989	4423	12107	5753	5789	8439	6980	7364	8890	7691
MLT1A	2578604	4394	7521	1985	4716	5274	3781	5315	5127	2753	7936	3169	4135	2901	2435	3532	4421	3202	2613	4279	2896	3149	7146	9837	9436	4684	8012	5858	6137	9765	8081	7053	5043	3153	5615	4269	3122	4433	3324	3597	3952	3957	1961	4747	4039	6844	11073	4802	5579	7442	6382	8316	4318	5566	4621	5558	3878	5377	5098	4993	5665	2857	5171	4626	5497	6030	3975	6740	5815	4	2837	4890	4406	3631	2438	4473	4213	3073	7327	6098	3932	5841	4333	2873	6365	7272	4325	6434	3266	3838	3132	5314	3531	5123	5003	5673	2771	5554	3286	4190	10526	4441	4965	3891	5936	5475	6649	6037	5654	7106	6082	4670	4309	3291	4441	6176	4169	3831	7897	4413	6552	3889	3755	2256	4813	7413	4937	4373	5752	4846	3091	4083	3929	5319	5127	6424	3985	3453	6443	6046	5017	6349	3216	4327	2740	3442	3679	2975	4413	3815	2039	4493	5107	3049	3854	3284	3146	4055	3907	3579	2856	4160	4288	2016	3870	3894	3331	3944	3863	3339	3654	2955	2963	8319	3015	3160	4382	3879	2977	4190	2919	2882	2806	4630	2170	3638	2255	3150	3814	3086	3373	3257	2335	4850	2561	2592	3895	3032	2919	3504	3597