Skip to the content.

REdiscoverTE

https://github.com/ucsffrancislab/REdiscoverTE

https://www.nature.com/articles/s41467-019-13035-2

http://research-pub.gene.com/REdiscoverTEpaper/

http://research-pub.gene.com/REdiscoverTEpaper/data/

http://research-pub.gene.com/REdiscoverTEpaper/data/REdiscoverTEdata_README.html

http://research-pub.gene.com/REdiscoverTEpaper/data/REdiscoverTEdata_1.0.1.tar.gz

http://research-pub.gene.com/REdiscoverTEpaper/software/

http://research-pub.gene.com/REdiscoverTEpaper/software/REdiscoverTE_README.html

http://research-pub.gene.com/REdiscoverTEpaper/software/REdiscoverTE_1.0.1.tar.gz

zcat REdiscoverTE/rollup_annotation/REdiscoverTE_whole_transcriptome_hg38.fa.gz > REdiscoverTE.fa
salmon index \
	-t REdiscoverTE.fa \
	--threads 64 \
	-i REdiscoverTE
for f in ${DIR}/???.fastq.gz ; do

	echo $f

	base=${f%.fastq.gz}
	echo $base

	echo "salmon quant --seqBias --gcBias \
		--index REdiscoverTE \
		--libType A --unmatedReads ${f} \
		--validateMappings \
		-o ${base}.salmon.REdiscoverTE \
		--threads 8" | \
		qsub -l vmem=64gb \
		-N $(basename $f .fastq.gz) \
		-l nodes=1:ppn=8 \
		-j oe -o ${base}.salmon.REdiscoverTE.out.txt

done

>a=readRDS("rmsk_annotation.RDS")

> head(a)
                               md5 n.loci   repName      repClass     repFamily
1 9bd267abb08d45904c1741db501b5bdd      1 (TAACCC)n Simple_repeat Simple_repeat
2 e98ef316f41469ed5d18179d6f4c246f      1      TAR1     Satellite          telo
3 72834a9eaede5c9cad25de2588a65486      1      MIR3          SINE           MIR
4 accd2fc2b8c6fb63f05975ebeaf89439      1        L3          LINE           CR1
5 7c0d2042077280c3b66ace0a6b875aa0      1       MIR          SINE           MIR
6 a1307bfab38d3c74e299da0da2e6cf5e      1       MIR          SINE           MIR
  selected_feature            idx
1       intergenic 1__10001_10468
2       intergenic 1__10469_11447
3           intron 1__15265_15355
4           intron 1__19972_20405
5           intron 1__23120_23371
6           intron 1__24088_24250


>names(a)
[1] "md5"              "n.loci"           "repName"          "repClass"        
[5] "repFamily"        "selected_feature" "idx"             



> a=readRDS("GENCODE.V26.Basic_Gene_Annotation_md5.RDS")
> names(a)
[1] "md5"           "n.genes"       "transcript_ID" "ensembl_ID"   
[5] "symbol"        "geneID"       
> head(a)
                               md5 n.genes   transcript_ID      ensembl_ID
1 0001d490f52254a5ada6d26ad8880585       1 ENST00000263095 ENSG00000083844
2 0002168cd59867bac9fbab470d7aecf3       1 ENST00000442825 ENSG00000109684
3 0002d0449e073ab67c8a80063a42dbf2       1 ENST00000317538 ENSG00000163867
4 000400e08761c20fdf45ece97312b29e       1 ENST00000635564 ENSG00000236963
5 0005968766852627902d3f8ea965e224       1 ENST00000352065 ENSG00000148737
6 0005a4315808f41bb2ab181627a54afb       1 ENST00000472609 ENSG00000220831
     symbol geneID
1    ZNF264   9422
2      CLNK 116449
3     ZMYM6   9204
4 LINC01141     NA
5    TCF7L2   6934
6  NDUFA5P9     NA

BiocManager::install(c("tibble","readr","dplyr","edgeR","Biobase"),update = TRUE, ask = FALSE)
./rollup.R -h
Usage: Rscript rollup.R [options]
Rollup takes 1-5 minutes per quant.sf file on a 2018 laptop for a quant.sf with 5+ million lines.
USAGE EXAMPLE:
     rollup.R --metadata=./METADATA.tsv    --datadir=./path/to/annotation_dir/  --assembly=hg38 --outdir=Rolled_up_out_dir
 or:
     rollup.R --metadata=./METADATA.tsv    --datadir=./path/to/annotation_dir/  --priorcount=1 --nozero --threads=8 --assembly=hg38 --outdir=Rolled_up_out_dir

 The METADATA.tsv (tab-delimited) file needs two columns: 'sample' and 'quant_sf_path', as in this example:
       sample         quant_sf_path                  <-- These column names are required. Capitalization is irrelevant.
       SOME_ID_1      /full/path/to/ID1/quant.sf.gz
       SOME_ID_2      /full/path/to/ID2/quant.sf.gz  <-- Rollup can handle gzipped files as well.
       SOME_ID_3      /full/path/to/ID3/quant.sf         Gzipping your quant.sf files is recommended.
 
 ABOUT THE OUTPUT FILES:
     * 'genenorm' files: CPM was calculated using the gene-level counts (lib.sizes).
         i.e., samples are normalized by gene-level counts, not the repetitive-element-level (RE-level) counts.
         We make the underlying assumption that gene-level counts are more stable than RE-level counts.
     * 'rle_normed_eset' have the same data as above, but are presented as ExpressionSet (R) objects.
 
RollUp takes one or more 'quant.sf' files from Salmon (each of which contains transcript-level counts for a single sample) and aggregates the results by repetitive element category ('repName').

Options:
	-m METADATA, --metadata=METADATA
		Input metadata .TSV file with two columns, 'quant_sf_path' and 'sample'. See '--help' for details.

	-t THREADS, --threads=THREADS
		Num processor threads to use. Default is 2.

	-a ASSEMBLY, --assembly=ASSEMBLY
		Genome assembly version. Must be 'hg38' for now.

	-d DATADIR, --datadir=DATADIR
		Directory with reference transcriptome annotation data files. This data should have been included with your 'rollup' download.

	-o OUTDIR, --outdir=OUTDIR
		The output directory for Rollup results, which will be created by this program.

	-p PRIORCOUNT, --priorcount=PRIORCOUNT
		Prior count used in edgeR::cpm(...) calculation. Default is 5.

	-v, --verbose
		Prints verbose messages. Useful for debugging.

	-z, --nozero
		Removes any elements (rows) with zero counts. Note: this causes .RDS output files to have differing numbers of rows.

	-h, --help
		Show this help message and exit


echo -e "sample\tquant_sf_path" > ${DIR}/REdiscoverTE.tsv
ls -1 ${DIR}/trimmed/*REdiscoverTE/quant.sf \
		| awk -F/ '{split($8,a,".");print a[1]"\t"$0}' \
		>> ${DIR}/REdiscoverTE.tsv

echo "rollup.R \
		--metadata=${DIR}/REdiscoverTE.tsv \
		--datadir=REdiscoverTE/rollup_annotation/ \
		--nozero --threads=64 --assembly=hg38 \
		--outdir=${DIR}/REdiscoverTE_rollup/" | \
		qsub -l vmem=500gb -N rollup \
		-l nodes=1:ppn=64 \
		-j oe -o ${DIR}/REdiscoverTE_rollup.out.txt

I started rolling up in smaller chunks and then merging. This alleviates memory issues.

20220303-FluPaper/20220707-REdiscoverTE/REdiscoverTE_rollup.bash

20220303-FluPaper/20220707-REdiscoverTE/REdiscoverTE_rollup_merge.bash

And I am running the rmarkdown EdgeR script.

20220303-FluPaper/20220707-REdiscoverTE/REdiscoverTE_EdgeR_rmarkdown.bash