
ImputationPrep_AGS_GWAS

Adult Glioma Study GWAS: Imputed on TopMed

Pipeline based on https://imputationserver.readthedocs.io/en/latest/prepare-your-data/

Updated check script: https://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.3.0.zip

More info on https://www.well.ox.ac.uk/~wrayner/tools/

This takes less than an hour.

Start instance

type        vCPUs  Memory (MiB)  Storage (GB)  Storage type  Network performance  On-Demand Linux pricing
i3.xlarge     4       31232           950       NVMe SSD      Up to 10 Gigabit     0.312 USD per Hour
i3.2xlarge    8       62464          1900       NVMe SSD      Up to 10 Gigabit     0.624 USD per Hour
create_ec2_instance.bash --profile gwendt --image-id ami-0323c3dd2da7fb37d --instance-type i3.xlarge --key-name ~/.aws/JakeHervUNR.pem --NOT-DRY-RUN


ip=$( aws --profile gwendt ec2 describe-instances --query 'Reservations[0].Instances[0].PublicIpAddress' | tr -d '"' )
echo $ip
ssh -i /Users/jakewendt/.aws/JakeHervUNR.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ec2-user@$ip

Update instance

sudo yum -y update
sudo yum -y install docker htop

Prepare attached ssd

The instance only has 8GB of root storage by default, which will already be rather full. This data will take up a couple of GB, and Docker-related data will take up over 20GB. Use the SSD for both data and Docker. It's faster too.
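A quick check first (a sketch) shows why the SSD is used for both data and Docker storage:

df -h /    # the small root volume is already mostly used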

lsblk
sudo mkfs -t xfs /dev/nvme0n1
sudo mkdir -p /ssd0/
sudo mount /dev/nvme0n1 /ssd0/
sudo chown ec2-user /ssd0/
mkdir -p /ssd0/data
mkdir -p /ssd0/docker

Download data

aws s3 sync s3://herv-unr/20200520_Adult_Glioma_Study_GWAS/ /ssd0/data/

Set up docker

# Point Docker's data-root at the SSD ( /etc/docker/daemon.json does not exist yet on a fresh instance, so the append creates it )
sudo bash -c 'echo { \"data-root\":\"/ssd0/docker/\" } >> /etc/docker/daemon.json'
sudo service docker start
sudo usermod -a -G docker ec2-user

Exit and reconnect ( apparently needed for the usermod docker group change to take effect )

exit
ssh -i /Users/jakewendt/.aws/JakeHervUNR.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ec2-user@$ip
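After reconnecting, it is worth confirming that both changes took effect (a sketch):

groups                            # should now include docker
docker info | grep -i 'root dir'  # should report /ssd0/docker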

Get the Dockerfile

wget https://raw.githubusercontent.com/ucsffrancislab/genomics/master/docker/ImputationPrep.Dockerfile

Build it

docker build -t impute --file ~/ImputationPrep.Dockerfile ./

Run it in the background

docker run -v /ssd0/data:/data -itd impute

Connect to it

docker exec -it $( docker ps -aq ) bash

Convert ped/map to bed ( not needed here; I already have the bed files and don't have ped/map files. )
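For reference, the conversion would look something like this (a sketch; study_chr1 is a placeholder prefix, not one of the AGS datasets):

plink --file study_chr1 --make-bed --out study_chr1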

Move / Copy / Filter into /data/out directory

JAKE ADDED - filter out chromosome 0 with --not-chr 0. Nevermind; the filtering is unnecessary. Use the *-updated-chr* bed files.
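For the record, the chromosome 0 filter would have looked something like this (a sketch only; as noted above it turned out to be unnecessary):

plink --bfile /data/il370_4677 --not-chr 0 --make-bed --out /data/out/il370_4677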

The files do need to be copied, though, and remaking them with plink cleans them up a bit.

mkdir /data/out/
plink --bfile /data/il370_4677 --make-bed --out /data/out/il370_4677
plink --bfile /data/onco_1347  --make-bed --out /data/out/onco_1347

Create frequency files

plink --freq --bfile /data/out/il370_4677 --out /data/out/il370_4677.freq
plink --freq --bfile /data/out/onco_1347  --out /data/out/onco_1347.freq

Execute the check script ( this failed on my laptop, which ran out of memory at 15GB )

The original version of the check script, while loading the sites tab file, counts up to about 40400000, uses about 22GB of memory, and creates Run-plink.sh.

The new version counts up to 40405506 and uses less than 1GB of memory! This could now be run on a laptop.

cd /data/out

perl /home/HRC-1000G-check-bim.pl -b /data/out/il370_4677.bim -f /data/out/il370_4677.freq.frq -r /home/HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

mv Run-plink.sh Run-plink-il370_4677.sh
sh Run-plink-il370_4677.sh

perl /home/HRC-1000G-check-bim.pl -b /data/out/onco_1347.bim -f /data/out/onco_1347.freq.frq -r /home/HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

mv Run-plink.sh Run-plink-onco_1347.sh
sh Run-plink-onco_1347.sh
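The memory figures above can be reproduced by wrapping one of the check runs with GNU time (a sketch; assumes /usr/bin/time is installed in the container) and reading the "Maximum resident set size" line in its report:

/usr/bin/time -v perl /home/HRC-1000G-check-bim.pl \
  -b /data/out/onco_1347.bim \
  -f /data/out/onco_1347.freq.frq \
  -r /home/HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h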

Create vcf using VcfCooker

Which reference? Does it really matter? It probably needs matching chromosome names.
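One way to check whether the chromosome names line up (a sketch, using files already in place):

grep '^>' /home/hs37d5.fa | head -5                    # hs37d5 names chromosomes 1, 2, ..., X, Y, MT (no chr prefix)
awk '{print $1}' /data/out/il370_4677.bim | sort -u    # chromosome codes in the plink bim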

USE the -updated files.

It seems that the imputation server uses the individual chromosome files, so NOT this ...

vcfCooker --in-bfile /data/out/il370_4677-updated --ref /home/hs37d5.fa --out /data/out/il370_4677.vcf --write-vcf
bgzip /data/out/il370_4677.vcf

vcfCooker --in-bfile /data/out/onco_1347-updated  --ref /home/hs37d5.fa --out /data/out/onco_1347.vcf  --write-vcf
bgzip /data/out/onco_1347.vcf

But something more like this ( where ${s} is the dataset prefix, il370_4677 or onco_1347; the full loop is in the Imputation Server section below ) ...

	for i in $( seq 1 23 ) ; do
		vcfCooker --in-bfile /data/out/${s}-updated-chr${i} \
			--ref /home/hs37d5.fa \
			--out /data/out/${s}-updated-chr${i}.vcf \
			--write-vcf
		bgzip /data/out/${s}-updated-chr${i}.vcf
	done

Exit Docker and copy out data

exit

aws s3 sync /ssd0/data/out/ s3://herv-unr/20200520_Adult_Glioma_Study_GWAS_OUTPUT_$(date "+%Y%m%d")/

Kill Instance

Assuming that this is the ONLY container, stop it and remove it when done. Of course, if you are about to shut down the AWS instance, it's irrelevant.

docker stop $( docker ps -aq )

docker rm $( docker ps -aq )
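If other containers could be running, target this one explicitly instead of using docker ps -aq (a sketch; the name impute_prep is hypothetical and would have had to be given to docker run with --name impute_prep):

docker stop impute_prep
docker rm impute_prep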

sudo shutdown now

Additional Tools

While I don't use these tools, I do use the reference provided with checkVCF. Although, since I haven't seen any results yet, this and anything else could change.

Convert ped/map files to VCF files

Several tools are available: plink2, BCFtools or VcfCooker.

plink --ped study_chr1.ped --map study_chr1.map --recode vcf --out study_chr1

Create a sorted vcf.gz file using BCFtools:

bcftools sort study_chr1.vcf -Oz -o study_chr1.vcf.gz

CheckVCF

Use checkVCF to ensure that the VCF files are valid. checkVCF proposes "Action Items" (e.g. upload to sftp server), which can be ignored. Only the validity should be checked with this command.

checkVCF.py -r human_g1k_v37.fasta -o out mystudy_chr1.vcf.gz

Imputation Server

Redo the preparation, this time processing per chromosome.

set -v
mkdir -p /data/out/
for s in il370_4677 onco_1347 ; do
  plink --bfile /data/${s} --make-bed --out /data/out/${s}
  plink --freq --bfile /data/out/${s} --out /data/out/${s}.freq

  cd /data/out

  perl /home/HRC-1000G-check-bim.pl \
    -b /data/out/${s}.bim \
    -f /data/out/${s}.freq.frq \
    -r /home/HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

  mv Run-plink.sh Run-plink-${s}.sh
  sh Run-plink-${s}.sh

  for i in $( seq 1 23 ) ; do
    vcfCooker --in-bfile /data/out/${s}-updated-chr${i} \
      --ref /home/hs37d5.fa \
      --out /data/out/${s}-updated-chr${i}.vcf \
      --write-vcf
    bgzip /data/out/${s}-updated-chr${i}.vcf
  done

done
aws s3 sync /ssd0/data/out/ s3://herv-unr/20200520_Adult_Glioma_Study_GWAS_OUTPUT_$(date "+%Y%m%d")/

Apparently there is a way to load data into the imputation server directly from S3, but I haven't figured it out yet. Download locally, then upload to the server.
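In the meantime, something like this pulls just the per-chromosome VCFs down to a local machine for upload through the server's web interface (a sketch; the local directory name is arbitrary and YYYYMMDD must match the date used in the sync above):

mkdir -p ~/AGS_imputation_upload
aws --profile gwendt s3 sync s3://herv-unr/20200520_Adult_Glioma_Study_GWAS_OUTPUT_YYYYMMDD/ ~/AGS_imputation_upload/ \
  --exclude '*' --include '*-updated-chr*.vcf.gz'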

Use TopMed's Server

With these options

Data sets must be uploaded separately.