Visualizing Large Scale Genomic Variation copy
Generating VCF Files (Simulated data)
VCF files are produced by running a variant caller on one or more BAM alignment files.
We will download the ebola genome (AF086833) into a "refs" directory, create a "bwa index" for the ebola genome, create a "samtools index" for the ebola genome, and then simulated reads from the ebola genome using "dwgsim".
First, install dwgsim.
conda install dwgsim
Usage (at command line): dwgsim [options]
Then create the directories, download the data, and create the simulated reads.
mkdir variants
cd variants
mkdir -p refs
efetch -db=nuccore -format=fasta -id=AF086833 > refs/AF086833.fa
bwa index refs/AF086833.fa
samtools index refs/AF086833.fa
dwgsim refs/AF086833.fa simulated
simulated.bfast.fastq
simulated.bwa.read1.fastq
simulated.bwa.read2.fastq
simulated.mutations.txt
simulated.mutations.vcf
Take a look at the output file "simulated.mutations.vcf".
##contig=<ID=AF086833.2,length=18959>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=pl,Number=1,Type=Integer,Description="Phasing: 1 - HET contig 1, #2 - HET contig #2, 3 - HOM both contigs">
##INFO=<ID=mt,Number=1,Type=String,Description="Variant Type: SUBSTITUTE/INSERT/DELETE">
#CHROM POS ID REF ALT QUAL FILTER INFO
AF086833.2 140 . T A 100 PASS AF=0.5;pl=2;mt=SUBSTITUTE
AF086833.2 354 . A T 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 882 . C G 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 1308 . T A 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 4590 . A C 100 PASS AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2 5856 . A T 100 PASS AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2 6120 . C T 100 PASS AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2 7979 . T G 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 11398 . T A 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 13756 . T C 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 14173 . A G 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 14644 . T C 100 PASS AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2 15864 . CA C 100 PASS AF=1.0;pl=3;mt=DELETE
AF086833.2 16499 . T A 100 PASS AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2 17596 . A AA 100 PASS AF=0.5;pl=2;mt=INSERT
bwa mem refs/AF086833.fa simulated.bwa.read1.fastq simulated.bwa.read2.fastq | samtools sort > simulated.bam
samtools index simulated.bam
Samtools/BCFTools are utilities for variant calling and manipulating VCFs and BCFs.
See http://samtools.github.io/bcftools/bcftools.html
Samtools/BCFTools can be used to call SNPs.
bcftools mpileup -Ovu -f refs/AF086833.fa simulated.bam > genotypes.vcf
-O output type
-v variants only
-u allow undefined tags (do not throw error if undefined tags in the format string, print "." instead
BCFTools can also be used to call variants from the genotypes.vcf file and create "observed-mutations.vcf".
bcftools call -vc -Ov genotypes.vcf > observed-mutations.vcf
-v uncompressed vcf
-c consensus caller (?)
-O output type
Look at simulated.mutations.vcf vs observed-mutations.vcf
Visualizing Large Scale Genomic Variation
Data Simulation
We'll download a reference sequence from NCBI (AF086833, ebola genome), place it in a refs directory that we create, and create a "bam" index using "bwa".
mkdir -p refs
efetch -db=nuccore -format=fasta -id=AF086833 > refs/reference.fa
bwa index refs/reference.fa
Alignments that contain the variation are the most informative. Retrieve a Python script that simulates sequencing every 500 bp long fragment of a genome exactly once, with no errors, with an Illumina paired-end protocol.
curl -O http://data.biostarhandbook.com/align/perfect_coverage.py
Generate perfect data
cat refs/reference.fa | python perfect_coverage.py
This code produces R1.fq and R2.fq. We will align these reads with "bwa mem" and then create the index file.
bwa mem refs/reference.fa R1.fq R2.fq | samtools sort > perfect.bam
samtools index perfect.bam
Visualize the output in IGV. What do you see?
Even under ideal conditions, there is a drop off (edge effects) in coverage at the beginning and end of the genome.
What does realistic and useful data look like?
We will simulate reads from the reference genome (reference.fa). 1. Make a copy of the "reference.fa" file called "genome.fa". 2. Modify the "genome.fa" file, generate reads from it and compare to the original "reference.fa".
cp refs/reference.fa genome.fa
wgsim -N 10000 genome.fa r1.fq r2.fq
bwa mem refs/reference.fa r1.fq r2.fq | samtools sort > imperfect.bam
samtools index imperfect.bam
FreeBayes variant caller
freebayes -f refs/AF086833.fa simulated.bam > variants2.vcf
is more than just a variant caller, multiple tools, complicated.
First choice for higher order organisms/ complex genomes (human, mouse, etc)
https://github.com/broadinstitute/gatk/releases download zip file (gate-4.1.8.1.zip) click on gatk folder
Main website: https://software.broadinstitute.org/gatk/
conda install gatk DOES NOT WORK
gatk-register downloaded-stuff DOES NOT WORK
picard CreateSequence Dictionary REFERENCE=refs/AF086833.fa OUTPUT=refs/AF086833.dict
Then run the GATK HaplotypeCaller.
gatk HaplotypeCaller -R refs/AF086833.fa -I $BAM -O variants3.vcf