Skip to content

Visualizing Large Scale Genomic Variation copy

Generating VCF Files (Simulated data)

VCF files are produced by running a variant caller on one or more BAM alignment files.

We will download the ebola genome (AF086833) into a "refs" directory, create a "bwa index" for the ebola genome, create a "samtools index" for the ebola genome, and then simulated reads from the ebola genome using "dwgsim".

First, install dwgsim.

conda install dwgsim
DWGSIM is a whole Genome Simulator for Next-Generation Sequencing (https://github.com/nh13/DWGSIM.

Usage (at command line): dwgsim [options]

Then create the directories, download the data, and create the simulated reads.

mkdir variants
cd variants
mkdir -p refs
efetch -db=nuccore -format=fasta -id=AF086833 > refs/AF086833.fa
bwa index refs/AF086833.fa
samtools index refs/AF086833.fa
dwgsim refs/AF086833.fa simulated
Creates:
simulated.bfast.fastq
simulated.bwa.read1.fastq 
simulated.bwa.read2.fastq 
simulated.mutations.txt
simulated.mutations.vcf

Take a look at the output file "simulated.mutations.vcf".

##contig=<ID=AF086833.2,length=18959>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=pl,Number=1,Type=Integer,Description="Phasing: 1 - HET contig 1, #2 - HET contig #2, 3 - HOM both contigs">
##INFO=<ID=mt,Number=1,Type=String,Description="Variant Type: SUBSTITUTE/INSERT/DELETE">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
AF086833.2      140     .       T       A       100     PASS    AF=0.5;pl=2;mt=SUBSTITUTE
AF086833.2      354     .       A       T       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      882     .       C       G       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      1308    .       T       A       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      4590    .       A       C       100     PASS    AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2      5856    .       A       T       100     PASS    AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2      6120    .       C       T       100     PASS    AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2      7979    .       T       G       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      11398   .       T       A       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      13756   .       T       C       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      14173   .       A       G       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      14644   .       T       C       100     PASS    AF=1.0;pl=3;mt=SUBSTITUTE
AF086833.2      15864   .       CA      C       100     PASS    AF=1.0;pl=3;mt=DELETE
AF086833.2      16499   .       T       A       100     PASS    AF=0.5;pl=1;mt=SUBSTITUTE
AF086833.2      17596   .       A       AA      100     PASS    AF=0.5;pl=2;mt=INSERT
We will create a "simulated.bam" file from the simulated reads and index the "bam" file with "samtools index" so we can look at it with IGV.
bwa mem refs/AF086833.fa simulated.bwa.read1.fastq simulated.bwa.read2.fastq | samtools sort > simulated.bam
samtools index simulated.bam

Samtools/BCFTools are utilities for variant calling and manipulating VCFs and BCFs.

See http://samtools.github.io/bcftools/bcftools.html

Samtools/BCFTools can be used to call SNPs.

bcftools mpileup -Ovu -f refs/AF086833.fa simulated.bam > genotypes.vcf
mpileup - multi-way pileup producing genotype likelihoods

-O output type

-v variants only

-u allow undefined tags (do not throw error if undefined tags in the format string, print "." instead

BCFTools can also be used to call variants from the genotypes.vcf file and create "observed-mutations.vcf".

bcftools call -vc -Ov genotypes.vcf > observed-mutations.vcf
call (SNP/indel calling)

-v uncompressed vcf

-c consensus caller (?)

-O output type

Look at simulated.mutations.vcf vs observed-mutations.vcf

Screen Shot 2020-08-10 at 1 40 23 PM


Visualizing Large Scale Genomic Variation

Data Simulation

We'll download a reference sequence from NCBI (AF086833, ebola genome), place it in a refs directory that we create, and create a "bam" index using "bwa".

mkdir -p refs
efetch -db=nuccore -format=fasta -id=AF086833 > refs/reference.fa
bwa index refs/reference.fa

Alignments that contain the variation are the most informative. Retrieve a Python script that simulates sequencing every 500 bp long fragment of a genome exactly once, with no errors, with an Illumina paired-end protocol.

curl -O http://data.biostarhandbook.com/align/perfect_coverage.py

Generate perfect data

cat refs/reference.fa | python perfect_coverage.py

This code produces R1.fq and R2.fq. We will align these reads with "bwa mem" and then create the index file.

bwa mem refs/reference.fa R1.fq R2.fq | samtools sort > perfect.bam
samtools index perfect.bam

Visualize the output in IGV. What do you see?

Screen Shot 2020-07-31 at 12 21 17 PM

Even under ideal conditions, there is a drop off (edge effects) in coverage at the beginning and end of the genome. Screen Shot 2020-07-31 at 12 25 31 PM

What does realistic and useful data look like?

We will simulate reads from the reference genome (reference.fa). 1. Make a copy of the "reference.fa" file called "genome.fa". 2. Modify the "genome.fa" file, generate reads from it and compare to the original "reference.fa".

cp refs/reference.fa genome.fa
wgsim -N 10000 genome.fa r1.fq r2.fq
bwa mem refs/reference.fa r1.fq r2.fq | samtools sort > imperfect.bam
samtools index imperfect.bam

Screen Shot 2020-07-31 at 12 34 16 PM

FreeBayes variant caller

freebayes -f refs/AF086833.fa simulated.bam > variants2.vcf
GATK (The Genome Analysis Toolkit)

is more than just a variant caller, multiple tools, complicated.

First choice for higher order organisms/ complex genomes (human, mouse, etc)

https://github.com/broadinstitute/gatk/releases download zip file (gate-4.1.8.1.zip) click on gatk folder

Main website: https://software.broadinstitute.org/gatk/

conda install gatk  DOES NOT WORK
gatk-register downloaded-stuff DOES NOT WORK
Picard is a set of command line tools for manipulating NGS data formats SAM/BAM/CRAM/VCF. https://github.com/broadinstitute/picard First create a Dictionary Index.
picard CreateSequence Dictionary REFERENCE=refs/AF086833.fa OUTPUT=refs/AF086833.dict

Then run the GATK HaplotypeCaller.

gatk HaplotypeCaller -R refs/AF086833.fa -I $BAM -O variants3.vcf