22. Genomic Variation and Variation Calling copy

This page uses context taken directly from the Biostar Handbook by Istvan Albert.

Remember to activate the class bioinformatics environment.

conda activate bioinfo

Introduction to Genomic Variation

Genomic variations are typically categorized into different classes and are often denoted with a shortened acronym:

SNP, a single nucleotide polymorphism - A change of a single base.
INDEL, an insertion or a deletion - A single base added or removed.
SNV, a single nucleotide variant. A SNPs or an INDEL. The change is still single basepair long.
MNP, a multi-nucleotide polymorphism - Several consecutive SNPs in a block.
MNV, a multi-nucleotide variant - Multiples SNPs and INDELs as a block.
Short variations. Variations (MNVs) that are typically less than 50bp in length.
Large-scale variations. Variations that are larger than 50bp.
Structural variants. Variations on the kilobase scale that involve (one or more) chromosomes.

Single nucleotide polymorphisms (SNPs), or Single Nucleotide Variants (SNVs) are simple nucleotide changes where the variation is implied to occur at some frequency within the population.

In some cases, single nucleotide changes cause disease (known as Mendelian disorders), see OMIM for more information about heritable diseases.

Human reference genomes are produced by sequencing genetic information of multiple individuals. They are not representative of any individual, but a group of individuals.

Ploidy refers to the number of complete chromosomes in a cell or organism, and also represents the number of possible alleles of the DNA.

Genotyping involves determining the genetic sequence of an organism, including the identification of variants.

A haplotype is a group of genes that are inherited together from a parent, implying that they are located on the same chromosomal segment.

Introduction to Variant Calling

Variants are usually called from alignment (BAM) files. Here is the process: 1. Align reads to the reference. 2. Correct and refine alignments (optional). 3. Determine variants from the alignments. 4. Filter the resulting variants for the desired characteristics. 5. Annotate filtered variants.

Commonly used variant callers include: * bcftools * FreeBayes * GATK * VarScan2

VCF (Variant Call Format)

VCF is a data representation format used to describe variations in the genome. VCF files can contain information on any number of samples (thousands). VCF is composed of two sections, a header section and record sections.

VCF header section is located at the beginning of a VCF file.

VCF Poster: http://vcftools.sourceforge.net/VCF-poster.pdf Screen Shot 2020-08-11 at 10 46 58 AM Resources for reading VCF files:

Biostar Handbook: https://www.biostarhandbook.com/vcf.html#vcf

VCF Short Summary: http://www.htslib.org/doc/vcf.html

Variant Calling with BCFTools

First we will retrieve the Ebola reference genome and put it in the "refs" directory. Next we need to create the index for the aligner (bwa index) and for IGV (samtools faidx). Let's create some new directories for our work.

mkdir -p newdir/refs
cd newdir

and then download the reference genome.

efetch -db=nuccore -format=fasta -id=AF086833 > refs/AF086833.fa
bwa index refs/AF086833.fa
samtools faidx refs/AF086833.fa

Now we can retrieve sequencing reads from the ebola genome,

fastq-dump -X 10000 --split-files SRR1553500

and do the alignment.

bwa mem refs/AF086833.fa SRR1553500_1.fastq SRR1553500_2.fastq | samtools sort > SRR1553500.bam
samtools index SRR1553500.bam

Now we'll create the vcf file using BCFTools.

bcftools mpileup -Ovu -f refs/AF086833.fa SRR1553500.bam > genotypes.vcf

Where...

mpileup - multi-way pileup producing genotype likelihoods

-O output type

-v variants only

-u allow undefined tags (do not throw error if undefined tags in the format string, print "." instead)

-f indicates the faidx-indexed reference file is in the FASTA format

We will also use BCFTools to call variants from the genotypes.vcf file.

bcftools call --ploidy 1 -vm -Ov genotypes.vcf > variants1.vcf

call (SNP/indel calling)

-v variants only

-m multiallelic-caller

-O output type

Take a look at variants1.vcf with IGV.

Screen Shot 2020-08-13 at 10 32 37 AM

where:

grey = reference sequence

dark blue = heterozygous

cyan = homozygous variant

Scroll to reveal variant information in the pop-up window. Also see http://software.broadinstitute.org/software/igv/viewing_vcf_files

Multi-sample Variant Calling

Multi-sample variant calling is an extension of single sample variant calling, where all samples are evaluated together, and the presence or absence of a variant is shown for all samples in parallel.

We will perform multi-sample variant calling by calling the variants simultaneously on several BAM files using BCFTools.

First we will get data for other SRR numbers within the ebola project, and create BAM files for each. Do this for:

SRR1972918

SRR1972919

SRR1972920

mkdir ebola
cd ebola
fastq-dump -X 10000 --split-files SRRnumber 
bwa mem refs/AF086833.fa SRRnumber_1.fastq SRRnumber_2.fastq | samtools sort > SRRnumber.bam

where SRRnumber is one of the above SRR numbers.

With the BAM files in hand, we can run BCFTools to get genotypes and variants.

bcftools mpileup -Ovu -f refs/AF086833.fa bams/*.bam > many_genotypes.vcf
bcftools call --ploidy 1 -vm -Ou many_genotypes.vcf | bcftools norm -Ov -f ../refs/AF086833.fa -d all > many_variants.vcf

and see something like this on IGV. For more information on viewing VCF files in IGV, see http://software.broadinstitute.org/software/igv/viewing_vcf_files

Screen Shot 2020-08-10 at 2 32 21 PM

Variant Normalization

Variant normalization involves the simplification of variant representation to follow these rules: * Represent the variant with as few letters as possible. * No allele may be zero length. * Variant must be "left aligned"

Some variant callers normalize variants by default, others have an action that performs variant normalization.

bcftools norm -f reference.fa samples.vcf

Variant Filtering

Filtering is done to keep or remove variants to satisfy certain constraints.

Make a new directory and download the example data (VCF file and index). This file contains variants in chromosome 19:400kb-500kb

mkdir hg19
cd hg19
curl -O http://data.biostarhandbook.com/variant/subset_hg19.vcf.gz
curl -O http://data.biostarhandbook.com/variant/subset_hg19.vcf.gz.tbi

You can extract information from VCF files with "BCFTools query" and the "-f" flag, for example.

bcftools query -f '%CHROM %POS %REF %ALT\n' subset_hg19.vcf.gz | head -3
19 400410 CA C
19 400666 G C
19 400742 C T

To extract all variants in a particular region.

bcftools query -r '19:400300-400800' -f '%CHROM\t%POS\t%REF\t%ALT\n' subset_hg19.vcf.gz | head -3
19  400410  CA  C
19  400666  G   C
19  400742  C   T

To print the genotype (GT) of all samples at each position.

bcftools query -f '%CHROM\t%POS[\t%GT\t]\n' subset_hg19.vcf.gz | head -3
19  400410  0|0     0|0     0|0     0|0     0|0     0|0 
19  400666  1|0     0|1     0|1     0|0     0|0     1|1 
19  400742  0|0     0|0     0|0     0|0     0|0     0|0

Variant Annotation and Effect Prediction

Variant annotation means predicting the effects of genetic variants (SNPs, insertions, deletions, copy number variations (CNV) or structural variations (SV)) on the function of genes, transcripts, and protein sequence, as well as regulatory regions.

The variant effects fall into multiple categories:

Correlate the location of the variant with other genomic annotations (e.g., upstream of a transcript, in the coding sequence, in non-coding RNA, in regulatory regions)
List which genes and transcripts are affected by the variant.
Determine consequence of the variant on protein sequence (e.g., stop_gained, missense, stop_lost, frameshift, other property changes)
Match known variants that may have been found in different projects such as the 1000 Genomes Project.

Some potential effects: * stop_gained: A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript * inframe_insertion: An inframe nonsynonymous variant that inserts bases into in the coding sequence * missense_variant: A sequence variant, that changes one or more nucleotides, resulting in a different amino acid sequence but where the length is preserved * protein_altering_variant: A sequence_variant which is predicted to change the protein encoded in the coding sequence

We will use snpEff to annotate genomic variants and predict effects.

mkdir ebolaz
cd ebolaz
snpEff databases > listing.txt
cat listing.txt | grep ebola
ebola_zaire                                                     Ebola Zaire Virus KJ660346.1                                                                                http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ebola_zaire.zip

To use this database, we need to download it. We can look at the database contents with "dump".

snpEff download ebola_zaire
snpEFF dump ebola_zaire | more

First let's download a different strain of Ebola (ebola zaire).

efetch -db=nuccore -format=fasta -id=KJ660346 > refs/KJ660346.fa
bwa index refs/KJ660346.fa
samtools faidx refs/KJ660346.fa

We can run bcftools to find variants. Need to recreate the bam files using reference KJ660346 (SRR1972918, SRR1972919 and SRR1972920). These steps do not call any variants.

bcftools mpileup -Ovu -f refs/KJ660346.fa bams/*.bam > genotypes.vcf

bcftools call --ploidy 1 -vm -Ou ebola_genotypes.vcf | bcftools norm -Ov -f refs/KJ660346.fa -d all > ebola_variants.vcf
snpEff ebola_zaire ebola_variants.vcf > ebola_annotated.vcf

Be sure to take a look at snpEff_summary.html with your web browser.

Screen Shot 2020-08-11 at 3 02 56 PM

Screen Shot 2020-08-11 at 3 28 20 PM

Let's try another variant effect predictor (VEP)

http://useast.ensembl.org/Tools/VEP

Find this file that we downloaded previously (subset_hg19.vcf.gz)

Results:

Screen Shot 2020-08-11 at 3 36 00 PM