22. Genomic Variation and Variation Calling copy
This page uses context taken directly from the Biostar Handbook by Istvan Albert.
Remember to activate the class bioinformatics environment.
conda activate bioinfo
Introduction to Genomic Variation
Genomic variations are typically categorized into different classes and are often denoted with a shortened acronym:
- SNP, a single nucleotide polymorphism - A change of a single base.
- INDEL, an insertion or a deletion - A single base added or removed.
- SNV, a single nucleotide variant. A SNPs or an INDEL. The change is still single basepair long.
- MNP, a multi-nucleotide polymorphism - Several consecutive SNPs in a block.
- MNV, a multi-nucleotide variant - Multiples SNPs and INDELs as a block.
- Short variations. Variations (MNVs) that are typically less than 50bp in length.
- Large-scale variations. Variations that are larger than 50bp.
- Structural variants. Variations on the kilobase scale that involve (one or more) chromosomes.
Single nucleotide polymorphisms (SNPs), or Single Nucleotide Variants (SNVs) are simple nucleotide changes where the variation is implied to occur at some frequency within the population.
In some cases, single nucleotide changes cause disease (known as Mendelian disorders), see OMIM for more information about heritable diseases.
Human reference genomes are produced by sequencing genetic information of multiple individuals. They are not representative of any individual, but a group of individuals.
Ploidy refers to the number of complete chromosomes in a cell or organism, and also represents the number of possible alleles of the DNA.
Genotyping involves determining the genetic sequence of an organism, including the identification of variants.
A haplotype is a group of genes that are inherited together from a parent, implying that they are located on the same chromosomal segment.
Introduction to Variant Calling
Variants are usually called from alignment (BAM) files. Here is the process: 1. Align reads to the reference. 2. Correct and refine alignments (optional). 3. Determine variants from the alignments. 4. Filter the resulting variants for the desired characteristics. 5. Annotate filtered variants.
Commonly used variant callers include: * bcftools * FreeBayes * GATK * VarScan2
VCF (Variant Call Format)
VCF is a data representation format used to describe variations in the genome. VCF files can contain information on any number of samples (thousands). VCF is composed of two sections, a header section and record sections.
VCF header section is located at the beginning of a VCF file.
VCF Poster: http://vcftools.sourceforge.net/VCF-poster.pdf
Resources for reading VCF files:
Biostar Handbook: https://www.biostarhandbook.com/vcf.html#vcf
VCF Short Summary: http://www.htslib.org/doc/vcf.html
Variant Calling with BCFTools
First we will retrieve the Ebola reference genome and put it in the "refs" directory. Next we need to create the index for the aligner (bwa index) and for IGV (samtools faidx). Let's create some new directories for our work.
mkdir -p newdir/refs
cd newdir
efetch -db=nuccore -format=fasta -id=AF086833 > refs/AF086833.fa
bwa index refs/AF086833.fa
samtools faidx refs/AF086833.fa
Now we can retrieve sequencing reads from the ebola genome,
fastq-dump -X 10000 --split-files SRR1553500
bwa mem refs/AF086833.fa SRR1553500_1.fastq SRR1553500_2.fastq | samtools sort > SRR1553500.bam
samtools index SRR1553500.bam
bcftools mpileup -Ovu -f refs/AF086833.fa SRR1553500.bam > genotypes.vcf
Where...
mpileup - multi-way pileup producing genotype likelihoods
-O output type
-v variants only
-u allow undefined tags (do not throw error if undefined tags in the format string, print "." instead)
-f indicates the faidx-indexed reference file is in the FASTA format
We will also use BCFTools to call variants from the genotypes.vcf file.
bcftools call --ploidy 1 -vm -Ov genotypes.vcf > variants1.vcf
-v variants only
-m multiallelic-caller
-O output type
Take a look at variants1.vcf with IGV.
where:
grey = reference sequence
dark blue = heterozygous
cyan = homozygous variant
Scroll to reveal variant information in the pop-up window. Also see http://software.broadinstitute.org/software/igv/viewing_vcf_files
Multi-sample Variant Calling
Multi-sample variant calling is an extension of single sample variant calling, where all samples are evaluated together, and the presence or absence of a variant is shown for all samples in parallel.
We will perform multi-sample variant calling by calling the variants simultaneously on several BAM files using BCFTools.
First we will get data for other SRR numbers within the ebola project, and create BAM files for each. Do this for:
SRR1972918
SRR1972919
SRR1972920
mkdir ebola
cd ebola
fastq-dump -X 10000 --split-files SRRnumber
bwa mem refs/AF086833.fa SRRnumber_1.fastq SRRnumber_2.fastq | samtools sort > SRRnumber.bam
where SRRnumber is one of the above SRR numbers.
With the BAM files in hand, we can run BCFTools to get genotypes and variants.
bcftools mpileup -Ovu -f refs/AF086833.fa bams/*.bam > many_genotypes.vcf
bcftools call --ploidy 1 -vm -Ou many_genotypes.vcf | bcftools norm -Ov -f ../refs/AF086833.fa -d all > many_variants.vcf
and see something like this on IGV. For more information on viewing VCF files in IGV, see http://software.broadinstitute.org/software/igv/viewing_vcf_files
Variant Normalization
Variant normalization involves the simplification of variant representation to follow these rules: * Represent the variant with as few letters as possible. * No allele may be zero length. * Variant must be "left aligned"
Some variant callers normalize variants by default, others have an action that performs variant normalization.
bcftools norm -f reference.fa samples.vcf
Variant Filtering
Filtering is done to keep or remove variants to satisfy certain constraints.
Make a new directory and download the example data (VCF file and index). This file contains variants in chromosome 19:400kb-500kb
mkdir hg19
cd hg19
curl -O http://data.biostarhandbook.com/variant/subset_hg19.vcf.gz
curl -O http://data.biostarhandbook.com/variant/subset_hg19.vcf.gz.tbi
You can extract information from VCF files with "BCFTools query" and the "-f" flag, for example.
bcftools query -f '%CHROM %POS %REF %ALT\n' subset_hg19.vcf.gz | head -3
19 400410 CA C
19 400666 G C
19 400742 C T
bcftools query -r '19:400300-400800' -f '%CHROM\t%POS\t%REF\t%ALT\n' subset_hg19.vcf.gz | head -3
19 400410 CA C
19 400666 G C
19 400742 C T
bcftools query -f '%CHROM\t%POS[\t%GT\t]\n' subset_hg19.vcf.gz | head -3
19 400410 0|0 0|0 0|0 0|0 0|0 0|0
19 400666 1|0 0|1 0|1 0|0 0|0 1|1
19 400742 0|0 0|0 0|0 0|0 0|0 0|0
Variant Annotation and Effect Prediction
Variant annotation means predicting the effects of genetic variants (SNPs, insertions, deletions, copy number variations (CNV) or structural variations (SV)) on the function of genes, transcripts, and protein sequence, as well as regulatory regions.
The variant effects fall into multiple categories:
- Correlate the location of the variant with other genomic annotations (e.g., upstream of a transcript, in the coding sequence, in non-coding RNA, in regulatory regions)
- List which genes and transcripts are affected by the variant.
- Determine consequence of the variant on protein sequence (e.g., stop_gained, missense, stop_lost, frameshift, other property changes)
- Match known variants that may have been found in different projects such as the 1000 Genomes Project.
Some potential effects: * stop_gained: A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript * inframe_insertion: An inframe nonsynonymous variant that inserts bases into in the coding sequence * missense_variant: A sequence variant, that changes one or more nucleotides, resulting in a different amino acid sequence but where the length is preserved * protein_altering_variant: A sequence_variant which is predicted to change the protein encoded in the coding sequence
We will use snpEff to annotate genomic variants and predict effects.
mkdir ebolaz
cd ebolaz
snpEff databases > listing.txt
cat listing.txt | grep ebola
ebola_zaire Ebola Zaire Virus KJ660346.1 http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ebola_zaire.zip
To use this database, we need to download it. We can look at the database contents with "dump".
snpEff download ebola_zaire
snpEFF dump ebola_zaire | more
First let's download a different strain of Ebola (ebola zaire).
efetch -db=nuccore -format=fasta -id=KJ660346 > refs/KJ660346.fa
bwa index refs/KJ660346.fa
samtools faidx refs/KJ660346.fa
bcftools mpileup -Ovu -f refs/KJ660346.fa bams/*.bam > genotypes.vcf
bcftools call --ploidy 1 -vm -Ou ebola_genotypes.vcf | bcftools norm -Ov -f refs/KJ660346.fa -d all > ebola_variants.vcf
snpEff ebola_zaire ebola_variants.vcf > ebola_annotated.vcf

Be sure to take a look at snpEff_summary.html with your web browser.
Let's try another variant effect predictor (VEP)
http://useast.ensembl.org/Tools/VEP
Find this file that we downloaded previously (subset_hg19.vcf.gz)
Results: