Database for Annotation, Visualization and Integrated Discovery (DAVID) - an overview
Before getting started, remember to be signed on to the DNAnexus GOLD environment.
Lesson 17 review
In the previous class, we got an overview of functional and pathway analysis, which help to put RNA sequencing results into biological context. We were introduced to tools that could help us perform these analyses.
Learning objectives
This lesson will provide an overview of the Database for Annotation, Visualization and Integrated Discovery (DAVID). We will:
- Provide some background on DAVID, including
- what it does
- statistical methods that it uses
- some expected outputs
- data size limits
- how to get help
- Talk about input for DAVID
- Run an example and interpret results
Background on DAVID
What does DAVID do?
This tool was created and is maintained by the Laboratory of Human Retrovirology and Immunoinformatics at Frederick National Lab.
DAVID is used for functional analysis. Given an input gene list, DAVID will inform us of the following.
- Whether genes in the input gene list are associated with diseases, with links out to resources such as NCBI's MedGen
- Molecular functions that genes perform
- Biological pathways in which genes participate
- Other annotations (e.g., cellular location, tissue expression) that the genes map onto
Background gene list
DAVID compares the overlap between the user-provided gene list and an annotation with the overlap between a background gene list and the same annotation. It uses the Fisher exact test to determine whether the overlap for the user input is statistically different from what we would observe in the background. See Table 2 in Huang et al., Nature Protocols 2009 for more information on the background gene set. Essentially, the default genome-wide background is appropriate for studies that involve a genome-wide survey. DAVID also provides alternative background gene sets, and users can specify their own.
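Because genes that are missing from the selected background are not used, it can help to check a gene list against a custom background before uploading it. The following is a minimal sketch in plain Python. The function name `split_by_background` and the gene symbols are chosen for illustration only, and DAVID's own identifier matching may behave differently (for example, with respect to identifier type or case).

```python
def split_by_background(gene_list, background):
    """Split gene_list into genes found in the background and genes that are not."""
    background_set = set(background)
    kept = [gene for gene in gene_list if gene in background_set]
    dropped = [gene for gene in gene_list if gene not in background_set]
    return kept, dropped


# Illustrative identifiers only; "NOT_A_GENE" is a deliberately invalid entry.
my_genes = ["TP53", "MDM2", "NOT_A_GENE"]
my_background = ["TP53", "MDM2", "CDKN1A", "BAX"]

kept, dropped = split_by_background(my_genes, my_background)
print("Genes found in the background:", kept)
print("Genes missing from the background:", dropped)
```

Any gene reported as missing is worth investigating before the analysis, since it would not contribute to the enrichment results.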
"...make sure that the genes in your list are found in the background set that you have selected in DAVID otherwise, DAVID will ignore them." -- DAVID FAQ
Basic statistics behind DAVID
Fisher Exact Test
DAVID performs over-representation analysis (ORA) at its core, which aims to find molecular functions, pathways, or other annotations that are enriched in the input gene list. In other words, more genes in the list map onto those molecular functions, pathways, or annotations than we would expect by chance.
With DAVID, we are essentially looking at contingency tables (Figure 1). The example in Figure 1 shows the number of user input genes and background genes (selected from the whole genome) that fall onto a particular pathway (e.g., p53 signaling). But how certain can we be that the number of user input genes that map to the pathway was not observed by random chance? In other words, do user input genes fall onto the pathway more often than expected from the background? DAVID uses the Fisher exact test to help us decide whether what we are observing is due to chance.
Figure 1: Contingency table showing the number of user input genes and background genes from the genome that fall onto a certain pathway. Source: DAVID help documentation.
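To make the contingency-table idea concrete, here is a minimal sketch of a one-sided Fisher exact test written with the Python standard library. The cell layout (rows: in pathway / not in pathway; columns: user list / background) and the counts are assumptions made for illustration, not values read from Figure 1. The second function approximates the more conservative EASE score by removing one gene from the list-hit cell; treat that as an assumption to check against the DAVID documentation rather than as DAVID's exact implementation.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of seeing a or more counts in the top-left cell
    when all row and column totals are held fixed (hypergeometric upper tail).
    """
    row1 = a + b           # genes in the pathway (list + background)
    col1 = a + c           # genes in the user list
    n = a + b + c + d      # all genes in the table
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, i) * comb(n - row1, col1 - i) for i in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


def ease_like_score(a, b, c, d):
    """Conservative variant: drop one list hit before testing (assumed EASE-style)."""
    return fisher_exact_greater(max(a - 1, 0), b, c, d)


# Illustrative counts (assumed): 3 of 300 list genes and 40 of 30,000
# background genes fall on the pathway.
list_in, background_in = 3, 40
list_out, background_out = 297, 29960

print("Fisher exact p-value:", fisher_exact_greater(list_in, background_in, list_out, background_out))
print("EASE-like score:     ", ease_like_score(list_in, background_in, list_out, background_out))
```

The EASE-like value is always at least as large as the plain Fisher p-value, so terms supported by only a handful of genes are penalized the most.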
Below are some resources for you to learn about or review this statistical procedure.
- Fisher exact test from Wikipedia
- Hypergeometric distribution from Wikipedia
- Fisher exact test from Pathway Commons
- Hypergeometric distribution from Pathway Commons
- EASE score
Pathway Commons also provides a statistics primer that discusses those methods that are relevant to pathway analysis.
Pathway Commons Statistics Primer
Multiple Testing Correction
A problem that arises with enrichment analysis is the need to perform multiple statistical tests across many gene sets. In short, the number of type I errors (false positives) increases as the number of tests performed increases -- Pathway Commons multiple testing. To correct for multiple testing when using DAVID, users can choose Bonferroni, Benjamini-Hochberg, or False Discovery Rate (FDR).
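To see what these corrections do to a set of p-values, here is a minimal sketch of the Bonferroni and Benjamini-Hochberg adjustments in plain Python. The function names and the p-values are made up for illustration, and the FDR value that DAVID reports is not reproduced here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is
    # never larger than the adjusted values of higher-ranked p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:        ", bonferroni(raw))
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the stricter of the two, while Benjamini-Hochberg controls the expected proportion of false positives among the terms called significant.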
Reducing Redundancy
"Due to the redundant nature of annotations, Functional Annotation Chart presents similar/relevant annotations repeatedly. It dilutes the focus of the biology in the report. To reduce the redundancy, the Functional Annotation Clustering report groups/displays similar annotations together which makes the biology clearer and more focused..." -- DAVID help documents
DAVID uses the Kappa statistic to measure the level of similarity in genes between annotations and then applies fuzzy heuristic clustering to group similar annotations.
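As a rough illustration of the first step, the sketch below computes Cohen's kappa for two annotation terms, treating each gene in the list as either annotated or not annotated by each term. The function name, gene identifiers, and term memberships are hypothetical, and the fuzzy clustering step that DAVID applies afterwards is not shown.

```python
def kappa(genes_a, genes_b, all_genes):
    """Cohen's kappa for the agreement of two annotation terms over all_genes."""
    universe = set(all_genes)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("all_genes must not be empty")

    both = len(a & b)                          # annotated by both terms
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b       # annotated by neither term

    observed = (both + neither) / n
    # Agreement expected by chance, from each term's annotation frequency.
    chance = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if chance == 1.0:
        # Kappa is undefined here; this sketch treats it as full agreement.
        return 1.0
    return (observed - chance) / (1 - chance)


# Hypothetical gene list and term memberships.
gene_list = [f"g{i}" for i in range(1, 11)]
term_1 = ["g1", "g2", "g3", "g4"]
term_2 = ["g1", "g2", "g3", "g5"]
print("Kappa between term_1 and term_2:", kappa(term_1, term_2, gene_list))
```

Values close to 1 indicate that two terms annotate nearly the same genes, which makes them candidates for the same cluster, while values near or below 0 indicate no more sharing than expected by chance.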
Data size limits
"The goal of DAVID's design is to be able to efficiently upload and analyze a list consisting of <=3000 genes. All DAVID tools have been tested with lists in this range and should return results in a few seconds to no more than a few minutes. If running time is longer than a few minutes, please contact the DAVID Bioinformatic Team for help. Please note that Functional Annotation Clustering and Gene Functional Classification have a 3000 gene limit." -- DAVID FAQ
Getting help
Input data
Results and interpretation
Functional Annotation Chart
"Chart Report is an annotation term focused view which lists annotation terms and their associated genes under study." -- DAVID help documents
- GO terms
- UniProt
- KEGG pathways
Functional Annotation Clustering
Functional Annotation Table
"Provides a gene-centric view which lists the genes and their associated annotation terms..." -- DAVID help documents
See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4717906/ for the difference between pathway and network analysis.
DAVID publications: https://pubmed.ncbi.nlm.nih.gov/35325185/ http://www.ncbi.nlm.nih.gov/pubmed/19131956?dopt=Abstract
- Disease
- Functional annotations
- Gene ontology
- General annotations
- Interactions
- Literature
- Pathways
- Protein domains
- Tissue expression
DAVID uses ORA at its core, which is why it only requires a non-ranked gene list.
Pathway Guide statistics primer
One class of enrichment analysis methods seek to identify those gene sets that share an unusually large number of genes with a list derived from experimental measurements. -- Pathway Guide Fisher Exact primer