Database for Annotation, Visualization and Integrated Discovery (DAVID) - an overview

Before getting started, remember to be signed on to the DNAnexus GOLD environment.

Lesson 17 review

In the previous class, we got an overview of functional and pathway analysis, which help to put RNA sequencing results into biological context. We were introduced to tools that could help us perform these analyses.

Learning objectives

This lesson will provide an overview of the Database for Annotation, Visualization and Integrated Discovery (DAVID). We will:

Provide some background on DAVID, including
- what it does
- statistical methods that it uses
- some expected outputs
- data size limits
- how to get help
Talk about input for DAVID
Run an example and interpret results

Background on DAVID

What does DAVID do?

This tool was created and is maintained by the Laboratory of Human Retrovirology and Immunoinformatics at Frederick National Lab.

DAVID is used for functional analysis. Given an input gene list, DAVID will inform us of the following.

Whether genes in the input gene list are associated with diseases, with links out to resources such as NCBI's MedGen
Molecular functions that genes perform
Biological pathways in which genes participate
Other annotations (e.g., cellular location, tissue expression) that the genes map onto

Background gene list

DAVID compares the overlap between the user-provided gene list and an annotation with the overlap between a background gene list and the same annotation. In effect, DAVID uses the Fisher exact test to determine whether the overlap between the user input and a particular annotation is statistically different from what we would observe in the background. See Table 2 in Huang et al., Nature Protocols 2009 for more information on the background gene set. Essentially, the default genome-wide background is appropriate for studies that involve a genome-wide survey. DAVID also provides custom background gene sets, and users can specify their own.

"...make sure that the genes in your list are found in the background set that you have selected in DAVID otherwise, DAVID will ignore them." -- DAVID FAQ
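Before uploading, it can help to check which genes in a list are missing from the chosen background. The snippet below is a minimal sketch of that check, assuming both lists use the same identifier type and exact, case-sensitive matching. The function name `split_by_background` and the gene symbols (including the made-up `NOT_A_GENE`) are illustrative only and are not part of DAVID.

```python
def split_by_background(gene_list, background):
    """Return (kept, dropped): genes found in the background and genes that are not."""
    background_set = set(background)  # set lookup keeps the check fast for large backgrounds
    kept = [g for g in gene_list if g in background_set]
    dropped = [g for g in gene_list if g not in background_set]
    return kept, dropped


# Hypothetical example data: a tiny background and a user list with one unknown identifier.
background = ["TP53", "MDM2", "CDKN1A", "BAX", "ATM"]
user_genes = ["TP53", "MDM2", "NOT_A_GENE"]

kept, dropped = split_by_background(user_genes, background)
print("In background:", kept)
print("Would be ignored:", dropped)
```

Genes reported in `dropped` are the ones that, per the FAQ quote above, would be ignored in the analysis.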

Basic statistics behind DAVID

Fisher Exact Test

DAVID performs over-representation analysis (ORA) at its core, which aims to find enriched molecular functions, pathways, or other annotations represented by the input gene list. In other words, more genes in the list map onto those molecular functions, pathways, or annotations than we would expect by chance.

With DAVID, we are essentially looking at contingency tables (Figure 1). The example in Figure 1 shows the number of user input genes and background genes (selected from the whole genome) that fall onto a particular pathway (e.g., p53 signaling). But how certain can we be that the number of user input genes that map to the pathway is not observed by random chance? In other words, do user input genes fall onto the pathway more often than expected from the background? DAVID uses the Fisher exact test to help us decide whether what we are observing is due to chance.

Figure 1: Contingency table showing the number of user input genes and background genes from the genome that fall onto a certain pathway. Source: DAVID help documentation.
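To make the contingency-table idea concrete, the sketch below computes a one-sided Fisher exact p-value as the upper tail of the hypergeometric distribution, using only the Python standard library. The counts (3 of 300 list genes and 40 of 30,000 background genes on the pathway) are illustrative values assumed for this sketch, not read from Figure 1. The second function imitates the EASE score idea of removing one gene from the list hits to be more conservative. It is a simplification, and the exact table DAVID builds may differ. All function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(list_hits, list_total, pop_hits, pop_total):
    """P(X >= list_hits) when list_total genes are drawn without replacement
    from pop_total genes, of which pop_hits carry the annotation."""
    denom = comb(pop_total, list_total)
    max_hits = min(list_total, pop_hits)
    total = 0.0
    for k in range(max(list_hits, 0), max_hits + 1):
        # Number of ways to pick k annotated genes and (list_total - k) unannotated genes.
        ways = comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        total += ways / denom
    return total


def ease_like_score(list_hits, list_total, pop_hits, pop_total):
    """Simplified EASE-style score: same tail probability with one list hit removed."""
    return hypergeom_upper_tail(max(list_hits - 1, 0), list_total, pop_hits, pop_total)


# Illustrative counts (assumptions for this sketch).
list_hits, list_total = 3, 300      # user genes on the pathway / user genes in total
pop_hits, pop_total = 40, 30000     # background genes on the pathway / background in total

p_fisher = hypergeom_upper_tail(list_hits, list_total, pop_hits, pop_total)
p_ease = ease_like_score(list_hits, list_total, pop_hits, pop_total)
print(f"Fisher exact (one-sided): {p_fisher:.4f}")
print(f"EASE-like score:          {p_ease:.4f}")
```

With these counts the expected number of list genes on the pathway is 300 × 40 / 30000 = 0.4, so observing 3 is unusual under the Fisher test, while the more conservative EASE-like score is noticeably larger.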

Below are some resources for you to learn about or review this statistical procedure.

Fisher exact test from Wikipedia
Hypergeometric distribution from Wikipedia
Fisher exact test from Pathway Commons
Hypergeometric distribution from Pathway Commons
EASE score

Pathway Commons also provides a statistics primer that discusses those methods that are relevant to pathway analysis.

Pathway Commons Statistics Primer

Multiple Testing Correction

A problem that arises with enrichment analysis is the need to perform multiple statistical tests across many gene sets. In short, the number of type I errors (false positives) increases as the number of tests performed increases -- Pathway Commons multiple testing. To correct for multiple testing when using DAVID, users can choose Bonferroni, Benjamini-Hochberg, or False Discovery Rate (FDR).
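The two named corrections can be sketched in a few lines of plain Python. This is a textbook version of the Bonferroni and Benjamini-Hochberg adjustments, not DAVID's own code, and the p-values are made up for illustration.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical enrichment p-values for four annotation terms.
pvals = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:        ", bonferroni(pvals))
print("Benjamini-Hochberg:", benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two. Benjamini-Hochberg controls the expected proportion of false positives among the terms called significant, so it usually retains more terms.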

Reducing Redundancy

"Due to the redundant nature of annotations, Functional Annotation Chart presents similar/relevant annotations repeatedly. It dilutes the focus of the biology in the report. To reduce the redundancy, the Functional Annotation Clustering report groups/displays similar annotations together which makes the biology clearer and more focused..." -- DAVID help documents

DAVID uses the Kappa statistic to measure the level of similarity in genes between annotations and then applies fuzzy heuristic clustering to group similar annotations.
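As a rough illustration of the similarity measure, the sketch below computes Cohen's kappa between two annotation terms, treating each gene in a shared universe as either a member or a non-member of each term. This is the generic kappa formula under that assumption. The details of how DAVID builds its input and applies its clustering are not reproduced here, and the function name and gene identifiers are hypothetical.

```python
def kappa_between_terms(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement of two terms' gene memberships over a universe."""
    universe = set(universe)
    a_set = set(genes_a) & universe
    b_set = set(genes_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("universe must contain at least one gene")

    both = len(a_set & b_set)        # genes annotated to both terms
    only_a = len(a_set - b_set)      # genes annotated to term A only
    only_b = len(b_set - a_set)      # genes annotated to term B only
    neither = n - both - only_a - only_b

    observed = (both + neither) / n  # observed agreement
    # Agreement expected by chance from the marginal membership rates.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical example: ten genes, two terms sharing three genes.
universe = [f"g{i}" for i in range(1, 11)]
term_a = ["g1", "g2", "g3", "g4"]
term_b = ["g1", "g2", "g3", "g5"]
print(round(kappa_between_terms(term_a, term_b, universe), 3))
```

A kappa near 1 means the two terms annotate nearly the same genes, values near 0 mean the overlap is about what chance would give, and negative values mean less overlap than chance.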

Data size limits

"The goal of DAVID's design is to be able to efficiently upload and analyze a list consisting of <=3000 genes. All DAVID tools have been tested with lists in this range and should return results in a few seconds to no more than a few minutes. If running time is longer than a few minutes, please contact the DAVID Bioinformatic Team for help. Please note that Functional Annotation Clustering and Gene Functional Classification have a 3000 gene limit." -- DAVID FAQ
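A quick pre-upload check against the 3000-gene figure quoted above might look like the following. This is a minimal sketch that assumes one identifier per line. The function name `prepare_gene_list` and the constant `DAVID_GENE_LIMIT` are hypothetical names introduced here.

```python
DAVID_GENE_LIMIT = 3000  # figure taken from the DAVID FAQ quote above


def prepare_gene_list(raw_lines, limit=DAVID_GENE_LIMIT):
    """Strip whitespace, drop blanks and duplicates (keeping first occurrence),
    and report whether the cleaned list fits within the limit."""
    seen = set()
    genes = []
    for line in raw_lines:
        gene = line.strip()
        if gene and gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes, len(genes) <= limit


# Hypothetical input, e.g. lines read from a text file of gene symbols.
raw = ["TP53\n", "MDM2\n", "\n", "TP53\n", " BAX \n"]
genes, within_limit = prepare_gene_list(raw)
print(genes, within_limit)
```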

Getting help

DAVID question and forum

DAVID FAQ

DAVID help documentation

Input data

Results and interpretation

Functional Annotation Chart

Chart Report is an annotation term focused view which lists annotation terms and their associated genes under study. -- DAVID help documents

GO terms
UniProt
- keywords (starts with UP_KW)
- sequence features (UP_SEQ_FEATURE)
KEGG pathways

Functional Annotation Clustering

Functional Annotation Table

Provides a gene-centric view which lists the genes and their associated annotation terms... -- DAVID help documents

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4717906/ for difference between pathway and network analysis

DAVID publications: https://pubmed.ncbi.nlm.nih.gov/35325185/ http://www.ncbi.nlm.nih.gov/pubmed/19131956?dopt=Abstract

Disease, Functional annotations, Gene ontology, General annotations, Interactions, Literature, Pathways, Protein domains, Tissue expression

DAVID uses ORA at its core, which is why it requires only a non-ranked gene list.

Pathway Guide statistics primer

One class of enrichment analysis methods seek to identify those gene sets that share an unusually large number of genes with a list derived from experimental measurements. -- Pathway Guide Fisher Exact primer