Database for Annotation, Visualization and Integrated Discovery (DAVID) - An Overview

Lesson 15 review

In Lesson 15, we learned about functional enrichment, pathway analysis, and related concepts. These analyses help us put RNA sequencing results into biological context by informing us of the biomolecular pathways, biological functions, cellular localities, etc. of genes in our study.

Learning objectives

This lesson will provide an overview of the Database for Annotation, Visualization and Integrated Discovery (DAVID). DAVID is one of many tools that can be used to perform functional enrichment analysis.

In this lesson, we will

Learn more about DAVID
- Available tools
- Underlying statistical methods
- Some expected outputs
- Data size limits
- How to get help
Run an example and interpret results

Potential Files of Interest:

Files used in this tutorial or related to files used in this tutorial include:

Up-regulated gene list - up_logfold3.txt
Background gene list - background_expressed.txt
Differential Expression Results - hcc1395_deg.csv

Background on DAVID

What does DAVID do?

DAVID is a popular bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. It consists of a comprehensive knowledgebase and a set of functional analysis tools. - Sherman et al. 2022

Over 40 functional categories from dozens of independent public sources (databases) are collected and integrated into the DAVID Knowledgebase. - https://davidbioinformatics.nih.gov/helps/knowledgebase/DAVID_gene.html#coverage

DAVID was created and is maintained by the Laboratory of Human Retrovirology and Immunoinformatics at Frederick National Lab. As of 23 July 2024, it has been cited in 72,287 papers since its debut in 2003.

Tools available in DAVID include:

Functional Annotation Clustering
Functional Annotation Chart
Functional Annotation Table
Gene Functional Classification
Gene ID Conversion
Gene Name Batch Viewer
Ortholog Tool

With DAVID, we can:

Identify enriched biological themes, particularly GO terms.
Discover enriched functional-related gene groups.
Cluster redundant annotation terms.
Visualize genes on BioCarta & KEGG pathway maps.
Display many-genes-to-many-terms relationships in 2D.
Search for functionally related genes not in the list.
List interacting proteins.
Explore gene names in batch mode.

Note

DAVID is not just for gene lists associated with humans; it includes annotations relevant to thousands of species.

Basic statistics behind DAVID

Fisher Exact Test

DAVID performs over-representation analysis (ORA) at its core, which aims to find enriched molecular functions, pathways, or other annotations represented by the input gene list.

With DAVID, we are essentially looking at contingency tables (Figure 1). The example in Figure 1 shows the number of user input genes and background genes (selected from the whole genome) that fall onto a particular pathway (e.g., p53 signaling). We want to know how likely it is that the observed number of user input genes mapping to a given pathway would arise by random chance. In other words, do user input genes fall onto the pathway more often than expected given the background? DAVID uses a modified Fisher's exact test to determine whether a pathway is over-represented in our gene list.

DAVID reports the resulting modified p-value as the EASE score. The EASE score puts a more conservative spin on Fisher's exact test by subtracting 1 (the default) from the left-hand side of our contingency table, that is, from the count of user input genes that map to the pathway.

Contingency Table

Figure 1: Contingency table showing the number of user input genes and background genes from the genome that fall onto a certain pathway. DAVID help documentations
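As a rough illustration of the test described above, the sketch below computes a one-sided Fisher's exact p-value from the four cells of a contingency table, along with an EASE-style variant that removes one gene from the list-hit cell before testing. The function names and the counts (3 of 300 list genes and 40 of 30,000 background genes on the pathway) are assumptions chosen for illustration. This is not DAVID's own code.

```python
from math import comb


def fisher_right_tail(list_hits, list_misses, bg_hits, bg_misses):
    """One-sided Fisher's exact p-value: P(at least `list_hits` list genes on the pathway)."""
    total = list_hits + list_misses + bg_hits + bg_misses
    pathway_total = list_hits + bg_hits    # row total: genes on the pathway
    list_total = list_hits + list_misses   # column total: genes in the user list
    max_hits = min(pathway_total, list_total)
    # Sum the hypergeometric probabilities of every table at least as extreme.
    favourable = sum(
        comb(pathway_total, k) * comb(total - pathway_total, list_total - k)
        for k in range(list_hits, max_hits + 1)
    )
    return favourable / comb(total, list_total)


def ease_score(list_hits, list_misses, bg_hits, bg_misses):
    """EASE-style p-value: remove one gene from the list-hit cell, then run the same test."""
    if list_hits < 1:
        return 1.0
    return fisher_right_tail(list_hits - 1, list_misses, bg_hits, bg_misses)


# Illustrative counts (assumed): 3 of 300 list genes, 40 of 30,000 background genes.
cells = (3, 297, 40, 29960)
print(f"Fisher exact p = {fisher_right_tail(*cells):.4f}")
print(f"EASE score     = {ease_score(*cells):.4f}")
```

Because one supporting gene is removed before testing, the EASE-style value is always at least as large as the plain Fisher p-value, which penalizes terms supported by only a handful of genes.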

Background gene list

DAVID compares the overlap between a user-provided gene list and an annotation with the overlap between a background gene list and the same annotation. So what is an appropriate background gene list?

Ideally, a background gene list should represent "the ‘universe’ of possible genes that could be called as significantly regulated in the experiment" (Timmons et al. 2015). In RNA-Seq this would not be the whole genome, but rather the genes that were expressed. A background gene list representing the whole genome would be more appropriate for experiments surveying the entire genome (e.g., genetic variation experiments). See Wijesooriya et al. (2022) for a more detailed discussion.

The default background gene list in DAVID is the whole genome; however, you also have the option to upload your own background gene list.

A word of caution.

If you do use a custom background set, "make sure that the genes in your list are found in the background set that you have selected in DAVID; otherwise, DAVID will ignore them." -- DAVID FAQ
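Before uploading, it can help to check which list genes are missing from the custom background. The following is a minimal sketch that assumes both lists use the same identifier type and exact string matching. The function name and gene symbols are illustrative only.

```python
def split_by_background(gene_list, background):
    """Return (genes found in the background, genes missing from it)."""
    background_set = set(background)
    kept = [g for g in gene_list if g in background_set]
    dropped = [g for g in gene_list if g not in background_set]
    return kept, dropped


# Example identifiers (assumed for illustration).
genes = ["TP53", "BRCA1", "SEPTIN5"]
background = ["TP53", "BRCA1", "MYC", "EGFR"]

kept, dropped = split_by_background(genes, background)
print("kept:", kept)
print("not in background:", dropped)
```

Any gene reported as missing would be ignored in the analysis, so it is worth resolving identifier mismatches first.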

Below are some resources for you to learn about or review this statistical procedure.

Multiple Testing Correction

A problem that arises with enrichment analysis is the need to perform multiple statistical tests across many gene sets. In short, the number of type I errors or false positives increases as the number of tests performed increases -- Pathway Commons multiple testing. With DAVID, users can choose among Bonferroni, Benjamini-Hochberg, and False Discovery Rate (FDR) corrections for multiple testing.
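To make the idea concrete, here is a minimal sketch of two of the named corrections, Bonferroni and Benjamini-Hochberg, written in plain Python. The function names and example p-values are assumptions for illustration. The code shows the textbook procedures, not DAVID's implementation.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Example p-values (assumed for illustration).
raw = [0.001, 0.008, 0.039, 0.041, 0.9]
print("Bonferroni:        ", bonferroni(raw))
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the stricter of the two. Benjamini-Hochberg controls the expected proportion of false positives among the terms called significant.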

Reducing Redundancy

The Functional Annotation Clustering tool can be used to reduce the redundancy evident in the Functional Annotation Chart results. See more below.

Briefly, DAVID uses the Kappa statistic to measure the level of similarities in genes between annotations and then applies fuzzy heuristic clustering to cluster groups of similar annotations.
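The kappa calculation can be sketched as follows, treating each annotation term as a set of genes and scoring agreement in membership across a shared gene universe. This is a generic Cohen's kappa under assumed inputs (the function name and toy gene IDs are hypothetical). DAVID's exact implementation may differ in detail.

```python
def kappa(term_a, term_b, all_genes):
    """Cohen's kappa for gene membership in two annotation terms."""
    a, b = set(term_a), set(term_b)
    n = len(set(all_genes))
    if n == 0:
        raise ValueError("the gene universe must not be empty")
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Agreement expected by chance, from the marginal membership rates.
    expected = (
        (both + only_a) * (both + only_b)
        + (neither + only_b) * (neither + only_a)
    ) / n**2
    if expected == 1:
        return 1.0  # both terms contain all genes or no genes
    return (observed - expected) / (1 - expected)


# Toy universe of 10 genes and two overlapping terms (assumed).
universe = [f"g{i}" for i in range(10)]
term_a = {"g0", "g1", "g2", "g3"}
term_b = {"g0", "g1", "g2", "g4"}
print(round(kappa(term_a, term_b, universe), 3))
```

Values near 1 mean the two terms annotate nearly the same genes, while values near 0 mean the overlap is about what chance alone would produce.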

Data Size Limits

DAVID works best with gene lists of 3,000 genes or fewer, and the Functional Annotation Clustering and Gene Functional Classification tools have a 3,000-gene limit.

Getting help

DAVID question and forum

Contact the DAVID Bioinformatics Team via email

DAVID FAQ

DAVID help documentation

DAVID quick start tutorial

Starting an analysis in DAVID

USE GOOGLE CHROME TO INTERACT WITH DAVID

Click on the Start Analysis button to initiate an analysis. This will take us to the Analysis Wizard.

DAVID Step 1

Supplying input

Tasks to do at the Analysis Wizard:

Provide an input gene list (either copy paste or upload as a text file)
Specify the gene identifier type. Gene identifiers can be gene symbols, Ensembl IDs, Entrez IDs, GenBank IDs, RefSeq IDs, etc.

Note

DAVID only recognizes “Official Gene Symbol” based on the latest update. Therefore, name changes for "Official Gene Symbols" (e.g., SEPT5 -> SEPTIN5) can impact results.

Specify whether the input gene list is the "gene list" or "background gene list".
Submit the list for analysis

Here, we will provide the genes that are upregulated in the Hcc1395 Tumor samples. However, we will not be using results obtained in Module 2, as these were derived from filtered fastqs, which only included reads from Chromosome 22. Instead, we will use up-regulated genes from non-filtered fastqs described here.

Note

These samples were quickly processed using the CCBR pipeliner workflow Renee. The counts file was obtained without checking any of the QC reports to save time.

Up-regulated genes (3,008 total) were obtained by filtering differential expression results obtained from DESeq2 for log2 fold change values greater than or equal to 3 with a false discovery rate of less than or equal to 0.01. This choice was arbitrary, and was primarily driven by DAVID input thresholds. You can download the gene list here.
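The filter described above can be sketched with the standard library. The column names used here ("gene", "log2FoldChange", "padj") and the toy rows are assumptions. The actual layout of hcc1395_deg.csv may differ, so adjust the names to match your file.

```python
import csv
import io


def select_upregulated(rows, min_lfc=3.0, max_padj=0.01):
    """Return gene IDs with log2 fold change >= min_lfc and adjusted p-value <= max_padj."""
    selected = []
    for row in rows:
        try:
            lfc = float(row["log2FoldChange"])
            padj = float(row["padj"])
        except ValueError:
            continue  # skip rows with missing values such as "NA"
        if lfc >= min_lfc and padj <= max_padj:
            selected.append(row["gene"])
    return selected


# Toy differential expression table (made up for illustration).
toy_csv = """gene,log2FoldChange,padj
GENE_A,4.2,0.0001
GENE_B,3.0,0.01
GENE_C,5.1,0.2
GENE_D,-3.5,0.00001
GENE_E,3.8,NA
"""

up_genes = select_upregulated(csv.DictReader(io.StringIO(toy_csv)))
print(up_genes)
```

Writing the selected IDs one per line to a text file gives a list that can be pasted or uploaded in the Analysis Wizard.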

A background gene list including genes with greater than 0 expression can be found here.

background gene list

For simplicity, I am using the default background gene list in DAVID. However, see the above note regarding appropriate gene lists. You can use the file above (background_expressed.txt) to get an idea concerning how your selection of a background can affect the results.

Results and interpretation

Annotation Results Summary

Once we have submitted our gene list for analysis, DAVID takes us to the Annotation Summary Results page. Here, we will find a summary of all of the annotations in DAVID associated with our input gene list.

This page confirms the name of our gene list (up), the background gene list that we are using (Homo sapiens), and the number of IDs in our list. We can navigate the Annotation Summary Results page to gain different insights into our data.

Here we can select and explore multiple classes of annotation categories including GO terms, protein-protein interactions, protein functional domains, disease associations, bio-pathways, sequence general features, functional summaries, tissue expression, literature, etc.

From the Annotation Summary Results, we can see information about the number of genes in our list involved in a given category. The annotations for these genes can be viewed by selecting the horizontal bars.

A functional annotation chart report can be viewed for individual annotation categories by selecting "Chart", or for combined categories (see Functional Annotation Chart for more details on combined reports).

David Analysis

To better understand our results, let's take a closer look at the disease annotations.

Disease Annotations

DAVID pulls disease annotations from different sources. Clicking on the Chart button will take us to a chart view showing the disease records from a given database to which our input genes map.

DISGENET
GAD_DISEASE (GAD = Genetic Association Database)
GAD_DISEASE_CLASS
OMIM
UniProt - UP_KW_DISEASE where KW stands for keyword

DISGENET

Disease Annotation - chart view

In the chart view, we are presented with the disease(s) found in a particular database that our input genes map to. Notice that we can sort this list by gene count, percentage, p-value, or adjusted p-value. The chart view shows us the results of our modified Fisher's exact test. More on this later.

Clicking on one of the disease terms in the list will provide a description of the term. The source of this description will vary depending on the category selected. For example, terms under DISGENET send us to NCBI's MedGen.

Note

The chart view for some annotation databases may have few or no records. This is because few or no annotations met the statistical threshold (default = 0.1).

Here, we can also look at related terms by selecting "RT", and we can access a Gene Report by selecting the blue bar.

Related Terms:
Examining the related terms can help you identify related biological processes or terms to get a better idea of the underlying biology. A kappa statistic is used to measure the degree of agreement between participating genes in terms. The closer to 1, the greater the similarity. A value greater than 0.7 indicates strong agreement.

Disease Annotation - link to external resource

In the chart view shown above, we clicked on mammary neoplasm, and this took us to the corresponding record in NCBI's MedGen. MedGen is NCBI's database that contains organized information pertaining to human gene-disease relationships.

Disease Annotation - gene view

If we click on the blue bar next to the chart button, we will be taken to a gene view of the disease terms.

Disease Annotation - gene view results

The gene view lists the disease terms for each gene in the annotation category of interest. This is the Functional Annotation Table.

Other annotations

We see a similar organization of information for other annotation categories. For instance, DAVID pulls information on biomolecular processes, functions, and pathways from several sources such as UniProt, GO, KEGG, Reactome, and WikiPathways.

For Gene Ontology (GO), the GO Direct categories are selected by default. These provide GO mappings directly annotated by the source database (no parent terms included). The user can also opt for all levels from GO or specific levels, with level 1 including broader terms and level 5 including more specific terms. The GO FAT category filters out very broad GO terms based on a measured specificity of each term (not level-specificity).

We can look at results across multiple categories by selecting annotation sources (databases) of interest. By default, DAVID will select annotation sources in red, but these can also be deselected.

Pathway Maps

DAVID includes several pathway databases, and also includes a Pathway Viewer for annotations from KEGG, BioCarta, and WikiPathways. This viewer displays the user's genes on pathway maps.

BioCarta

Pathway information generated by BioCarta is no longer maintained by BioCarta or CGAP but is retained in DAVID.

KEGG Pathway Viewer

DAVID Pathway Viewer: Proteoglycans in Cancer from KEGG

DAVID Pathway Viewer: Pleural mesothelioma from WikiPathways

Functional Annotation Chart

Chart Report is an annotation term focused view which lists annotation terms and their associated genes under study. -- DAVID help documents

As we have seen, you can view the Functional Annotation Chart for specific annotation categories. However, the "Functional Annotation Chart" button provides our enrichment results across multiple categories. Unless you customize which categories (or databases) to include, DAVID will automatically use pre-selected defaults. This view does not eliminate redundancies across annotation categories and databases.

Chart results have to meet certain criteria to be included:

EASE Score Threshold (Maximum Probability, p-value) <= 0.1
Count Threshold (Minimum gene count for an annotation term) >= 2

These thresholds are customizable using "Options".
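The effect of these two thresholds can be sketched as a simple filter over chart rows. The row layout (dictionaries with "term", "count", and "ease" keys) and the example values are assumptions for illustration and do not reflect DAVID's export format.

```python
def filter_chart(rows, max_ease=0.1, min_count=2):
    """Keep rows whose EASE score and gene count pass both thresholds."""
    return [
        row for row in rows
        if row["ease"] <= max_ease and row["count"] >= min_count
    ]


# Hypothetical chart rows.
chart = [
    {"term": "TERM_A", "count": 12, "ease": 0.0004},
    {"term": "TERM_B", "count": 1, "ease": 0.0300},   # fails the count threshold
    {"term": "TERM_C", "count": 5, "ease": 0.2500},   # fails the EASE threshold
    {"term": "TERM_D", "count": 2, "ease": 0.1000},   # passes exactly at both limits
]

for row in filter_chart(chart):
    print(row["term"], row["count"], row["ease"])
```

Tightening `max_ease` or raising `min_count` mirrors what the "Options" panel does to the number of terms reported.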

Customizable Options

You can also add additional columns to the output under "Display". For example, you can look at other methods for adjusting p-values, fold enrichment, Fisher's Exact p-value, etc.

What is "fold enrichment"?

In this context, "fold enrichment" refers to the enrichment of the term in your gene list as compared to the background population of genes. For example, if 40/400 (i.e., 10%) of the genes from your list are involved in "kinase activity" and the background population ratio of genes involved in "kinase activity" is 300/30000 (i.e., 1%), then there is a 10-fold (10%/1%) enrichment of genes from your list involved in "kinase activity" compared to the population.
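The arithmetic in this example can be written as a small helper. The function and argument names are illustrative, not part of DAVID.

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Ratio of the term's frequency in the gene list to its frequency in the background."""
    if list_total == 0 or pop_hits == 0:
        raise ValueError("list_total and pop_hits must be greater than zero")
    # Same as (list_hits / list_total) / (pop_hits / pop_total), rearranged
    # so that only a single division is performed.
    return (list_hits * pop_total) / (list_total * pop_hits)


# Values from the kinase activity example: 40/400 in the list, 300/30000 in the background.
print(fold_enrichment(40, 400, 300, 30000))  # prints 10.0
```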

Functional Annotation Clustering

The Functional Annotation Clustering tool groups similar annotations together to reduce the redundancy seen in the Functional Annotation Chart results. This eases the interpretation of the findings.

The Functional Annotation Clustering integrates the same techniques of Kappa statistics to measure the degree of the common genes between two annotations, and fuzzy heuristic clustering (used in Gene Functional Classification Tool) to classify the groups of similar annotations according to Kappa values. In this sense, the more common genes annotations share, the higher chance they will be grouped together. - DAVID Documentation - Functional Annotation Clustering

For a more in-depth understanding of the methods used, see Huang et al. 2007.

Functional Annotation Clustering Report

What is the Group Enrichment Score?

This is the geometric mean (in -log scale) of the members' p-values in the corresponding annotation cluster. Clusters with more enriched terms will be toward the top with higher group enrichment scores. This score is used to rank biological significance.
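A minimal sketch of this score follows, assuming base-10 logarithms. Taking the mean of the -log10 p-values is equivalent to taking -log10 of their geometric mean. The function name and example p-values are illustrative.

```python
from math import log10


def group_enrichment_score(p_values):
    """Geometric mean of the member p-values, expressed on the -log10 scale."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    return sum(-log10(p) for p in p_values) / len(p_values)


# Example cluster whose member terms have p-values of 0.01 and 0.0001 (assumed).
print(round(group_enrichment_score([0.01, 0.0001]), 3))
```

On this scale, a cluster whose members all have p = 0.05 scores about 1.3, so higher scores correspond to smaller typical p-values.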

Gene-Annotation Association View:
The Gene-Annotation Association Viewer can be used to examine the relationships between genes and annotation terms. Green indicates a term is associated with a given gene. We can use this to look at genes shared across terms.

Gene Term Associations

Modifying Parameters:
We can fine tune how DAVID clusters the annotations using the parameters below.

Adjusting parameters for clustering

"Clustering Stringency (lowest → highest): A high-level single control to establish a set of detailed parameters involved in functional classification algorithms. In general, the higher stringency setting generates less functional groups with more tightly associated genes in each group, so that more genes will be unclustered. The default setting is Medium, which gives balanced results for most cases based on our studies. Customization allows you to control Advanced options." -- DAVID functional classification documentations
"Similarity Term Overlap (any value ≥ 0; default = 4): The minimum number of annotation terms overlapped between two genes in order to be qualified for kappa calculation. This parameter is to maintain necessary statistical power to make the kappa value more meaningful. The higher the value, the more meaningful the result is." -- DAVID functional classification documentations
"Similarity Threshold (any value between 0 to 1; default = 0.35): The minimum kappa value to be considered significant. A higher setting will lead to more genes going unclustered, which leads to a higher quality functional classification result with fewer groups and fewer gene members. Kappa value of 0.3 starts giving meaningful biology based on our genome-wide distribution study. Anything below 0.3 has a good chance to be noise." -- DAVID functional classification documentations
"Initial Group Members (any value ≥ 2; default = 4): The minimum gene number in a seeding group, which affects the minimum size of each functional group in the final cluster. In general, the lower value attempts to include more genes in functional groups, and may generate a lot of small size groups." -- DAVID functional classification documentations
"Final Group Members (any value ≥ 2; default = 4): The minimum gene number in one final group after a 'cleanup' procedure. In general, the lower value attempts to include more genes in functional groups and may generate a lot of small size groups. It cofunctions with previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, the small group will not be displayed and the genes will go unclustered." -- DAVID functional classification documentations
"Multi-linkage Threshold (any value between 0% to 100%; default = 50%): This parameter controls how seeding groups merge with each other, i.e. two groups sharing the same gene members over the percentage will become one group. A higher percentage, in general, gives sharper separation (i.e. it generates more final functional groups with more tightly associated genes in each group). In addition, changing the parameter does not cause additional genes to go unclustered." -- DAVID functional classification documentations

Tip

If you do not understand the parameters, stick to the defaults.

Note

The parameters for the Functional Annotation Clustering tool and Functional Classification Tool are the same. These tools apply the same methods and concepts for clustering.

Functional Annotation Table

Provides a gene-centric view which lists the genes and their associated annotation terms... -- DAVID help documents

Functional Annotation Table Combined

Gene Functional Classification Result

The Functional Classification Tool provides a rapid means to organize large lists of genes into functionally related groups to help unravel the biological content captured by high throughput technologies. - DAVID documentation - Functional Classification

Genes are grouped on the basis of shared annotation terms. Similar to the Functional Annotation Clustering tool, this tool uses the kappa statistics and fuzzy heuristic clustering algorithm. "The fuzziness feature of the agglomeration method allows a gene or term to participate in more than one functional group, better reflecting the true 'multiple-roles' nature of genes" (Huang et al. 2007)

Tools are available to investigate consensus terms, enriched terms, and gene-to-term relationships.

Gene Functional Classification vs. Functional Annotation Clustering:

These are related methods. Gene Functional Classification identifies groups of genes sharing similar biological terms, while Functional Annotation Clustering identifies groups of terms sharing similar genes.

DAVID Orthology

The DAVID Orthology tool is a new tool that allows a user to convert a gene list between species.

DAVID Ortholog can convert a gene list from the species under study to a list of orthologs in a targeted species using OMA and Ensembl ortholog pair information. The ortholog lists converted from lesser to better studied species will provide more annotation information thereby helping the user to further understand the gene list under study. - Sherman et al. 2024

The converted gene list can then be used directly with DAVID functional enrichment tools.