Database for Annotation, Visualization and Integrated Discovery (DAVID) - An Overview
Lesson 15 review
In Lesson 15, we learned about functional enrichment, pathway analysis, and related concepts. These analyses help us put RNA sequencing results into biological context by informing us of the biomolecular pathways, biological functions, cellular localities, etc. of genes in our study.
Learning objectives
This lesson will provide an overview of the Database for Annotation, Visualization and Integrated Discovery (DAVID). DAVID is one of many tools that can be used to perform functional enrichment analysis.
In this lesson, we will:
- Learn more about DAVID
  - Available tools
  - Underlying statistical methods
  - Some expected outputs
  - Data size limits
  - How to get help
- Run an example and interpret results
Potential Files of Interest:
Files used in this tutorial or related to files used in this tutorial include:
- Up-regulated gene list - up_logfold3.txt
- Background gene list - background_expressed.txt
- Differential Expression Results - hcc1395_deg.csv
Background on DAVID
What does DAVID do?
DAVID is a popular bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. It consists of a comprehensive knowledgebase and a set of functional analysis tools. - Sherman et al. 2022
Over 40 functional categories from dozens of independent public sources (databases) are collected and integrated into the DAVID Knowledgebase. - https://davidbioinformatics.nih.gov/helps/knowledgebase/DAVID_gene.html#coverage
DAVID was created and is maintained by the Laboratory of Human Retrovirology and Immunoinformatics at Frederick National Lab. As of 23 July 2024, it has been cited in 72,287 papers since its debut in 2003.
Tools available in DAVID include:
- Functional Annotation Clustering
- Functional Annotation Chart
- Functional Annotation Table
- Gene Functional Classification
- Gene ID Conversion
- Gene Name Batch Viewer
- Ortholog Tool
With DAVID, we can:
- Identify enriched biological themes, particularly GO terms.
- Discover enriched functional-related gene groups.
- Cluster redundant annotation terms.
- Visualize genes on BioCarta & KEGG pathway maps.
- Display many-genes-to-many-terms relationships in 2D.
- Search for functionally related genes not in the list.
- List interacting proteins.
- Explore gene names in batch mode.
Note
DAVID is not just for gene lists associated with humans; it includes annotations relevant to thousands of species.
Basic statistics behind DAVID
Fisher Exact Test
DAVID performs over-representation analysis (ORA) at its core, which aims to find enriched molecular functions, pathways, or other annotations represented by the input gene list.
With DAVID, we are essentially looking at contingency tables (Figure 1). The example in Figure 1 shows the number of user input genes and background genes (selected from the whole genome) that fall onto a particular pathway (e.g., p53 signaling). We want to know how likely it is that the observed number of user input genes mapping to a given pathway would arise by random chance. In other words, do user input genes fall onto the pathway more often than expected given the background? DAVID uses a modified Fisher's exact test to determine whether a pathway is over-represented in our gene list.
The modification is the EASE score, which puts a more conservative spin on Fisher's exact test by subtracting one gene from the count of user input genes that fall onto the pathway (the user-gene side of the contingency table) before the p-value is computed. The EASE score threshold used to filter results is described under Functional Annotation Chart.
Figure 1: Contingency table showing the number of user input genes and background genes from the genome that fall onto a certain pathway. Source: DAVID help documentation.
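To make the relationship between Fisher's exact test and the EASE score concrete, here is a minimal sketch in Python. It is not DAVID's implementation. It takes the four cells of a contingency table as given and computes a one-sided (enrichment) p-value from the hypergeometric distribution, then repeats the calculation with one gene removed from the list hits. The function names and the example counts (3 of 300 list genes and 40 background genes on the pathway) are illustrative assumptions.

```python
from math import comb


def fisher_one_sided(list_hits, pop_hits, list_misses, pop_misses):
    """One-sided Fisher exact p-value for the 2x2 table

        [[list_hits,   pop_hits],
         [list_misses, pop_misses]]

    computed as the upper tail of the hypergeometric distribution.
    """
    total = list_hits + pop_hits + list_misses + pop_misses
    on_pathway = list_hits + pop_hits      # row total: genes on the pathway
    in_list = list_hits + list_misses      # column total: user input genes
    upper = min(on_pathway, in_list)
    tail = sum(
        comb(on_pathway, k) * comb(total - on_pathway, in_list - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / comb(total, in_list)


def ease_score(list_hits, pop_hits, list_misses, pop_misses):
    """Same test after removing one gene from the list hits (more conservative)."""
    if list_hits < 1:
        return 1.0
    return fisher_one_sided(list_hits - 1, pop_hits, list_misses, pop_misses)


# Illustrative counts only (assumed for this sketch, not taken from our data).
cells = dict(list_hits=3, pop_hits=40, list_misses=297, pop_misses=29960)
print("Fisher exact p-value:", fisher_one_sided(**cells))
print("EASE score:          ", ease_score(**cells))
```

Because one supporting gene is removed before testing, the EASE score is always at least as large as the plain Fisher p-value. This penalizes terms supported by only a handful of genes.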
Background gene list
DAVID compares the overlap between a user-provided gene list and an annotation with the overlap between a background gene list and the same annotation. So what is an appropriate background gene list?
Ideally, a background gene list should represent "the ‘universe’ of possible genes that could be called as significantly regulated in the experiment" (Timmons et al. 2015). In RNA-Seq this would not be the whole genome, but rather the genes that were expressed. A background gene list representing the whole genome would be more appropriate for experiments surveying the entire genome (e.g., genetic variation experiments). See Wijesooriya et al. (2022) for a more detailed discussion.
The default background gene list in DAVID is the whole genome; however, you also have the option to upload your own background gene list.
A word of caution.
If you do use a custom background set, "make sure that the genes in your list are found in the background set that you have selected in DAVID; otherwise, DAVID will ignore them." -- DAVID FAQ
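Before uploading a custom background, it can help to check which list genes are missing from it. The following is a small sketch with hypothetical gene identifiers; the function name is an assumption, and in practice the two lists would be read from files such as up_logfold3.txt and background_expressed.txt.

```python
def split_by_background(gene_list, background):
    """Return (kept, dropped): list genes found in / missing from the background."""
    background_set = set(background)
    kept = [gene for gene in gene_list if gene in background_set]
    dropped = [gene for gene in gene_list if gene not in background_set]
    return kept, dropped


# Hypothetical identifiers, for illustration only.
gene_list = ["GENE_A", "GENE_B", "GENE_C"]
background = ["GENE_A", "GENE_C", "GENE_D", "GENE_E"]

kept, dropped = split_by_background(gene_list, background)
print("In background:", kept)
print("Missing from background:", dropped)
```

Any identifiers reported as missing would be ignored in the analysis, so it is worth resolving them (or adding them to the background) before submission.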
Below are some resources for you to learn about or review this statistical procedure.
- Fisher exact test from Wikipedia
- Hypergeometric distribution from Wikipedia
- Fisher exact test from Pathway Commons
- Hypergeometric distribution from Pathway Commons
- EASE score
Multiple Testing Correction
A problem that arises with enrichment analysis is the need to perform multiple statistical tests across many gene sets. In short, the number of type I errors or false positives increases as the number of tests performed increases -- Pathway Commons multiple testing. With DAVID, users can choose among the Bonferroni, Benjamini-Hochberg, and False Discovery Rate (FDR) corrections for multiple testing.
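To illustrate how such corrections change p-values, here is a sketch of the textbook Bonferroni and Benjamini-Hochberg adjustments in plain Python. These are the standard definitions, not DAVID's code, and the values DAVID reports in its own columns may be computed differently. The example p-values are made up.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Textbook Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five annotation terms.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print("Bonferroni:        ", bonferroni(pvals))
print("Benjamini-Hochberg:", benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two. Benjamini-Hochberg controls the expected proportion of false positives among the terms called significant.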
Reducing Redundancy
The Functional Annotation Clustering tool can be used to reduce the redundancy evident in the Functional Annotation Chart results. See more below.
Briefly, DAVID uses the kappa statistic to measure how similar two annotations are in terms of the genes they share, and then applies fuzzy heuristic clustering to group similar annotations.
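As a rough illustration of the kappa statistic, the sketch below computes Cohen's kappa between two annotation terms, treating each gene in the input list as either annotated or not annotated by each term. It is a simplified view, not DAVID's code: the function name, the gene identifiers, and the choice of the input list as the universe are all assumptions.

```python
def kappa(term_a_genes, term_b_genes, gene_list):
    """Cohen's kappa for the agreement of two terms over the genes in gene_list."""
    universe = set(gene_list)
    if not universe:
        raise ValueError("gene_list must not be empty")
    a = set(term_a_genes) & universe
    b = set(term_b_genes) & universe
    n = len(universe)

    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b

    observed = (both + neither) / n
    # Agreement expected by chance, from the marginal totals of each term.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; treated as uninformative
    return (observed - expected) / (1.0 - expected)


# Hypothetical gene list and term memberships.
genes = [f"g{i}" for i in range(1, 11)]
term_a = ["g1", "g2", "g3"]
term_b = ["g1", "g2", "g4"]
print("kappa:", kappa(term_a, term_b, genes))
```

Values near 1 mean the two terms annotate almost the same genes, values near 0 mean the overlap is about what chance would give, and negative values mean less overlap than chance.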
Data Size Limits
DAVID works best with gene lists of 3,000 or fewer genes, and the Functional Annotation Clustering and Gene Functional Classification tools have a 3,000 gene limit.
Getting help
Contact the DAVID Bioinformatics Team via email
Starting an analysis in DAVID
USE GOOGLE CHROME TO INTERACT WITH DAVID
Click on the Start Analysis button to initiate an analysis; this will take us to the Analysis Wizard.
Supplying input
Tasks to do at the Analysis Wizard:
- Provide an input gene list (either copy paste or upload as a text file)
- Specify the gene identifier type. Gene identifiers can be gene symbols, Ensembl IDs, Entrez IDs, GenBank IDs, RefSeq IDs, etc.
Note
DAVID only recognizes “Official Gene Symbol” based on the latest update. Therefore, name changes for "Official Gene Symbols" (e.g., SEPT5 -> SEPTIN5) can impact results.
- Specify whether the input is the "gene list" or the "background gene list".
- Submit the list for analysis
Here, we will provide the genes that are up-regulated in the HCC1395 tumor samples. However, we will not be using results obtained in Module 2, as these were derived from filtered fastqs, which only included reads from Chromosome 22. Instead, we will use up-regulated genes from non-filtered fastqs described here.
Note
These samples were quickly processed using the CCBR pipeliner workflow Renee. The counts file was obtained without checking any of the QC reports to save time.
Up-regulated genes (3,008 total) were obtained by filtering differential expression results obtained from DESeq2 for log2 fold change values greater than or equal to 3 with a false discovery rate of less than or equal to 0.01. This choice was arbitrary, and was primarily driven by DAVID input thresholds. You can download the gene list here.
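The filtering step described above can be sketched in Python as follows. The column names log2FoldChange and padj follow DESeq2's usual output, but the gene_id column name and the example rows are assumptions made for illustration; the actual layout of hcc1395_deg.csv may differ.

```python
import csv
import io


def select_upregulated(rows, min_lfc=3.0, max_padj=0.01):
    """Return gene IDs with log2 fold change >= min_lfc and padj <= max_padj."""
    selected = []
    for row in rows:
        try:
            lfc = float(row["log2FoldChange"])
            padj = float(row["padj"])
        except ValueError:
            continue  # skip rows with missing values such as "NA"
        if lfc >= min_lfc and padj <= max_padj:
            selected.append(row["gene_id"])
    return selected


# Made-up rows standing in for a differential expression table.
EXAMPLE_CSV = """gene_id,log2FoldChange,padj
geneA,4.2,0.0001
geneB,3.5,NA
geneC,1.1,0.0002
geneD,3.0,0.01
"""

rows = csv.DictReader(io.StringIO(EXAMPLE_CSV))
for gene in select_upregulated(rows):
    print(gene)
```

To run this on a real file, replace the in-memory example with an opened CSV file and write the selected IDs one per line, which is the layout DAVID expects for uploads.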
A background gene list including genes with greater than 0 expression can be found here.
background gene list
For simplicity, I am using the default background gene list in DAVID. However, see the above note regarding appropriate gene lists. You can use the file above (background_expressed.txt) to get an idea concerning how your selection of a background can affect the results.
More on the HCC1395 cell line
The HCC1395 cell line was obtained from a 43 year old Caucasian female patient. The HCC1395 cell line is described as being of tissue origin: mammary gland; breast/duct. The HCC1395BL cell line was created from a B lymphoblast that was transformed by the EBV virus. The patient’s cancer was described as: TNM stage I, grade 3, primary ductal carcinoma. This cell line was initiated in the 1990s from a patient with a family history of cancer (patient’s mother had breast cancer). The cell line took 14 months to establish. The patient received chemotherapy prior to isolation of the tumor (PMID: 9833771). This tumor is considered “Triple-Negative” by classic typing: ERBB2 -ve (aka HER2/neu), PR -ve, and ER -ve. Otherwise it is one of those difficult to classify by expression-based molecular typing but is likely of the “Basal” sub-type (PMID: 22003129). The tumor cell line is known to be polyploid. The tumor is also described as TP53 mutation positive. - Griffith Lab
Step 1: Open the up-regulated genes (up_logfold3.txt) locally; copy the identifiers, and paste the gene list as input in the DAVID Analysis Wizard. Alternatively, attach the file (up_logfold3.txt) directly. Uploaded files must be tab-delimited with one gene id / protein id per row. Then select the appropriate identifier (e.g., ENSEMBL_GENE_ID) and specify the list type (Gene List). As indicated by DAVID, a gene list must be submitted prior to any background.
Note
If you are not sure what type of identifier you are working with, you can select "Not Sure" from the list. This will open the Gene ID Conversion Tool. The Gene ID Conversion Tool converts the input gene list to a set of IDs that DAVID recognizes, for cases where you do not know the identifier type or DAVID does not recognize it.
Step 2: After submitting the gene list, DAVID will either proceed directly to the Analysis Wizard, or if more than 20% of input genes were unknown or failed to map to the chosen identifier list, DAVID will proceed to the Gene ID Conversion Tool.
Here, we were directed to the Gene ID Conversion Tool. Because these IDs are Ensembl Gene IDs, I will select "Continue to Submit the IDs That DAVID Could Map".
Ensembl Gene ID
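If you are using Ensembl IDs, you will likely need to remove the version suffix (the part after the period) for many tools to work properly. If working with a gene list, you can use the command-line tool cut to do this, for example by splitting each ID on the period and keeping the first field.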
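The same version-stripping step can also be sketched in Python. The function names here are assumptions, and the IDs are example-format identifiers whose version suffixes are arbitrary.

```python
def strip_version(ensembl_id):
    """Drop the version suffix from an Ensembl ID (text after the first period)."""
    return ensembl_id.strip().split(".", 1)[0]


def strip_versions(ids):
    """Apply strip_version to each non-empty line or ID in an iterable."""
    return [strip_version(i) for i in ids if i.strip()]


# Example-format identifiers; the version numbers are arbitrary.
versioned = ["ENSG00000141510.17", "ENSG00000012048.23", "ENSG00000139618"]
for gene_id in strip_versions(versioned):
    print(gene_id)
```

IDs that carry no version suffix pass through unchanged, so it is safe to apply this to a mixed list.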
Gene ID Conversion Tool
If we instead decided to convert the identifiers, we would do so as follows:
- Select the Identifier to convert to.
- Designate the species associated with these IDs.
- Select "Submit to Conversion Tool"
Once the list is submitted to the Conversion Tool, we can choose to convert all genes or convert each gene individually. Some of the genes may not be in the DAVID database, so the Gene ID Conversion Tool will not be able to convert those. We would need to do some data wrangling to find identifiers for those genes not in the DAVID database. After conversion, we can send the new list back to DAVID either as input or background. Here, we will submit as input.
2,327 gene IDs were able to map. If we want to check out Unmapped IDs, we can select those using "View Unmapped Ids".
This provides a list of IDs that failed to map. We could explore these IDs further outside of DAVID, for example using BioMart or another tool.
Results and interpretation
Annotation Summary Results
Once we have submitted our gene list for analysis, DAVID takes us to the Annotation Summary Results page. Here, we will find a summary of all of the annotations in DAVID associated with our input gene list.
This page confirms the name of our gene list (up), the background gene list that we are using (Homo sapiens), and the number of IDs in our list. We can navigate the Annotation Summary Results page to gain different insights into our data.
Here we can select and explore multiple classes of annotation categories including GO terms, protein-protein interactions, protein functional domains, disease associations, bio-pathways, sequence general features, functional summaries, tissue expression, literature, etc.
From the Annotation Summary Results, we can see information about the number of genes in our list involved in a given category. The annotations for these genes can be viewed by selecting the horizontal bars.
A functional annotation chart report can be viewed for individual annotation categories by selecting "Chart", or for combined categories (see Functional Annotation Chart for more details on combined reports).
To better understand our results, let's take a closer look at the disease annotations.
Disease Annotations
DAVID pulls disease annotations from different sources. Clicking on the Chart button will take us to a chart view showing the disease records from a given database to which our input genes map.
- DISGENET
- GAD_DISEASE (GAD = Genetic Association Database)
- GAD_DISEASE_CLASS
- OMIM
- UniProt - UP_KW_DISEASE where KW stands for keyword
Disease Annotation - chart view
In the chart view, we are presented with the disease(s) found in a particular database that our input genes map to. Notice that we can sort this list by gene count, percentage, p-value, or adjusted p-value. The chart view shows us the results of our modified Fisher's exact test. More on this later.
Clicking on one of the disease terms in the list will provide a description of the term. The source of this description will vary depending on the category selected. For example, terms under DISGENET send us to NCBI's MedGen.
Note
The chart view for some annotation databases may have few or no records. This happens when few or no annotations meet the statistical threshold (default EASE score <= 0.1).
Here, we can also look at related terms by selecting "RT", and we can access a Gene Report by selecting the blue bar.
Related Terms:
Examining the related terms can help you identify related biological processes or terms and get a better idea of the underlying biology. A kappa statistic is used to measure the degree of agreement between the genes participating in two terms. The closer the value is to 1, the greater the similarity. A value greater than 0.7 indicates strong agreement.
Disease Annotation - link to external resource
In the chart view shown above, we clicked on mammary neoplasm, and this took us to the corresponding record in NCBI's MedGen. MedGen is NCBI's database that contains organized information pertaining to human gene-disease relationships.
Disease Annotation - gene view
If we click on the blue bar next to the chart button, we will be taken to a gene view of the disease terms.
Disease Annotation - gene view results
The gene view lists the disease terms for each gene in the annotation category of interest. This is the Functional Annotation Table.
Other annotations
We see a similar organization of information for other annotation categories. For instance, DAVID pulls information on biomolecular processes, functions, and pathways from several sources such as UniProt, GO, KEGG, Reactome, and WikiPathways.
For Gene Ontology (GO), the GO Direct categories are selected by default. These provide GO mappings directly annotated by the source database (no parent terms included). The user can also opt for all levels from GO or specific levels, with level 1 including broader terms and level 5 including more specific terms. The GO FAT category filters out very broad GO terms based on a measured specificity of each term (not level-specificity).
We can look at results across multiple categories by selecting annotation sources (databases) of interest. By default, DAVID will select the annotation sources shown in red, but these can also be deselected.
Pathway Maps
DAVID includes several pathway databases, and also includes a Pathway Viewer for annotations from KEGG, BioCarta, and WikiPathways. This viewer displays the user's genes on pathway maps.
BioCarta
Pathway information generated by BioCarta is no longer maintained by BioCarta or CGAP but is retained in DAVID.
DAVID Pathway Viewer: Proteoglycans in Cancer from KEGG
DAVID Pathway Viewer: Pleural mesothelioma from WikiPathways
Functional Annotation Chart
Chart Report is an annotation term focused view which lists annotation terms and their associated genes under study. -- DAVID help documents
As we have seen, you can view the Functional Annotation Chart for specific annotation categories. However, the "Functional Annotation Chart" button provides our enrichment results across multiple categories. Unless we customize which categories (or databases) to include, DAVID will automatically use pre-selected defaults. This view does not eliminate redundancies across annotation categories and databases.
Chart results have to meet certain criteria to be included:
- EASE Score Threshold (Maximum Probability, p-value) <= 0.1
- Count Threshold (Minimum gene count for an annotation term) >= 2
These thresholds are customizable using "Options".
You can also add additional columns to the output under "Display". For example, you can look at other methods for adjusting p-values, fold enrichment, Fisher's Exact p-value, etc.
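As a hedged illustration of the chart thresholds and the fold enrichment column, the sketch below applies the two default filters to a few made-up terms. Fold enrichment is computed as the proportion of list genes in a term divided by the proportion of background genes in that term. The function names, term names, and counts are all hypothetical.

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Proportion of list genes in the term over proportion of background genes."""
    return (list_hits / list_total) / (pop_hits / pop_total)


def passes_chart_thresholds(count, ease, max_ease=0.1, min_count=2):
    """Apply the default EASE score and gene count thresholds."""
    return count >= min_count and ease <= max_ease


# Made-up chart rows: (term, list hits, list total, pop hits, pop total, EASE).
rows = [
    ("term_1", 12, 300, 200, 30000, 0.0004),
    ("term_2", 1, 300, 50, 30000, 0.05),    # fails the count threshold
    ("term_3", 5, 300, 400, 30000, 0.35),   # fails the EASE threshold
]

for term, hits, list_total, pop_hits, pop_total, ease in rows:
    if passes_chart_thresholds(hits, ease):
        fe = fold_enrichment(hits, list_total, pop_hits, pop_total)
        print(term, "fold enrichment:", round(fe, 2))
```

Fold enrichment complements the p-value: a term can be statistically significant yet only modestly enriched, so it helps to look at both.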
Functional Annotation Clustering
The Functional Annotation Clustering tool groups similar annotations together to reduce the redundancy seen in the Functional Annotation Chart results. This eases the interpretation of the findings.
The Functional Annotation Clustering integrates the same techniques of Kappa statistics to measure the degree of the common genes between two annotations, and fuzzy heuristic clustering (used in Gene Functional Classification Tool) to classify the groups of similar annotations according to Kappa values. In this sense, the more common genes annotations share, the higher chance they will be grouped together. - DAVID Documentation - Functional Annotation Clustering
For a more in-depth understanding of the methods used, see Huang et al. 2007.
What is the Group Enrichment Score?
This is the geometric mean (in -log scale) of the members' p-values in a corresponding annotation cluster. More enriched clusters appear toward the top, with higher group enrichment scores. The score is used to rank biological significance.
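Read literally, that definition amounts to averaging the -log10 of the members' p-values, which equals -log10 of their geometric mean. The sketch below follows that reading. The base-10 logarithm and the example p-values are assumptions, and DAVID's exact calculation may differ in detail.

```python
from math import log10


def group_enrichment_score(pvalues):
    """Mean of -log10(p) over a cluster's member terms.

    Equivalent to -log10 of the geometric mean of the p-values.
    All p-values must be greater than 0.
    """
    if not pvalues:
        raise ValueError("a cluster needs at least one p-value")
    return sum(-log10(p) for p in pvalues) / len(pvalues)


# Made-up p-values for the terms in two annotation clusters.
cluster_1 = [1e-6, 1e-4, 1e-5]
cluster_2 = [0.04, 0.01, 0.08]
print("cluster 1 score:", group_enrichment_score(cluster_1))
print("cluster 2 score:", group_enrichment_score(cluster_2))
```

Under this reading, a score of about 1.3 corresponds to a geometric mean p-value of about 0.05, which gives a rough sense of scale when scanning clusters.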
Gene-Annotation Association View:
The Gene-Annotation Association Viewer can be used to examine the relationships between genes and annotation terms. Green indicates a term is associated with a given gene. We can use this to look at genes shared across terms.
Modifying Parameters:
We can fine tune how DAVID clusters the annotations using the parameters below.
Adjusting parameters for clustering
- "Clustering Stringency (lowest → highest): A high-level single control to establish a set of detailed parameters involved in functional classification algorithms. In general, the higher stringency setting generates less functional groups with more tightly associated genes in each group, so that more genes will be unclustered. The default setting is Medium, which gives balanced results for most cases based on our studies. Customization allows you to control Advanced options." -- DAVID functional classification documentations
- "Similarity Term Overlap (any value ≥ 0; default = 4): The minimum number of annotation terms overlapped between two genes in order to be qualified for kappa calculation. This parameter is to maintain necessary statistical power to make the kappa value more meaningful. The higher the value, the more meaningful the result is." -- DAVID functional classification documentations
- "Similarity Threshold (any value between 0 to 1; default = 0.35): The minimum kappa value to be considered significant. A higher setting will lead to more genes going unclustered, which leads to a higher quality functional classification result with fewer groups and fewer gene members. Kappa value of 0.3 starts giving meaningful biology based on our genome-wide distribution study. Anything below 0.3 has a good chance to be noise." -- DAVID functional classification documentations
- "Initial Group Members (any value ≥ 2; default = 4): The minimum gene number in a seeding group, which affects the minimum size of each functional group in the final cluster. In general, the lower value attempts to include more genes in functional groups, and may generate a lot of small size groups." -- DAVID functional classification documentations
- "Final Group Members (any value ≥ 2; default = 4): The minimum gene number in one final group after a 'cleanup' procedure. In general, the lower value attempts to include more genes in functional groups and may generate a lot of small size groups. It cofunctions with previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, the small group will not be displayed and the genes will go unclustered." -- DAVID functional classification documentations
- "Multi-linkage Threshold (any value between 0% to 100%; default = 50%): This parameter controls how seeding groups merge with each other, i.e. two groups sharing the same gene members over the percentage will become one group. A higher percentage, in general, gives sharper separation (i.e. it generates more final functional groups with more tightly associated genes in each group). In addition, changing the parameter does not cause additional genes to go unclustered." -- DAVID functional classification documentations
Tip
If you do not understand the parameters, stick to the defaults.
Note
The parameters for the Functional Annotation Clustering tool and Functional Classification Tool are the same. These tools apply the same methods and concepts for clustering.
Functional Annotation Table
Provides a gene-centric view which lists the genes and their associated annotation terms... -- DAVID help documents
Gene Functional Classification Result
The Functional Classification Tool provides a rapid means to organize large lists of genes into functionally related groups to help unravel the biological content captured by high throughput technologies. - DAVID documentation - Functional Classification
Genes are grouped on the basis of shared annotation terms. Similar to the Functional Annotation Clustering tool, this tool uses the kappa statistic and a fuzzy heuristic clustering algorithm. "The fuzziness feature of the agglomeration method allows a gene or term to participate in more than one functional group, better reflecting the true 'multiple-roles' nature of genes" (Huang et al. 2007).
Tools are available to investigate consensus terms, enriched terms, and gene-to-term relationships.
Gene Functional Classification vs. Functional Annotation Clustering:
These are related methods. Gene Functional Classification identifies groups of genes sharing similar biological terms, while Functional Annotation Clustering identifies groups of terms sharing similar genes.
DAVID Orthology
The DAVID Orthology tool is a new tool that allows a user to convert a gene list between species.
DAVID Ortholog can convert a gene list from the species under study to a list of orthologs in a targeted species using OMA and Ensembl ortholog pair information. The ortholog lists converted from lesser to better studied species will provide more annotation information thereby helping the user to further understand the gene list under study. - Sherman et al. 2024
The converted gene list can then be used directly with DAVID functional enrichment tools.