Lesson 4: Feature table filtering, taxonomic classification, and phylogeny

Learning objectives

learn how to apply different types of filtering to your ASV table and representative sequence data.
classify your ASVs.
Generate a phylogenetic tree.

Now that we have imported and denoised, let's move on to feature table filtering, taxonomic classification, and phylogenetic tree construction. For this lesson, we will use a much larger feature table that contains the entire data set from Taur et al. 2018 (See the QIIME2 Cancer Microbiome Tutorial). This will make the analysis more interesting.

We are going to skip a few filtering steps compared with the QIIME2 tutorial. There are great filtering functions in QIIME2. We will apply some of these, but you should have a look at the filtering tutorial. We will work directly with the autoFMT study and will skip the filtering step that returns only this group (--p-where 'autoFmtGroup IS NOT NULL').

Filtering

Methods of filtering

Frequency based filtering, contingency based filtering, metadata based filtering
qiime feature-table filter-samples
- Filter samples based on total number of sequences (e.g., --p-min-frequency 1000)
- Filter samples with a minimal number of features (e.g., --p-min-features 10)
- Metadata based filtering:
  - Retain only samples present in the metadata (e.g., --m-metadata-file samples-to-keep.tsv)
  - Retain samples based on a metadata description (e.g., --p-where "[subject]='subject-1'" ). Uses SQLite WHERE-clause syntax (Use IN, AND, OR, AND NOT).
qiime feature-table filter-features
- Remove features (ASVs) with low abundances after summing across all samples (e.g., --p-min-frequency 10)
- Remove features found in only a small number of samples (e.g., --p-min-samples 2)
Taxonomy based filtering (qiime taxa filter-table, qiime taxa filter-seqs) - MORE ON THIS BELOW
Distance matrix filtering (qiime diversity filter-distance-matrix)

The filtering you apply will depend on your data and your questions. I tend to filter out low abundant taxa and depending on my question I may also filter taxa with low prevalence.

Metadata based filtering

We are going to eliminate some samples based on the timing of collection. We are interested in only the samples collected between ten days prior to their hematopoietic cell transplantation and seventy days post transplantation.

To do this, let's use the --p-where option of qiime feature-table filter-samples

qiime feature-table filter-samples \
  --i-table /data/autofmt-table.qza \
  --m-metadata-file /data/sample-metadata.tsv \
  --p-where 'DayRelativeToNearestHCT BETWEEN -10 AND 70' \
  --o-filtered-table filtered-table-1.qza

Notice the SQLite syntax with key words BETWEEN and AND. If confused by the syntax, check out the filtering tutorial for other examples.

The next two steps will primarily focus on reducing run times. You may or may not want to include them in your own analysis workflow.

Feature filtering

We can filter features observed in only a single sample.

qiime feature-table filter-features \
  --i-table filtered-table-1.qza \
  --p-min-samples 2 \
  --o-filtered-table filtered-table-2.qza

Unless we explicitly remove them, removed features will remain in our FeatureData (Representative Sequences). This isn't necessarily a problem, but eliminating them can reduce computation time when classifying our sequences and generating a phylogenetic tree. So, we will do that here. Also, because we are starting from a larger data set in this part of the tutorial, we will need a new, more comprehensive representative features table.

Let's use wget to get the new representative sequences.

wget \
  -O 'rep-seqs.qza' \
  'https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/data/030-tutorial-downstream/010-filtering/rep-seqs.qza'

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table-2.qza \
  --o-filtered-data filtered-sequences-1.qza

Let's summarize our newly filtered tables.

qiime feature-table summarize \
  --i-table filtered-table-2.qza \
  --m-sample-metadata-file /data/sample-metadata.tsv \
  --o-visualization filtered-table-summ.qzv
qiime feature-table tabulate-seqs \
  --i-data filtered-sequences-1.qza \
  --o-visualization filt-asv-sequences-summ.qzv

Taxonomy

After denoising, we have unique ASVs or sequences, which in itself can be quite revealing. However, often we want to classify our sequences to know more about the organisms represented by our ASVs, particularly regarding evolutionary relationships. In QIIME2, we can classify our representative sequences using the q2-feature-classifier plugin and generate a phylogenetic tree via de novo or reference based methods.

Alignment based taxonomy consensus classifiers:

BLAST+ - local sequence alignment followed by consensus taxonomy classification

VSEARCH - global sequence alignment followed by consensus taxonomy classification

These essentially align sequences to references and take the top matches (maxaccepts) above some threshold of similarity (perc_identity) and then assign taxonomy by how well the top matches agree (min_consensus).

maxaccepts - the maximum number of hits to keep for each query
perc_identity - the percentage of similarity below which a match is rejected
min_consensus - the minimum fraction of assignments that must match the top hits from maxaccepts to be accepted as consensus assignment

A nice description of these can be found in Bokulich, N.A., Kaehler, B.D., Rideout, J.R. et al. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome 6, 90 (2018). https://doi.org/10.1186/s40168-018-0470-z.

Machine learning approach

The recommended method involves applying a scikit-learn multinomial naive Bayes classifier. This uses machine learning to train a classifier on a referene database and then use the trained classifier to classify your ASVs.

These classifiers work best if they are trained on your target region of interest (Werner et al. 2012), in this case V4-V5. While there are some pre-trained classifiers in the QIIME 2 Data Resources, the classifier used here was created according to the Training feature classifiers with q2-feature-classifier tutorial using Greengenes v 13.8.

The code to train the classifier

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer AYTGGGYDTAAAGNG \
  --p-r-primer CCGTCAATTYHTTTRAGT \
  --p-min-length 100 \
  --p-max-length 400 \
  --p-n-jobs 10 \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier gg-13-8-99-563-926-nb-classifier.qza

Note: Files to create a train classifier are not available in DNAnexus. This code is simply for example.

We will use our trained classifier to annotate our ASVs.

qiime feature-classifier classify-sklearn \
  --i-classifier /data/gg-13-8-99-563-926-nb-classifier.qza \
  --i-reads filtered-sequences-1.qza \
  --o-classification taxonomy.qza

And generate a summary of the results:

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Note: You can increase classification accuracy by providing taxonomic weights for certain common sample types (See QIIME 2 clawback)

Taxonomic based filtering

By targeting 16S rRNA, we want to target bacteria and archaea. Therefore, we can exclude sequences that are unexpected such as those from chloroplasts or mitochondria. By setting --p-include p__, we are retaining only sequences annotated at a minimum to the phylum level. Note: this will look different depending on the database used. Greengenes specifically uses the following format for annotations: k__;p__;c__;o__;f__;g__;s__. Also, --p-mode contains ensures that search terms are case insensitve (e.g., mitochondria versus Mitochondria).

qiime taxa filter-table \
  --i-table filtered-table-2.qza \
  --i-taxonomy taxonomy.qza \
  --p-mode contains \
  --p-include p__ \
  --p-exclude 'p__;,Chloroplast,Mitochondria' \
  --o-filtered-table filtered-table-3.qza

Let's also remove these from our feature data to save computational time later.

qiime feature-table filter-seqs \
  --i-data filtered-sequences-1.qza \
  --i-table filtered-table-3.qza \
  --o-filtered-data filtered-sequences-2.qza

Visualizing our taxonomy

We can visualize sample by sample taxonomic composition using a stacked bar plot generated with qiime taxa barplot. Let's take a look.

qiime taxa barplot \
  --i-table filtered-table-3.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file /data/sample-metadata.tsv \
  --o-visualization taxa-bar-plots-1.qzv

Phylogeny

In addition to clasifying our organisms, we also want to reconstruct their phylogenetic relationships by generating a phylogenetic tree. We often assume that phylogenetic closeness can elucidate commonalities in phenotypic properties / functions, so it is worth examining. Moreover, we will need these trees later for alpha and beta diversity measures.

We can generate a phylogenetic tree using:
- de novo methods - constructed without a reference database
- reference based methods - uses a reference database
- SEPP fragment insertion will insert sequences into a reference phylogeny, with the caveat that sequences not at least 75% similar to any sequence in the tree is discarded. (This is a nice solution for meta analyses, but can be computationally intensive.)

For more details, refer to the QIIME 2 documentation.

We will perform a de novo method here, using a pipeline available in the q2-phylogeny plugin (qiime phylogeny align-to-tree-mafft-fasttree).

This will run through the following steps:

generates a multiple sequence alignment
masks the alignment (removes errors, uniformative sites, repetitive regions, which can lose some resolution)
builds a tree [de novo methods output an unrooted tree, which lacks directionality (no inferrred ancestry)]
root the tree (a rooted tree is required for alpha and beta diversity metrics) - the unrooted tree is rooted at its midpoint

Example of unrooted vs rooted (at midpoint):

Image from The Cabbages of Doom, How to root a phylogenetic tree

Let's run the pipeline:

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences filtered-sequences-2.qza \
  --output-dir phylogeny-align-to-tree-mafft-fasttree

There is not visual output to view here.