Practice Lesson 2

For the help sessions, we will work on processing sequences generated in Zhang Z, Feng Q, Li M, Li Z, Xu Q, Pan X, Chen W. Age-Related Cancer-Associated Microbiota Potentially Promotes Oral Squamous Cell Cancer Tumorigenesis by Distinct Mechanisms. Front Microbiol. 2022 Apr 15;13:852566. doi: 10.3389/fmicb.2022.852566. PMID: 35495663; PMCID: PMC9051480.

This study examined differences in the oral microbiome of patients with oral squamous cell cancer. The goal was to determine whether the oral tumor microbiome in young patients was related to disease progression. The study lacks controls that would strengthen the authors' arguments (for example, tumor and non-tumor samples as well as healthy controls), so you should not consider it a model experimental design. However, the small sample size (20 young and 20 old) makes it fairly easy to reproduce.

Note: The authors did not make the sample information available beyond the sample IDs. The metadata provided here was produced by inferring young vs. old status solely from the sample names provided in the SRA.

Download the sequences and import them for further processing with the QIIME 2 platform.

The data is available in the Sequence Read Archive (BioProject PRJNA803155), so the first step is to grab the data from the SRA. For your convenience, we have also created a compressed archive of the sequence files (/data/practice/PRJNA803155.tar.gz).

Make a directory called Practice and unpack this file there. You will still need to grab the accession list from the SRA in Step 1, but you can skip Step 2.

Solution

mkdir Practice   
cd Practice  
tar -xvf /data/practice/PRJNA803155.tar.gz

Step 1: Get the run info from the SRA

According to the data availability statement, the data can be found in PRJNA803155.

Change to the Practice directory created above or make it now. Then make a new directory named raw_data.

Solution

cd Practice  
mkdir raw_data

Get the SRA accession IDs using E-utilities or from NCBI's Run Selector. Save the file containing the accession IDs to `Practice/sra_id.txt`.

Solution

esearch -db sra -query PRJNA803155 | efetch -format runinfo | cut -f 1 -d ',' | grep "SRR" > sra_id.txt
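Before downloading anything, it can help to sanity-check the accession list. The following is a minimal sketch, not part of the original lesson: `check_accessions` is a hypothetical helper name, and it only counts entries, duplicated accessions, and lines that do not look like `SRR` followed by digits. If each of the 20 young and 20 old samples corresponds to a single run (an assumption), you would expect a total of 40.

```shell
# Hypothetical helper: summarize an accession list such as sra_id.txt.
# Reports the number of non-empty lines, the number of accessions that
# occur more than once, and the number of lines that are not SRR<digits>.
check_accessions() {
    awk '
        NF {
            total++
            if ($0 !~ /^SRR[0-9]+$/) bad++
            # Count each duplicated accession once, on its second occurrence.
            if (seen[$0]++ == 1) dups++
        }
        END {
            printf "total=%d duplicated=%d malformed=%d\n", total, dups, bad
        }
    ' "$1"
}
```

From the Practice directory, this could be run as `check_accessions sra_id.txt`.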

Step 2: Download the data

Download the data using prefetch and fasterq-dump.

Solution

cd raw_data  
cat ../sra_id.txt | while read -r sra_id; do prefetch "$sra_id"; fasterq-dump "$sra_id"; gzip "${sra_id}"*.fastq; done

What format are the sequences in? How can you import them? See this forum post for guidance.
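One way to approach the first question is to look inside the files. The sketch below assumes gzipped FASTQ files like those in raw_data; the helper names `show_first_record` and `count_reads` are hypothetical. A FASTQ record spans four lines (header, sequence, `+` separator, quality string), so dividing the line count by four gives the number of reads.

```shell
# Hypothetical helper: print the first record (four lines) of a
# gzipped FASTQ file.
show_first_record() {
    gzip -dc "$1" | head -n 4
}

# Hypothetical helper: count reads in a gzipped FASTQ file.
# Each record occupies exactly four lines, so reads = lines / 4.
# A non-integer result suggests a truncated or malformed file.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

For paired-end data, the `_1` and `_2` files of a sample should report the same number of reads.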

Step 3: Create the manifest

We will need a manifest file for the import. See the Import tutorial. Note: Despite what is written in the import tutorial, the manifest file can be comma-separated, depending on the format that you specify at import. We will create the manifest file in our Practice directory.

Solution

cd ~/Practice  

echo "sample-id,absolute-filepath,direction" > q2_manifest.csv  

cat sra_id.txt | while read sra_id; do echo "$sra_id,$PWD/raw_data/${sra_id}_1.fastq.gz,forward" >> q2_manifest.csv; echo "$sra_id,$PWD/raw_data/${sra_id}_2.fastq.gz,reverse" >> q2_manifest.csv; done
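A manifest that points to files that do not exist will cause the import to fail, so it may be worth checking the paths first. This is a minimal sketch under the assumption that the manifest uses the three-column, comma-separated layout created above; `check_manifest_csv` is a hypothetical helper name.

```shell
# Hypothetical helper: report manifest entries whose file is missing.
# Assumes the layout: sample-id,absolute-filepath,direction
# Prints one line per missing file; returns non-zero if any are missing.
check_manifest_csv() {
    missing=0
    {
        read -r header    # skip the header line
        while IFS=, read -r sample path direction; do
            if [ ! -f "$path" ]; then
                echo "missing: $sample ($direction) $path"
                missing=1
            fi
        done
    } < "$1"
    [ "$missing" -eq 0 ]
}
```

For example, `check_manifest_csv q2_manifest.csv && echo "all files found"` prints the confirmation only when every listed file exists.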

Step 4: Import

To import, we will need to keep in mind that our samples are paired-end with quality information and that we are using a manifest format. Note: Phred 64 quality scores are associated with older data, so most data will have Phred 33 quality scores. For more information on quality scores, see this technical note from Illumina.
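If you are unsure which encoding a file uses, a common heuristic is to look at the characters in the quality strings. The sketch below is an illustration, not a definitive test: `guess_phred_offset` is a hypothetical helper name. It relies on the fact that ASCII characters `!` (33) through `:` (58) fall below the range used by Phred 64 encodings, so finding any of them points to Phred 33. If none are found, the file may be Phred 64, or it may be Phred 33 data with uniformly high scores.

```shell
# Hypothetical helper: guess the Phred offset of a gzipped FASTQ file.
# Quality strings are every fourth line. Characters '!' to ':' (ASCII
# 33-58) can only appear in Phred+33 quality strings.
# Reads the whole file, so it may be slow on very large inputs.
guess_phred_offset() {
    gzip -dc "$1" | LC_ALL=C awk '
        NR % 4 == 0 && /[!-:]/ { low = 1 }
        END { print (low ? "33" : "64 or ambiguous") }
    '
}
```

For example, `guess_phred_offset raw_data/<accession>_1.fastq.gz`, where `<accession>` stands for any run ID in sra_id.txt.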

We will save our imported data to a new directory named 01_import.

Solution

mkdir 01_import
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path q2_manifest.csv \
--output-path 01_import/import.qza \
--input-format PairedEndFastqManifestPhred33

Summarize the import

Solution

qiime demux summarize \
--i-data 01_import/import.qza \
--o-visualization 01_import/import.qzv

Again, to view this file, you will need to move it to public.

Note: It is easier to create the comma-separated manifest. However, as you have seen, the recommended format is tab-separated with the following header:

sample-id   forward-absolute-filepath   reverse-absolute-filepath

To get this to work, you could use the following:

Solution

# Create tab-delimited manifest

echo "sample-id"$'\t'"forward-absolute-filepath"$'\t'"reverse-absolute-filepath" > q2_manifest2.tsv

cat sra_id.txt | while read -r sra_id; do echo "$sra_id"$'\t'"$PWD/raw_data/${sra_id}_1.fastq.gz"$'\t'"$PWD/raw_data/${sra_id}_2.fastq.gz" >> q2_manifest2.tsv; done

#Import  
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path q2_manifest2.tsv \
--output-path 01_import/import2.qza \
--input-format PairedEndFastqManifestPhred33V2
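
As an alternative to building the tab-separated lines with echo, `printf` can write the tabs explicitly, which some find easier to read. This is a minimal sketch: `write_manifest_tsv` is a hypothetical helper name, and it assumes the files are named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` as in the steps above.

```shell
# Hypothetical helper: write a tab-separated manifest with printf.
# Arguments: accession list, directory holding the FASTQ files, output file.
# Assumes files are named <accession>_1.fastq.gz and <accession>_2.fastq.gz.
write_manifest_tsv() {
    ids="$1"
    raw_dir="$2"
    out="$3"
    # Header line expected by the tab-separated manifest format.
    printf '%s\t%s\t%s\n' \
        sample-id forward-absolute-filepath reverse-absolute-filepath > "$out"
    # One line per accession: id, forward path, reverse path.
    while read -r sra_id; do
        printf '%s\t%s\t%s\n' "$sra_id" \
            "$raw_dir/${sra_id}_1.fastq.gz" \
            "$raw_dir/${sra_id}_2.fastq.gz" >> "$out"
    done < "$ids"
}
```

From the Practice directory, this could be called as `write_manifest_tsv sra_id.txt "$PWD/raw_data" q2_manifest2.tsv`.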