Practice Lesson 2
For the help sessions, we will work on processing sequences generated in Zhang Z, Feng Q, Li M, Li Z, Xu Q, Pan X, Chen W. Age-Related Cancer-Associated Microbiota Potentially Promotes Oral Squamous Cell Cancer Tumorigenesis by Distinct Mechanisms. Front Microbiol. 2022 Apr 15;13:852566. doi: 10.3389/fmicb.2022.852566. PMID: 35495663; PMCID: PMC9051480.
This study examined differences in the oral microbiome of patients with oral squamous cell cancer. The goal was to determine whether the oral tumor microbiome in young patients was related to disease progression. While this study lacks controls that would make the authors' arguments stronger - for example, it would be nice to see tumor and non-tumor samples as well as healthy controls - the small sample size (20 young and 20 old) makes it fairly easy to reproduce. However, you should not consider this an example of a model experimental design.
Note: the authors did not make the sample information available beyond the sample ids. The metadata provided here resulted from inferring young vs old samples by sample name, as provided in the SRA, alone.
Download the sequences and import for further processing with the QIIME2 platform.
The data is available in the Sequence Read Archive (BioProject PRJNA803155), so the first step is to grab the data from the SRA. For your convenience, we have also created a compressed archive of the sequence files (/data/practice/PRJNA803155.tar.gz).
To use the archive, make a directory called Practice and unpack the file there. You will still need to grab the accession list from the SRA in Step 1, but you can skip Step 2.
Solution
mkdir Practice
cd Practice
tar -xvf /data/practice/PRJNA803155.tar.gz
Step 1: Get the run info from the SRA
According to the data availability statement, the data can be found in PRJNA803155.
Change to the Practice directory created above or make it now. Then make a new directory named raw_data.
Solution
cd Practice
mkdir raw_data
Get the SRA accession IDs using e-utilities or from NCBI's Run Selector. Save the file containing the accession IDs to Practice/sra_id.txt.
Solution
esearch -db sra -query PRJNA803155 | efetch -format runinfo | cut -f 1 -d ',' |grep "SRR" > sra_id.txt
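Before downloading anything, it can be worth checking that the accession list looks sane. The following is a minimal sketch, not part of the original lesson: check_accessions is a hypothetical helper name, and it assumes the file holds one run accession per line.

```shell
# Hypothetical helper: summarize an accession list and flag obvious problems.
# Assumes one run accession (SRR/ERR/DRR followed by digits) per line.
check_accessions() {
  awk '
    /^[[:space:]]*$/ { next }               # ignore blank lines
    { total++ }
    !/^[SED]RR[0-9]+$/ { bad++ }            # not a run accession
    seen[$0]++ == 1 { dupes++ }             # count each repeated ID once
    END {
      printf "accessions=%d malformed=%d duplicated=%d\n", total, bad, dupes
      exit (bad > 0 || dupes > 0)           # non-zero status if anything looks off
    }
  ' "$1"
}
```

From the Practice directory, run it as: check_accessions sra_id.txt. If each sample in the study corresponds to a single run, the count should match the number of samples, but that is an assumption to verify against the Run Selector.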
Step 2: Download the data
Download the data using prefetch and fasterq-dump.
Solution
cd raw_data
cat ../sra_id.txt | while read sra_id; do prefetch $sra_id; fasterq-dump $sra_id; gzip ${sra_id}*.fastq;done
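Downloads occasionally fail partway through, so it may help to confirm that both mates exist for every accession before building the manifest. The sketch below assumes the _1.fastq.gz / _2.fastq.gz naming used in the manifest step of this lesson; check_pairs is a hypothetical helper name.

```shell
# Hypothetical helper: verify that each accession has non-empty
# <id>_1.fastq.gz and <id>_2.fastq.gz files in the given directory.
check_pairs() {
  local ids="$1" dir="$2" missing=0 sra_id mate
  while read -r sra_id; do
    [ -n "$sra_id" ] || continue            # skip blank lines
    for mate in 1 2; do
      if [ ! -s "$dir/${sra_id}_${mate}.fastq.gz" ]; then
        echo "missing: ${sra_id}_${mate}.fastq.gz"
        missing=$((missing + 1))
      fi
    done
  done < "$ids"
  echo "missing_files=$missing"
  [ "$missing" -eq 0 ]                      # non-zero status if anything is missing
}
```

From the Practice directory, run it as: check_pairs sra_id.txt raw_data. To look at the first record of any file, gzip -dc on that file piped to head -n 4 is enough.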
What format are the sequences in? How can you import them? See this forum post for guidance.
Step 3: Create the manifest
We will need to use a manifest file to import. See the Import tutorial. Note: The manifest file can be comma separated depending on the format that you use at import, despite what is written in the import tutorial. We will create the manifest file in our Practice directory.
Solution
cd ~/Practice
echo "sample-id,absolute-filepath,direction" > q2_manifest.csv
cat sra_id.txt | while read sra_id; do echo "$sra_id,$PWD/raw_data/${sra_id}_1.fastq.gz,forward" >> q2_manifest.csv; echo "$sra_id,$PWD/raw_data/${sra_id}_2.fastq.gz,reverse" >> q2_manifest.csv; done
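A malformed manifest is a common reason for import failures, so a quick structural check can save time. This is a sketch under the assumption that the manifest uses exactly the three-column comma-separated layout written above; check_manifest is a hypothetical helper, not a QIIME 2 command.

```shell
# Hypothetical helper: structural check of a comma-separated manifest with
# the header "sample-id,absolute-filepath,direction".
check_manifest() {
  awk -F',' '
    NR == 1 {
      if ($0 != "sample-id,absolute-filepath,direction") {
        print "unexpected header: " $0; bad++
      }
      next
    }
    NF != 3 { print "line " NR ": expected 3 fields"; bad++; next }
    $2 !~ /^\// { print "line " NR ": path is not absolute"; bad++ }
    $3 == "forward" { fwd[$1]++ }
    $3 == "reverse" { rev[$1]++ }
    { ids[$1] = 1 }
    END {
      for (id in ids) {
        n++
        if (fwd[id] != 1 || rev[id] != 1) {
          print id ": needs exactly one forward and one reverse row"; bad++
        }
      }
      printf "samples=%d problems=%d\n", n, bad
      exit (bad > 0)                        # non-zero status if any problem was found
    }
  ' "$1"
}
```

Run it as: check_manifest q2_manifest.csv. It only inspects the text of the manifest; it does not confirm that the listed files exist.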
Step 4: Import
To import, we will need to keep in mind that our samples are paired-end with quality information and that we are using a manifest format. Note: Phred 64 quality scores are associated with older data, so most data will have quality scores that are Phred 33. For more information on quality scores, see this technical note from Illumina.
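If you are unsure which offset a file uses, the quality lines themselves give a hint. Phred 64 encodings do not use characters below ASCII 59, so any quality character in the range '!' to ':' (ASCII 33 to 58) indicates Phred 33. The sketch below applies that heuristic to the first 10,000 reads; guess_phred_offset is a hypothetical helper, and a result of 64 is only suggestive, since a Phred 33 file containing only very high scores would look the same.

```shell
# Hypothetical helper: guess the Phred offset of uncompressed FASTQ on stdin.
# Assumes standard 4-line FASTQ records (quality string on every 4th line).
guess_phred_offset() {
  if awk 'NR % 4 == 0; NR >= 40000 { exit }' | LC_ALL=C grep -q '[!-:]'; then
    echo 33    # found a character that only Phred 33 can produce
  else
    echo 64    # nothing below ASCII 59 seen; probably Phred 64
  fi
}
```

For a compressed file, pipe it in with gzip -dc, for example: gzip -dc on one of the files in raw_data, piped to guess_phred_offset.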
We will save our imported data to a new directory named 01_import.
Solution
mkdir 01_import
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path q2_manifest.csv \
--output-path 01_import/import.qza \
--input-format PairedEndFastqManifestPhred33
Summarize import
Solution
qiime demux summarize \
--i-data 01_import/import.qza \
--o-visualization 01_import/import.qzv
Again, to view this file, you will need to move it to public.
Note: It is easier to create the comma-separated manifest. However, as you have seen, the recommended format is tab-separated with the following header:
sample-id forward-absolute-filepath reverse-absolute-filepath
To get this to work, you could use the following:
Solution
#Create tab delimited manifest
echo sample-id$'\t'forward-absolute-filepath$'\t'reverse-absolute-filepath > q2_manifest2.tsv
cat sra_id.txt | while read sra_id; do echo $sra_id$'\t'$PWD/raw_data/${sra_id}_1.fastq.gz$'\t'$PWD/raw_data/${sra_id}_2.fastq.gz >> q2_manifest2.tsv; done
#Import
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path q2_manifest2.tsv \
--output-path 01_import/import2.qza \
--input-format PairedEndFastqManifestPhred33V2
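If you have already built the comma-separated manifest, the tab-separated version can also be derived from it rather than regenerated from the accession list. The following sketch assumes the three-column comma-separated layout from Step 3 and one forward and one reverse row per sample; csv_manifest_to_v2 is a hypothetical helper name.

```shell
# Hypothetical helper: convert the comma-separated manifest
# (sample-id,absolute-filepath,direction) to the tab-separated layout
# (sample-id, forward-absolute-filepath, reverse-absolute-filepath).
csv_manifest_to_v2() {
  awk -F',' '
    NR == 1 { next }                                  # drop the CSV header
    !($1 in seen) { seen[$1] = 1; order[++n] = $1 }   # remember first-seen order
    $3 == "forward" { fwd[$1] = $2 }
    $3 == "reverse" { rev[$1] = $2 }
    END {
      printf "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n"
      for (i = 1; i <= n; i++)
        printf "%s\t%s\t%s\n", order[i], fwd[order[i]], rev[order[i]]
    }
  ' "$1"
}
```

Run it as: csv_manifest_to_v2 q2_manifest.csv > q2_manifest2.tsv. The result should match the file produced by the loop above, which makes it a convenient cross-check of the two manifests.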