nf-core/rnaseq
nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and an extensive QC report. For full details of the pipeline, users are referred to the official nf-core/rnaseq web pages. These pages describe the overall strategies and functionality behind the pipeline. They contain information about running the pipeline in many different environments, but do not provide specifics about the current DNAnexus implementation. The latter information can be obtained here [add link to DNAnexus pages].
DNAnexus Implementation of the nf-core/rnaseq pipeline
Because of the nature of the DNAnexus environment, and the way users interact with it through its GUI, the DNAnexus implementation has a number of unique features. The basic pipeline has been wrapped within a separate process that simplifies the user's interaction with the program, with respect to both input and output. This wrapper makes certain decisions about how the pipeline will be run, which simplifies its use at the cost of some versatility. This "simplified" version should be suitable for most users. For expert users, a version providing full access to all parameters of the program is also available.
- The input samplesheet, which specifies the FASTQ files to be processed by the pipeline, must be customized to adhere to DNAnexus naming conventions. The wrapper program takes care of this, and the user need only provide the list of files to be processed. If the strandedness of the data is known it can be entered; if it is unknown, the auto parameter should be selected, in which case the pipeline automatically selects the most appropriate strandedness. Additionally, the user must indicate how the read1 and read2 files are designated. [Unfortunately, there is no universally accepted nomenclature for distinguishing read1/read2 files.]
- The DNAnexus implementation provides access to prebuilt genome data (including the necessary index files) for the most commonly used genomes (Human and Mouse). These are selected from a simple pull-down menu.
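The wrapper's own logic is not shown here, but the samplesheet step described above can be sketched as follows. This is a minimal illustration only: the column names (sample, fastq_1, fastq_2, strandedness) follow the standard nf-core/rnaseq samplesheet format, while the function names and the default "_R1"/"_R2" designators are assumptions chosen for the example, not part of the DNAnexus wrapper.

```python
import csv
import io
from collections import defaultdict

# Strandedness values accepted in an nf-core/rnaseq samplesheet.
VALID_STRANDEDNESS = {"auto", "forward", "reverse", "unstranded"}


def build_samplesheet(fastq_names, read1_tag="_R1", read2_tag="_R2",
                      strandedness="auto"):
    """Group FASTQ names into samplesheet rows (hypothetical helper).

    read1_tag / read2_tag are the user-supplied read designators;
    "_R1" and "_R2" are only example defaults.
    """
    if strandedness not in VALID_STRANDEDNESS:
        raise ValueError(f"unknown strandedness: {strandedness}")

    reads_by_sample = defaultdict(dict)
    for name in fastq_names:
        if read1_tag in name:
            reads_by_sample[name.split(read1_tag)[0]]["fastq_1"] = name
        elif read2_tag in name:
            reads_by_sample[name.split(read2_tag)[0]]["fastq_2"] = name
        else:
            raise ValueError(f"no read designator found in: {name}")

    rows = []
    for sample in sorted(reads_by_sample):
        reads = reads_by_sample[sample]
        if "fastq_1" not in reads:
            raise ValueError(f"sample {sample} has no read1 file")
        rows.append({
            "sample": sample,
            "fastq_1": reads["fastq_1"],
            # An empty fastq_2 marks a single-end sample.
            "fastq_2": reads.get("fastq_2", ""),
            "strandedness": strandedness,
        })
    return rows


def write_samplesheet(rows):
    """Render the rows as CSV text with the samplesheet header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["sample", "fastq_1", "fastq_2", "strandedness"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, calling build_samplesheet with the names ctrl_R1.fastq.gz and ctrl_R2.fastq.gz yields one paired-end row for the sample "ctrl". A different naming scheme (say "_1"/"_2") would be handled by passing other designators.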
This pipeline produces copious output, some of which is only relevant under special circumstances that are not invoked by the standard version in DNAnexus (see the nf-core/rnaseq output documentation). In general, the output can be grouped under a number of headings, as indicated below. The user should be aware, however, that the wrapper program produces a single summary.html file that provides ready access to an annotated version of the most relevant output.
- QC - The pipeline produces a lot of QC data, generated by several different programs. Fortunately, all of this data is collated by the MultiQC program and can be accessed by viewing the file /multiqc/star_salmon/multiqc_report.html
- Count Matrices - To provide the greatest compatibility with downstream differential expression programs, the gene/transcript count data is reported in several different formats.
Gene Level Counts
salmon.merged.gene_tpm.tsv - normalized counts in TPM (transcripts per million), tab separated
salmon.merged.gene_counts.rds - raw counts, R data file (RDS)
salmon.merged.gene_counts.tsv - raw counts, tab separated
salmon.merged.gene_counts_length_scaled.rds - length-scaled counts, R data file (RDS)
salmon.merged.gene_counts_length_scaled.tsv - length-scaled counts, tab separated
salmon.merged.gene_counts_scaled.rds - scaled counts, R data file (RDS)
salmon.merged.gene_counts_scaled.tsv - scaled counts, tab separated
Transcript Level Counts
salmon.merged.transcript_tpm.tsv - normalized counts in TPM (transcripts per million), tab separated
salmon.merged.transcript_counts.rds - raw counts, R data file (RDS)
salmon.merged.transcript_counts.tsv - raw counts, tab separated
Annotations
salmon_tx2gene.tsv - table relating Ensembl IDs to gene names
- BAM/BigWig Files
- Pipeline Logs
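As an illustration of how the tab-separated count matrices listed above might be read for downstream work, here is a minimal sketch using only the Python standard library. It assumes the usual nf-core/rnaseq layout of a gene_id column, a gene_name column, and then one column per sample; the function name read_count_matrix and the example values are hypothetical. Values are parsed as floats because Salmon reports estimated, possibly fractional, counts.

```python
import csv
import io


def read_count_matrix(handle, id_column="gene_id", skip_columns=("gene_name",)):
    """Read a tab-separated count matrix (hypothetical helper).

    Returns the list of sample names and a dict mapping each
    feature ID to a {sample: value} dict.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column that is not an identifier or annotation is a sample.
    samples = [
        column for column in reader.fieldnames
        if column != id_column and column not in skip_columns
    ]
    counts = {}
    for row in reader:
        counts[row[id_column]] = {s: float(row[s]) for s in samples}
    return samples, counts


# Tiny in-memory example with made-up values standing in for the real file.
example = io.StringIO(
    "gene_id\tgene_name\tsampleA\tsampleB\n"
    "ENSG_X\tGENE_X\t10.5\t0\n"
)
samples, counts = read_count_matrix(example)
print(samples)
print(counts["ENSG_X"]["sampleA"])
```

With a real run, the same function could be applied to an opened salmon.merged.gene_counts.tsv file. For the transcript-level tables the identifier column would need to be changed accordingly, which is why it is a parameter here.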