04. How to make a custom genome and view in IGV copy

This page uses content directly from the Biostar Handbook by Istvan Albert.

Remember to activate the bioinfo environment.

conda activate bioinfo

Then create a new directory for files we will be working with today in your biostar_class directory.

cd biostar_class
mkdir igv_data
cd igv_data

Today we are going to use IGV, the Integrative Genomics Viewer by the Broad Institute, to visualize a genome and the features of that genome.

We will work with a genome that is not already in IGV, so we need to download and prepare the data files. 1. You should have IGV downloaded and working on your desktop. 2. Double-click the IGV icon to load IGV. 3. Use the box in the upper left corner of the IGV window to see all the genomes preloaded. Is the Ebola virus genome in the list?

Now minimize the IGV window while we prepare the sequence data.

To view this genome in IGV, we will need to: 1. Identify the genome file. (FASTA) 2. Create or find the genome feature file. (GFF3) 2. Create an index for the genome. (FAI) 3. Load the genome into IGV.

We'll start by retrieving a file from GenBank/NCBI. Remember you can always go to the NCBI web page and use the Graphical User Interface (GUI) to retrieve the same information. But here we will do everything at the command line.

efetch -db nucleotide -format gb -id AF086833 > AF086833.gb

This command line should look familiar. What is it doing?

Did you notice, the database is "nucleotide", not "nuccore"? Are they interchangeable?

Screen Shot 2020-05-04 at 6 07 05 PM

(Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D7–D17, Database resources of the National Center for Biotechnology Information) "GenBank is the primary nucleotide sequence archive at NCBI and is a member of the INSDC. Sequences from GenBank are available from three Entrez databases: Nucleotide, EST and GSS (specified as nuccore, nucest and nucgss within the E-utilities). The Nucleotide database contains all GenBank sequences except those within the EST or GSS GenBank divisions. The database also contains WGS sequences, Third Party Annotation (TPA) sequences and sequences imported from the Structure database."

Apparently, both "nucleotide" and "nuccore" will work to retrieve from NCBI nucleotide database.

Convert the GenBank file into GFF3. What information in the GenBank file is used in the GFF3-formatted file?

Look at the GenBank file with "less".

less AF086833.gb

Do you see any "gene features" such as coding sequence (CDS), regulatory regions, and upstream UTR locations?

 5'UTR           1..55
                     /note="putative leader region"
                     /citation=[5]
                     /function="regulation or initiation of RNA replication"
     gene            56..3026
                     /gene="NP"
     mRNA            56..3026
                     /gene="NP"
                     /product="nucleoprotein"
     regulatory      56..67
                     /regulatory_class="other"
                     /gene="NP"
                     /note="putative; transcription start signal"
                     /citation=[5]
     CDS             470..2689
                     /gene="NP"
                     /function="encapsidation of genomic RNA"
                     /codon_start=1
                     /product="nucleoprotein"
                     /protein_id="AAD14590.1"
                     /translation="MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVY
                     QVNNLEEICQLIIQAFEAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGH
                     GFRFEVKKRDGVKRLEELLPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPK
                     LVVGEKACLEKVQRQIQVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQG
                     MHMVAGHDANDAVISNSVAQARFSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNE

The gene features from the GenBank file are re-formatted to GFF3 format by the "seqret" program. Could you create a GFF3 (Gene Feature Format) file from a FASTA file? How about a FASTQ file?

cat AF086833.gb | seqret -filter -feature -osformat gff3 > AF086833.gff

cat -> The cat utility reads files sequentially, writing them to the standard output.

pipe -> the "|" symbol - piping the results from "cat" into "seqret". Could also be done in two steps like this. Using the "pipe" allows us to combine the commands into one step, one command line, and is more efficient.

cat AF086833.gb > sequence.output
seqret -sequence sequence.output -filter -feature -osformat gff3 > AF086833.gff

Also can be done a third way.

seqret -sequence AF086833.gb -filter -feature -osformat gff3 > AF086833.gff

In Unix, "there's more than one way to do it."

What do these seqret options mean (-filter, -feature, -osformat)?

seqret --help

-feature (Use feature information)

Just using "--help" does not get us all the information.

seqret --help --verbose

-filter (Read first file from standard input, write first file to standard output.)

-osformat (Output seq format)

What is GFF format?

GFF -> Genome Feature Format

GFF3 -> version 3

GFF3 is a file format for storing genomic features in a text file. GFF files are plain text, nine column, tab-delimited files.

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.' The first line of a GFF3 file must be a comment that identifies the version, e.g.

##gff-version 3

seqid - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
source - name of the program that generated this feature, or the data source (database or project name)
type - type of feature. Must be a term or accession from the SOFA sequence ontology
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Use "less" to peek inside the gff file. What do you see?

less AF086833.gff

Something like this. Is this tab-delimited? Does it have nine columns? Is the first line the version number of gff?

##gff-version 3
##sequence-region AF086833 1 18959
#!Date 2020-05-04
#!Type DNA
#!Source-version EMBOSS 6.6.0.0
AF086833        EMBL    databank_entry  1       18959   .       +       .       ID=AF086833.1;organism=Ebola virus - Mayinga%2C Zaire%2C 1976;mol_type=viral cRNA;strain=Mayinga;db_xref=taxon:128952
AF086833        EMBL    five_prime_UTR  1       55      .       +       .       ID=AF086833.2;note=putative leader region;citation=[5];function=regulation or initiation of RNA replication

Okay -so we've created a GFF(.gff) file from the GenBank (.gb) file. Now, we need the genome and the genome index file, so we can load everything into IGV.

The genome file needs to be in FASTA format. We can get that by converting the GenBank file into FASTA using the "seqret" program. So, the "seqret" program can be used to convert Genbank files (.gb) to (.gff) files OR (.fasta) files (and more).

cat AF086833.gb | seqret -filter -feature -osformat fasta > AF086833.fa

Next, we need to create an index of the FASTA file. What is indexing and why do we have to do it? Picture a book index, which helps you find specific regions in the book without having to look at every page. Indexing a genome file is a similar idea. An indexed genome file helps the aligner find sequences while reducing both time and memory requirements.

samtools is a set of utilities used to manipulate next generation sequencing data files. samtools "faidx" is used to index regions of a FASTA or FASTQ file.

The index file typically has the same filename as the corresponding FASTA/FASTQ file, with .fai appended. An fai index file is a text file consisting of lines each with five TAB-delimited columns for a FASTA file and six for FASTQ:

NAME Name of this reference sequence
LENGTH Total length of this reference sequence, in bases
OFFSET Offset in the FASTA/FASTQ file of this sequence's first base
LINEBASES The number of bases on each line
LINEWIDTH The number of bytes in each line, including the newline
QUALOFFSET Offset of sequence's first quality within the FASTQ file

samtools faidx AF086833.fa

Now take a look at the index file (.fai). Do you see tab-delimited columns? How many?

cat AF086833.fa.fai | head
AF086833 18959 72 60 61

So - we've got all three files we need, the fasta file, the index file and the gff file. Let's go back to IGV.

Open IGV (double-click icon) Genomes -> Load Genome from File (maneuver to your biostar_class folder, find AF086833.fa)

File -> Load from File (maneuver to your biostar_class folder, find AF086833.gff)

Can you reproduce this image?

How?

Screen Shot 2020-05-04 at 5 36 13 PM

Screen Shot 2020-05-04 at 5 36 35 PM