01. Retrieving data from NCBI with E-Utilities
This page uses content directly from the Biostar Handbook by Istvan Albert.
Always remember to start the bioinformatics environment when working on Biostar class material.
conda activate bioinfo
Let's start by creating a directory for class data (if you do not already have one).
mkdir biostar_class
cd biostar_class
mkdir genbank
cd genbank
ls
pwd
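As a minimal sketch, the directory setup above can be collapsed into fewer commands. It assumes you are starting from the folder where you want biostar_class to live; the -p flag lets mkdir create both levels at once and makes it safe to rerun if the directories already exist.

```shell
# -p creates missing parent directories and does not complain
# if biostar_class or genbank already exist
mkdir -p biostar_class/genbank
cd biostar_class/genbank

# Confirm where we ended up
pwd
```

Running it a second time from the starting folder gives the same result, which is handy when you are not sure whether the class directory was created earlier.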
efetch -db nuccore -id NC_001501 -format gb > NC_001501.gb
EFetch is one of NCBI's "E-Utilities", which allow access to NCBI databases from the command line. Each utility (EInfo, ESearch, EPost, EFetch, ELink, EGQuery, ESpell, ECitMatch) has its own required parameters. Together they are the gateway to the "Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature".
"EFetch" is provided by the "Entrez-Direct" tools, which were installed during the computer set-up phase of the course.
Your command line so far...
efetch
efetch --help
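If typing efetch gives a "command not found" error, the bioinformatics environment is probably not active. The following is a small sketch of one way to check, using the standard shell built-in command -v; the variable name status and the wording of the messages are only illustrative.

```shell
# command -v prints the path of a program if the shell can find it
# and returns a non-zero exit code if it cannot
if command -v efetch >/dev/null 2>&1; then
    status="efetch found at $(command -v efetch)"
else
    status="efetch not found; try 'conda activate bioinfo' first"
fi

echo "$status"
```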
A required parameter for EFetch is "db", the database from which to retrieve records. The "nuccore" database is the "nucleotide" database.
Your command line so far...
efetch -db nuccore

Each database entry has a UID, or "unique identifier" (-id). This is the second parameter that must be specified for the EFetch command. To request several records at once, give the identifiers as a comma-separated list with no spaces.
-id NC_001501
-id NC_001501,NC_002549,NC_045512
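When the list of identifiers gets long, typing the commas by hand becomes error-prone. As a sketch, assuming the accessions are kept one per line in a text file (the file name accessions.txt is hypothetical), the standard paste command can build the comma-separated value for -id:

```shell
# One accession per line; accessions.txt is a made-up file name
printf '%s\n' NC_001501 NC_002549 NC_045512 > accessions.txt

# paste -s joins all lines into a single line,
# -d, uses a comma as the separator between them
ids=$(paste -sd, accessions.txt)

# Show the parameter as it would appear on the efetch command line
echo "-id $ids"
```

The variable could then be used as -id "$ids" in place of a hand-typed list.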
Your command line so far...
efetch -db nuccore -id NC_001501
The "NC_" prefix marks NC_001501 as a RefSeq accession. RefSeq is the NCBI Reference Sequence database, "a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein".
Moving on... what does "-format gb" mean? The -format parameter sets the format of the returned information. Here, we are interested in GenBank format.
See Table 1 of NBK25499 for a full list of allowed values for each database.
For GenBank format,
-format gb
efetch -db nuccore -id NC_001501 -format gb
efetch -db nuccore -id NC_001501 -format gb > NC_001501.gb
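So let's look at the last part of the command line, the output. The ">" sends what efetch would otherwise print to the screen into a file, and the file name is up to you.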
> NC_001501.gb
> file_whatever_you_want
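The redirection step is a feature of the shell rather than of efetch, so it can be tried safely with echo before fetching anything. This small sketch (demo.txt is a throwaway file name) shows that ">" replaces the contents of an existing file, while ">>" adds to the end of it:

```shell
echo "first"  > demo.txt    # creates demo.txt containing "first"
echo "second" > demo.txt    # overwrites: "first" is gone
echo "third" >> demo.txt    # appends a new line after "second"

# Show what is in the file now
cat demo.txt
```

The practical consequence is that rerunning an efetch command with ">" and the same file name silently replaces the earlier download.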
Your finished command line, followed by a variant that saves the same record in FASTA format.
efetch -db nuccore -id NC_001501 -format gb > NC_001501.gb
efetch -db nuccore -id NC_001501 -format fasta > NC_001501.fa
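The two commands above differ only in the -format value and the file extension, so they can be generated in a loop. The sketch below is a dry run: it only prints the efetch commands rather than contacting NCBI, and the variable names acc, fmt, ext and out are illustrative. The .fa extension for FASTA follows the file name used above.

```shell
acc=NC_001501

for fmt in gb fasta; do
    # Pick a file extension: .fa for FASTA, otherwise the format name itself
    case "$fmt" in
        fasta) ext=fa ;;
        *)     ext=$fmt ;;
    esac
    out="$acc.$ext"

    # Dry run: print the command instead of running it
    echo "efetch -db nuccore -id $acc -format $fmt > $out"
done
```

Once the printed commands look right, they can be copied to the command line and run one at a time.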