14. Sequence k-mers

This page uses content directly from the Biostar Handbook by Istvan Albert.

Remember to activate the bioinformatics environment.

conda activate bioinfo

The jellyfish program is built from source code, so it depends on a compiler called "gcc". First, check whether you have "gcc" installed. At the command line, type:

gcc --help | less
If you don't have gcc installed, try this (the "port" command comes from MacPorts):
sudo port install gcc9
sudo port select --set gcc mp-gcc9
Check that gcc installed properly by looking at its help documentation.
gcc --help
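
As a quicker check than reading the help text, the shell can report whether a program is on your PATH. This is only a minimal sketch: "have_tool" is a hypothetical helper name introduced here for illustration, not part of gcc or jellyfish.

```bash
# have_tool is a hypothetical helper name used only in this sketch.
# It succeeds (exit status 0) when the named program is found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool gcc; then
    echo "gcc found at: $(command -v gcc)"
else
    echo "gcc not found; install it before building jellyfish"
fi
```

The same helper can be reused later to check for jellyfish itself, for example with "have_tool jellyfish".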

Next, we will download and install the jellyfish program for counting k-mers.

Go to Jellyfish

Click on the file jellyfish-2.3.0.tar.gz

The file will automatically be put in your downloads folder.

Now, move the jellyfish-2.3.0.tar.gz file to your biostar_class folder. You can use the "mv" command.

First, go to your Downloads folder, where "username" is your username.

cd /Users/username/Downloads

mv jellyfish-2.3.0.tar.gz /Users/username/biostar_class

Next, go to your biostar_class directory and decompress the file:

tar xzvf jellyfish-2.3.0.tar.gz
You'll need to cd into the "jellyfish-2.3.0" directory, and then run each of these commands at the command line, one-by-one.
./configure
make
make install

Test the jellyfish installation by typing this at the command line.

jellyfish --help

If you get the following error message while installing "jellyfish" on a Mac (see also this post):

Install: /usr/local/lib/libjellyfish-2.0.2.dylib: Permission denied

From the terminal run the following commands:

sudo chown -R $(whoami) /usr/local/bin
sudo chown -R $(whoami) /usr/local/include
sudo chown -R $(whoami) /usr/local/lib
sudo chown -R $(whoami) /usr/local/share

You may have to enter your password each time.

Then navigate to your jellyfish directory and repeat the following steps to install the program:

./configure
make
make install
Did it work? Do you see the help documentation?
jellyfish --help
To install the jellyfish software on your PC, try this:
./configure --prefix=/usr/local
make
make install
Also, see help documentation here

A k-mer is a string of nucleotides of length "k". The k-mers of a sequence are all the substrings of that length contained in it.

(from the Biostar Handbook) For example if the sequence is ATGCA then

  • The 2 base long k-mers (2-mers) are AT, TG, GC and CA
  • The 3 base long k-mers (3-mers) are ATG, TGC and GCA
  • The 4 base long k-mers (4-mers) are ATGC, TGCA
  • The 5 base long k-mer (5-mer) is ATGCA
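
The list above can be reproduced with a few lines of shell. This is a minimal sketch for illustration only: "kmers" is a hypothetical helper name, and it treats the sequence as a plain string given on the command line rather than reading a FASTA file.

```bash
# kmers is a hypothetical helper name used only in this sketch.
# Usage: kmers SEQUENCE K
# Prints every substring of length K, one per line, sliding one base at a time.
kmers() {
    local seq=$1 k=$2 i
    for (( i = 0; i + k <= ${#seq}; i++ )); do
        echo "${seq:i:k}"
    done
}

# Prints ATG, TGC and GCA, one per line.
kmers ATGCA 3
```

A sequence of length L contains L - k + 1 k-mers, which is why ATGCA (length 5) has four 2-mers but only one 5-mer.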

K-mers are useful in several ways:

  • rare k-mers may indicate sequencing errors
  • some k-mers can be used to identify specific genomes
  • alignment programs can use k-mers to map reads to locations

Because computing k-mers is faster than running an alignment, data interpretation can go much faster.

To use the jellyfish k-mer counter, first download some sequence data.

efetch -id KU182908 -db nucleotide -format fasta > KU182908.fa
Use jellyfish to count k-mers.
jellyfish count -C -m 10 -s10M KU182908.fa
The parts of this command are:

  • "count" counts the k-mers in the input file
  • "-C" counts canonical k-mers: a k-mer and its reverse complement are counted together and saved in the hash under whichever of the two comes first lexicographically
  • "-m 10" sets the length of the k-mer to 10
  • "-s10M" sets the initial hash size to 10 million elements
  • output is written to mer_counts.jf by default (or write to a different file name with the "-o" switch)
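
To make the idea of a canonical k-mer concrete, here is a minimal sketch in shell. It assumes the canonical form is the lexicographically smaller of a k-mer and its reverse complement, as the jellyfish documentation describes it. "revcomp" and "canonical" are hypothetical helper names, and only uppercase A, C, G and T are handled.

```bash
# revcomp and canonical are hypothetical helper names used only in this sketch.

# Reverse the sequence one character at a time, then complement each base.
revcomp() {
    local seq=$1 out="" i
    for (( i = ${#seq} - 1; i >= 0; i-- )); do
        out+=${seq:i:1}
    done
    echo "$out" | tr ACGT TGCA
}

# Print whichever of the k-mer and its reverse complement sorts first.
canonical() {
    local fwd=$1 rc
    rc=$(revcomp "$fwd")
    if [[ "$fwd" < "$rc" ]]; then
        echo "$fwd"
    else
        echo "$rc"
    fi
}

revcomp ATGCA     # prints TGCAT
canonical TGCAT   # prints ATGCA
```

Both ATGCA and TGCAT map to the same canonical form, so with "-C" they contribute to a single count.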

To create the histogram of k-mer counts (how many distinct k-mers occur once, twice, and so on):

jellyfish histo mer_counts.jf
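
The histogram has two columns: a count, and the number of distinct k-mers that occur that many times. The sketch below builds the same kind of table for a toy sequence with standard shell tools. It is an illustration under stated assumptions, not how jellyfish works internally: "kmers" and "kmer_histo" are hypothetical helper names, the toy sequence is made up, and k-mers are counted on the forward strand only (no "-C" behaviour).

```bash
# kmers and kmer_histo are hypothetical helper names used only in this sketch.

# Print every k-mer of a sequence, one per line.
kmers() {
    local seq=$1 k=$2 i
    for (( i = 0; i + k <= ${#seq}; i++ )); do
        echo "${seq:i:k}"
    done
}

# Steps: count each distinct k-mer (sort | uniq -c), keep only the counts,
# then count how many k-mers share each count and print "count number_of_kmers".
kmer_histo() {
    kmers "$1" "$2" |
        sort | uniq -c |
        awk '{ print $1 }' |
        sort -n | uniq -c |
        awk '{ print $2, $1 }'
}

# The 2-mers of AAAAACG are AA (4 times), AC (once) and CG (once),
# so this prints "1 2" and then "4 1".
kmer_histo AAAAACG 2
```

Read the output as "two distinct k-mers occur once, and one distinct k-mer occurs four times".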

If you want to see those k-mers that appear at least 7 times:

jellyfish dump -L 7 mer_counts.jf

The "dump" subcommand outputs a list of all the k-mers along with their counts, and "-L 7" keeps only those with a count of at least 7. Pick one of the k-mers in the list and check how many times it is present in the sequence.

cat KU182908.fa | dreg -filter -pattern TTAAGAAAAA
Remember, "dreg" is an EMBOSS tool that searches one or more sequences with the supplied regular expression and writes a report file with the matches.
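
If EMBOSS is not available, the same check can be sketched in plain shell. This is a minimal illustration: "count_overlapping" is a hypothetical helper name, and it counts matches on the forward strand only. Because the jellyfish counts above were made with "-C", the count for a k-mer may also include occurrences of its reverse complement, so a forward-strand search can report fewer matches.

```bash
# count_overlapping is a hypothetical helper name used only in this sketch.
# Usage: count_overlapping SEQUENCE PATTERN
# Slides along the sequence one base at a time, so overlapping matches count.
count_overlapping() {
    local seq=$1 pat=$2 n=0 i
    for (( i = 0; i + ${#pat} <= ${#seq}; i++ )); do
        if [[ "${seq:i:${#pat}}" == "$pat" ]]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

count_overlapping ATGCATG ATG   # prints 2
count_overlapping AAAA AA       # prints 3 (matches overlap)
```

To apply it to the downloaded file, drop the FASTA header line and join the sequence lines first, for example with: count_overlapping "$(grep -v '^>' KU182908.fa | tr -d '\n')" TTAAGAAAAA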