14. Sequence k-mers
This page uses content directly from the Biostar Handbook by Istvan Albert.
Remember to activate the bioinformatics environment.
conda activate bioinfo
The jellyfish program depends on "gcc", which is used to compile its code. First, check whether you have "gcc" installed. At the command line, type:
gcc --help | less
If "gcc" is not installed, you can install it on a Mac with MacPorts:
sudo port install gcc9
sudo port select --set gcc mp-gcc9
Then confirm that it now runs:
gcc --help
Next, we will download and install the jellyfish program for counting k-mers.
Go to Jellyfish
Click on the file jellyfish-2.3.0.tar.gz
The file will automatically be put in your downloads folder.
Now, move jellyfish-2.3.0.tar.gz to your biostar_class folder using the "mv" command.
First, go to your Downloads folder (replace "username" with your own username), then move the file:
cd /Users/username/Downloads
mv jellyfish-2.3.0.tar.gz /Users/username/biostar_class
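Next, go to your biostar_class directory and decompress the file:
cd /Users/username/biostar_class
tar xzvf jellyfish-2.3.0.tar.gz
Then move into the newly created jellyfish-2.3.0 directory, and configure, compile, and install the program:
cd jellyfish-2.3.0
./configure
make
make install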
Test the jellyfish installation by typing this at the command line.
jellyfish --help
If you get the following error message while installing "jellyfish" on a Mac (see also this post):
Install: /usr/local/lib/libjellyfish-2.0.2.dylib: Permission denied
Run the following commands from the terminal to take ownership of the installation directories:
sudo chown -R $(whoami) /usr/local/bin
sudo chown -R $(whoami) /usr/local/include
sudo chown -R $(whoami) /usr/local/lib
sudo chown -R $(whoami) /usr/local/share
You may have to enter your password each time.
Then navigate to your jellyfish directory and repeat the installation steps:
./configure
make
make install
jellyfish --help
Alternatively, you can set the installation prefix explicitly when configuring:
./configure --prefix=/usr/local
make
make install
A k-mer is a string of nucleotides of length "k". The k-mers of a sequence are all the substrings of that length contained in it.
For example (from the Biostar Handbook), if the sequence is ATGCA, then:
- The 2 base long k-mers (2-mers) are AT, TG, GC and CA
- The 3 base long k-mers (3-mers) are ATG, TGC and GCA
- The 4 base long k-mers (4-mers) are ATGC and TGCA
- The 5 base long k-mer (5-mer) is ATGCA
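The list above can be reproduced with a small sliding-window sketch. The function name "kmers" is a hypothetical helper used only for illustration; it is not part of jellyfish and assumes the sequence is passed as a single string.

```shell
# Hypothetical helper: print every k-mer of a sequence, one per line,
# by sliding a window of width k along the string.
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# The 3-mers of ATGCA: prints ATG, TGC and GCA on separate lines.
kmers ATGCA 3
```

A sequence of length n has n - k + 1 k-mers, which is why ATGCA has four 2-mers but only one 5-mer.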
K-mers are useful in several ways:
- rare k-mers may indicate sequencing errors
- some k-mers can be used to identify specific genomes
- alignment programs can use k-mers to map reads to locations
Because computing k-mers is faster than running an alignment, data interpretation can go much faster.
To use the jellyfish k-mer counter, first download some sequence data.
efetch -id KU182908 -db nucleotide -format fasta > KU182908.fa
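Next, count the k-mers. Here "-m 10" sets the k-mer length to 10, "-s10M" sets the initial hash size, and "-C" counts canonical k-mers, so a k-mer and its reverse complement are counted together. The result is written to mer_counts.jf.
jellyfish count -C -m 10 -s10M KU182908.fa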
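To make the "-C" option more concrete, here is a minimal sketch of what a canonical k-mer is: the lexicographically smaller of a k-mer and its reverse complement. The function names "revcomp" and "canonical" are hypothetical helpers for illustration, not jellyfish commands, and the sketch assumes uppercase A, C, G, T only.

```shell
# Hypothetical helper: reverse complement of an uppercase DNA string.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA' | awk '{
        out = ""
        for (i = length($0); i >= 1; i--)
            out = out substr($0, i, 1)
        print out
    }'
}

# Hypothetical helper: the smaller (in byte order) of a k-mer
# and its reverse complement.
canonical() {
    printf '%s\n%s\n' "$1" "$(revcomp "$1")" | LC_ALL=C sort | head -n 1
}

revcomp ATGCA      # prints TGCAT
canonical TGCAT    # prints ATGCA
```

Under this scheme ATGCA and TGCAT contribute to the same counter, which is usually what you want when the strand of the sequence is unknown.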
To create a histogram of k-mer counts (how many distinct k-mers occur once, twice, and so on):
jellyfish histo mer_counts.jf
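The idea behind the histogram can be sketched on a toy sequence with standard tools. This is an illustration only, not how jellyfish is implemented: "kmers" and "kmer_histogram" are hypothetical helper names, and for simplicity the sketch does not merge reverse complements the way "-C" does.

```shell
# Hypothetical helper: print every k-mer of a sequence, one per line.
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# Hypothetical helper: for each occurrence count, print how many
# distinct k-mers have that count, as "count number_of_kmers".
kmer_histogram() {
    kmers "$1" "$2" |
        LC_ALL=C sort | uniq -c |      # occurrences of each distinct k-mer
        awk '{ print $1 }' |           # keep only the occurrence counts
        sort -n | uniq -c |            # how many k-mers share each count
        awk '{ print $2, $1 }'
}

# ATATAT contains AT three times and TA twice,
# so this prints "2 1" and "3 1".
kmer_histogram ATATAT 2
```

Each output line pairs an occurrence count with the number of distinct k-mers seen that many times, which is the same two-column layout "jellyfish histo" prints.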
If you want to see those k-mers that appear at least 7 times:
jellyfish dump -L 7 mer_counts.jf
Here, the "dump" subcommand outputs a list of the k-mers along with their counts, and "-L 7" sets the lower count limit to 7. Pick one of the k-mers in the list, for example one with a count of 7, and check whether it is present 7 times:
cat KU182908.fa | dreg -filter -pattern TTAAGAAAAA
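If you would like to cross-check the count without EMBOSS, the sketch below counts overlapping occurrences of a pattern in a FASTA file using awk. The function name "count_kmer" is a hypothetical helper, and the sketch assumes a single-record FASTA file (such as KU182908.fa) with plain A, C, G, T sequence lines.

```shell
# Hypothetical helper: count (possibly overlapping) occurrences of a
# pattern in the sequence of a single-record FASTA file.
count_kmer() {
    awk -v pat="$1" '
        !/^>/ { seq = seq toupper($0) }    # join sequence lines, skip the header
        END {
            n = 0
            for (i = 1; i + length(pat) - 1 <= length(seq); i++)
                if (substr(seq, i, length(pat)) == pat)
                    n++
            print n
        }' "$2"
}

# Small demonstration on a temporary file; the k-mer spans a line break.
demo=$(mktemp)
printf '>demo\nATATAT\nAT\n' > "$demo"
count_kmer AT "$demo"    # prints 4
rm -f "$demo"
```

Because the counts were made with "-C", the number reported by jellyfish includes occurrences of the reverse complement. For TTAAGAAAAA the reverse complement is TTTTTCTTAA, so the two results of "count_kmer TTAAGAAAAA KU182908.fa" and "count_kmer TTTTTCTTAA KU182908.fa" should be added together before comparing with the jellyfish count.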