Lesson 6: Using Unix commands to work with text files and tabular data
Learning objectives
After this lesson, participants will
- Know how to download data from the web
- Work with tape archives (tar files)
- Be able to create text files and tabular data
- Be able to view and page through file content
- Know how to use Unix commands to perform basic wrangling tasks such as
- Pattern searching
- Substitution
- Subsetting
- Sorting
To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. Remember you need to be connected to the NIH network either by being on campus or through VPN. Recall from lesson 1 that you use the ssh
command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). Remember that when prompted to enter your password, you are not going to be able to see it, but keep typing.
ssh username@biowulf.nih.gov
Change into the data directory when connected.
cd /data/username
Getting example data
Copy the example_rna_sequencing folder in /data/classes/BTEP
to the data directory and then change into it.
cp -r /data/classes/BTEP/example_rna_sequencing .
cd example_rna_sequencing
See what is in the example_rna_sequencing folder.
ls -l
There are two folders, hbr_uhr_fastq and hcc1395_fastq that contains sequencing data for the human brain and universal human reference (HBR/UHR) and the HCC1395 studies as documented in the RNA sequencing tutorial provided by the Griffith lab. The csv or comma separate files contain gene expression counts and differential gene expression results from these two studies.
-rwxr-x--- 1 wuz8 wuz8 31666 Jan 19 12:04 hbr_uhr_counts.csv
-rwxr-x--- 1 wuz8 wuz8 104473 Jan 19 12:04 hbr_uhr_deg_results.csv
drwxr-x--- 2 wuz8 wuz8 4096 Jan 19 12:04 hbr_uhr_fastq
drwxr-x--- 2 wuz8 wuz8 4096 Jan 19 12:04 hcc1359_fastq
-rwxr-x--- 1 wuz8 wuz8 33230 Jan 19 12:04 hcc1395_counts.csv
-rwxr-x--- 1 wuz8 wuz8 106687 Jan 19 12:04 hcc1395_deg_results.csv
Downloading data from the web
Go back to the data directory, which is one folder up.
cd ../
Suppose you need to download the fastq files for the HCC1395 study from the Griffith lab RNA sequencing tutorial page using this URL http://genomedata.org/rnaseq-tutorial/practical.tar, there are two Unix commands to this. These are wget
and curl
. Note in the URL for HCC1395 data that the file name is practical.tar.
Make a folder called hcc1395_fastq_download.
mkdir hcc1395_fastq_download
cd hcc1395_fastq_download
To download using wget
just provide the URL to the data as an argument. Include -O
specify a file name. The name practical.tar is not informative, so change download it as hcc1395_fastq.tar. Use wget
to download for this class.
wget -O hcc1395_fastq.tar http://genomedata.org/rnaseq-tutorial/practical.tar
To download using curl
, provide the data's URL and name of the output using -o
or '-O'. The -o
option enables users to supply a name other than is in the data's URL (ie. practical.tar). The -O
option downloads the data and names it using the that shown in the data's URL.
curl -o hcc1395_fastq.tar http://genomedata.org/rnaseq-tutorial/practical.tar
curl -O http://genomedata.org/rnaseq-tutorial/practical.tar
After download
ls
hcc1395_fastq.tar
Tape archive or tar files
The wget
command above downloaded the HCC1395 fastq files in the form of a tape archive or tar file. Tape archives are essentially packages or collections of files and folders and enable easy transfer and sharing.
To get the HCC1395 fastq files, this file needs to be unpacked using the tar
command with the options below.
-x
: extract files from an archive-v
: verbosely list files processed-f
: use archive file or device ARCHIVE
tar -xvf hcc1395_fastq.tar
The contents that are unpacked are displayed because the -v
option was included.
hcc1395_normal_rep1_r1.fastq.gz
hcc1395_normal_rep1_r2.fastq.gz
hcc1395_normal_rep2_r1.fastq.gz
hcc1395_normal_rep2_r2.fastq.gz
hcc1395_normal_rep3_r1.fastq.gz
hcc1395_normal_rep3_r2.fastq.gz
hcc1395_tumor_rep1_r1.fastq.gz
hcc1395_tumor_rep1_r2.fastq.gz
hcc1395_tumor_rep2_r1.fastq.gz
hcc1395_tumor_rep2_r2.fastq.gz
hcc1395_tumor_rep3_r1.fastq.gz
hcc1395_tumor_rep3_r2.fastq.gz
Go back to the date directory and make a folder called hbr_uhr_fastq. Download the fastq files for HBR/UHR from http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar and name the tar file hbr_uhr_fastq.tar. Finish the task by unpacking.
cd ../
mkdir hbr_uhr_fastq_download
cd hbr_uhr_fastq_download
Solution
curl -o hbr_uhr_fastq.tar http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar
tar -xvf hbr_uhr_fastq.tar
Viewing and working with fastq files using Unix commands
Change back to /data/username/hcc1395_fastq_download
for this exercise.
cd /data/username/hcc1395_fastq_download
The fastq files were compressed to save on storage space as evident by the extension "gz", which stands for gzip. Usually, cat
can be used to view fastq files in addition to text files and tabular data. However, it will not work with compressed files. Fortunately, there is a work around for this without having to uncompress the files. Hit control-c or control-z to exit zcat
and return to the command prompt.
zcat hcc1395_normal_rep1_r1.fastq.gz
While zcat
will print all of the sequences for a fastq file to the terminal, it is not a convenient way to view large files. The |
or pipe can be used to send output of one command to another. Here, it is used to send the output of zcat
to less
, which allows users to scroll through file content line by line using the up and down arrows or page by page using the space bar. Hit q
to exit less
and return to the prompt.
zcat hcc1395_normal_rep1_r1.fastq.gz | less
Alternatively, the head
command can be used to look at the first couple of lines (defaults to first 10 lines).
zcat hcc1395_normal_rep1_r1.fastq.gz | head
Including the -n
option in head enables users specify the number of lines to print. Because sequences in a fastq file come in four lines (ie. header, sequence, "+", quality score), -n 4
could be used to view the first sequnence in a fastq file.
zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 4
@K00193:38:H3MYFBBXX:4:1101:10003:44458/1
TTCCTTATGAAACAGGAAGAGTCCCTGGGCCCAGGCCTGGCCCACGGTTGTCAAGGCACATCATTGCCAGCAAGCTGAAGCATACCAGCAGCCACAACCTAGATCTCATTCCCAACCCAAAGTTCTGACTTCTGTACAAACTCGTTTCCAG
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKFKKFKKKKF<AAKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKFKAFKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFKFFKKKKKKKKKKKKFKKKK
The tail
command is used to look at lines at the bottom of a file. Again, it has -n
option for users to specify how many lines.
Creating text files and tabular data
For this exercise, change back into the data directory.
cd /data/username
Make a directory called text_files_and_tabular_data.
mkdir text_files_and_tabular_data
cd text_files_and_tabular_data
The touch
command can be used to create empty files.
touch example.txt
If we do cat example.txt
, nothing will be printed to the terminal. To edit and add stuff to example.txt do the following. The -L
option prevents the addition of a new line at the end of the file.
nano -L example.txt
Add the following to example.txt. Hit control-x and then "y" to save and exit nano and return to the prompt.
DNA sequencing
RNA sequencing
ChIp sequencing
ATAC sequencing
Now, doing cat example.txt
will print the file contents to the terminal. To add a new line that says scRNA sequencing, do the follwing.
echo "scRNA sequencing" >> example.txt
cat example.txt
DNA sequencing
RNA sequencing
ChIp sequencing
ATAC sequencing
scRNA sequencing
Pattern searching
The command grep
can be used to search for patterns in a file. For instance, to get the sequencing modalities for RNA in example.txt use
grep RNA example.txt
RNA sequencing
scRNA sequencing
The grep
output can be saved using the redirect >
.
grep RNA example.txt > rna_sequencing.txt
cat rna_sequencing.txt
RNA sequencing
scRNA sequencing
Including the -o
option in grep
will print only the pattern.
grep -o ATAC example.txt
ATAC
Note
grep
is case senstive by defaul. Use the -i
option to ignore case.
The -v
option will return lines that do not contain the pattern.
grep -v RNA example.txt
DNA sequencing
ChIp sequencing
ATAC sequencing
Deleting and adding lines
The sed
command can be used to perform various transformations on files. For instance, to delete the first line corresponding to DNA sequencing in example.txt use the following where the option d
denotes delete and 1
is added to indicate delete line 1. To delete the second line, replace 1
with 2
. Note the output of the sed
command below was not saved but only printed to the terminal.
sed '1d' example.txt
RNA sequencing
ChIp sequencing
ATAC sequencing
scRNA sequencing
To insert a line with sed
use the i
option followed by the text to insert. To code below inserts "Spatial transcriptomics" into the first line of example.txt.
sed '1i Spatial transcriptomics' example.txt
Pattern subsitution
The sed
command can also be used for substitution. For instance, to change all of the instance of sequencing to "seq" in example.txt the following can be used. The pattern, sequencing is given first, followed by the substitute (ie. seq). These are separated by "/".
sed 's/sequencing/seq/' example.txt
DNA seq
RNA seq
ChIp seq
ATAC seq
scRNA seq
Working with tabular data
Data analysis requires investigators to work with tabular data either in comma separated (csv) or tab separated format. Columns in a comma separated file are separated by commas. Those in tab separated files are separated by tabs.
Change into the /data/username/example_rna_sequencing folder.
cd /data/username/example_rna_sequencing
Then take look at the first few lines of hbr_uhr_counts.csv, which contains gene expression counts for the HBR/UHR study.
cat hbr_uhr_counts.csv | head
The columns Geneid, HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam are separated by commas. The way cat
displays the file content is not nice.
Geneid,HBR_1,HBR_2,HBR_3,UHR_1,UHR_2,UHR_3
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0
Sending the output of head
to column
allows the columns in the file to be printed nicely aligned. In the column
command, -t
tells is to create a table, and -s
is used specify the column separator (ie. comma).
head hbr_uhr_counts.csv | column -t -s ','
Geneid HBR_1 HBR_2 HBR_3 UHR_1 UHR_2 UHR_3
U2 0 0 0 0 0 0
CU459211.1 0 0 0 0 0 0
CU104787.1 0 0 0 0 0 0
BAGE5 0 0 0 0 0 0
ACTR3BP6 0 0 0 0 0 0
5_8S_rRNA 0 0 0 0 0 0
AC137488.1 0 0 0 0 0 0
AC137488.2 0 0 0 0 0 0
CU013544.1 0 0 0 0 0 0
Subset tabular data by column
The cut
command can be used to subset tabular data by column. To do this, specify the column number using -f
option and the column separator using the -d
option. For instance, to subset out the Geneid, HBR_1, and UHR_1 columns which corresponds to columns 1, 2, and 5 the following can be used. The column numbers will be separated by commas. The column separator -d
will be set to ',' for comma.
cut -f1,2,5 -d ',' hbr_uhr_counts.csv | head
Geneid,HBR_1,UHR_1
U2,0,0
CU459211.1,0,0
CU104787.1,0,0
BAGE5,0,0
ACTR3BP6,0,0
5_8S_rRNA,0,0
AC137488.1,0,0
AC137488.2,0,0
CU013544.1,0,0
To subset a range of columns, for instance 1 thru 4 in hbr_uhr_counts.csv, set -f
to 1-4.
cut -f1-4 -d ',' hbr_uhr_counts.csv | head
Geneid,HBR_1,HBR_2,HBR_3
U2,0,0,0
CU459211.1,0,0,0
CU104787.1,0,0,0
BAGE5,0,0,0
ACTR3BP6,0,0,0
5_8S_rRNA,0,0,0
AC137488.1,0,0,0
AC137488.2,0,0,0
CU013544.1,0,0,0
Subset tabular data by row
To subset the counts for gene RABL2B in hbr_uhr_counts.csv, the awk
command can be used. awk
is one of the many Unix commands that are suitable for data processing. In the construct below
-F
is theawk
option to specify the field separator (a comma in this clase)- The commands within
awk
are enclosed in ''FNR==1
tellsawk
to print the first row, which is the column heading||
is the OR operator$1
is column 1 (the Geneid) and this is set using==
to the gene RABL2B- Essentially, this command tells
awk
to print the first row or the row where the Gene ID matches RABL2B in hbr_uhr_counts.csv
awk -F, 'FNR==1 || $1=="RABL2B"' hbr_uhr_counts.csv
Geneid,HBR_1,HBR_2,HBR_3,UHR_1,UHR_2,UHR_3
RABL2B,74,62,54,68,50,47
Sorting
The final exercise in this lesson is to sort. To do this create text file called sorting_example.txt in the data directory.
cd /data/username
nano -L sorting_example.csv
Then enter the following. Hit control-x and "y" to save to exit nano and return to the prompt.
HIF1a,35
EPAS1,40
ADA,25
AMPD3,15
ADSS,20
ADSL,50
HPRT,75
To sort column 2 of this file from the lowest value to highest value the following can be used. sort
is the Unix command for sorting.
-t
prompts for the column separator (comma in this case)-k
prompts for the column number to sort on (column 2, so the construct is-k2
)-n
indicates to sort numerically
sort -t ',' -k2 -n sorting_example.csv
AMPD3,15
ADSS,20
ADA,25
HIF1a,35
EPAS1,40
ADSL,50
HPRT,75
The -r
option will sort in reverse.
sort -t ',' -k2 -n -r sorting_example.csv
HPRT,75
ADSL,50
EPAS1,40
HIF1a,35
ADA,25
ADSS,20
AMPD3,15