Lesson 6: Using Unix commands to work with text files and tabular data

Learning objectives

After this lesson, participants will

Know how to download data from the web
Work with tape archives (tar files)
Be able to create text files and tabular data
Be able to view and page through file content
Know how to use Unix commands to perform basic wrangling tasks such as
- Pattern searching
- Substitution
- Subsetting
- Sorting

To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. Remember you need to be connected to the NIH network either by being on campus or through VPN. Recall from lesson 1 that you use the ssh command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). Remember that when prompted to enter your password, you are not going to be able to see it, but keep typing.

ssh username@biowulf.nih.gov

Change into the data directory when connected.

cd /data/username

Getting example data

Copy the example_rna_sequencing folder in /data/classes/BTEP to the data directory and then change into it.

cp -r /data/classes/BTEP/example_rna_sequencing .

cd example_rna_sequencing

See what is in the example_rna_sequencing folder.

ls -l

There are two folders, hbr_uhr_fastq and hcc1395_fastq that contains sequencing data for the human brain and universal human reference (HBR/UHR) and the HCC1395 studies as documented in the RNA sequencing tutorial provided by the Griffith lab. The csv or comma separate files contain gene expression counts and differential gene expression results from these two studies.

-rwxr-x--- 1 wuz8 wuz8  31666 Jan 19 12:04 hbr_uhr_counts.csv
-rwxr-x--- 1 wuz8 wuz8 104473 Jan 19 12:04 hbr_uhr_deg_results.csv
drwxr-x--- 2 wuz8 wuz8   4096 Jan 19 12:04 hbr_uhr_fastq
drwxr-x--- 2 wuz8 wuz8   4096 Jan 19 12:04 hcc1359_fastq
-rwxr-x--- 1 wuz8 wuz8  33230 Jan 19 12:04 hcc1395_counts.csv
-rwxr-x--- 1 wuz8 wuz8 106687 Jan 19 12:04 hcc1395_deg_results.csv

Downloading data from the web

Go back to the data directory, which is one folder up.

cd ../

Suppose you need to download the fastq files for the HCC1395 study from the Griffith lab RNA sequencing tutorial page using this URL http://genomedata.org/rnaseq-tutorial/practical.tar, there are two Unix commands to this. These are wget and curl. Note in the URL for HCC1395 data that the file name is practical.tar.

Make a folder called hcc1395_fastq_download.

mkdir hcc1395_fastq_download

cd hcc1395_fastq_download

To download using wget just provide the URL to the data as an argument. Include -O specify a file name. The name practical.tar is not informative, so change download it as hcc1395_fastq.tar. Use wget to download for this class.

wget -O hcc1395_fastq.tar http://genomedata.org/rnaseq-tutorial/practical.tar

To download using curl, provide the data's URL and name of the output using -o or '-O'. The -o option enables users to supply a name other than is in the data's URL (ie. practical.tar). The -O option downloads the data and names it using the that shown in the data's URL.

curl -o hcc1395_fastq.tar http://genomedata.org/rnaseq-tutorial/practical.tar

curl -O http://genomedata.org/rnaseq-tutorial/practical.tar

After download

ls

hcc1395_fastq.tar

Tape archive or tar files

The wget command above downloaded the HCC1395 fastq files in the form of a tape archive or tar file. Tape archives are essentially packages or collections of files and folders and enable easy transfer and sharing.

To get the HCC1395 fastq files, this file needs to be unpacked using the tar command with the options below.

-x: extract files from an archive
-v: verbosely list files processed
-f: use archive file or device ARCHIVE

tar -xvf hcc1395_fastq.tar

The contents that are unpacked are displayed because the -v option was included.

hcc1395_normal_rep1_r1.fastq.gz
hcc1395_normal_rep1_r2.fastq.gz
hcc1395_normal_rep2_r1.fastq.gz
hcc1395_normal_rep2_r2.fastq.gz
hcc1395_normal_rep3_r1.fastq.gz
hcc1395_normal_rep3_r2.fastq.gz
hcc1395_tumor_rep1_r1.fastq.gz
hcc1395_tumor_rep1_r2.fastq.gz
hcc1395_tumor_rep2_r1.fastq.gz
hcc1395_tumor_rep2_r2.fastq.gz
hcc1395_tumor_rep3_r1.fastq.gz
hcc1395_tumor_rep3_r2.fastq.gz

Go back to the date directory and make a folder called hbr_uhr_fastq. Download the fastq files for HBR/UHR from http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar and name the tar file hbr_uhr_fastq.tar. Finish the task by unpacking.

cd ../

mkdir hbr_uhr_fastq_download

cd hbr_uhr_fastq_download

Solution

curl -o hbr_uhr_fastq.tar http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar

tar -xvf hbr_uhr_fastq.tar

Viewing and working with fastq files using Unix commands

Change back to /data/username/hcc1395_fastq_download for this exercise.

cd /data/username/hcc1395_fastq_download

The fastq files were compressed to save on storage space as evident by the extension "gz", which stands for gzip. Usually, cat can be used to view fastq files in addition to text files and tabular data. However, it will not work with compressed files. Fortunately, there is a work around for this without having to uncompress the files. Hit control-c or control-z to exit zcat and return to the command prompt.

zcat hcc1395_normal_rep1_r1.fastq.gz

While zcat will print all of the sequences for a fastq file to the terminal, it is not a convenient way to view large files. The | or pipe can be used to send output of one command to another. Here, it is used to send the output of zcat to less, which allows users to scroll through file content line by line using the up and down arrows or page by page using the space bar. Hit q to exit less and return to the prompt.

zcat hcc1395_normal_rep1_r1.fastq.gz | less

Alternatively, the head command can be used to look at the first couple of lines (defaults to first 10 lines).

zcat hcc1395_normal_rep1_r1.fastq.gz | head

Including the -n option in head enables users specify the number of lines to print. Because sequences in a fastq file come in four lines (ie. header, sequence, "+", quality score), -n 4 could be used to view the first sequnence in a fastq file.

zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 4

@K00193:38:H3MYFBBXX:4:1101:10003:44458/1
TTCCTTATGAAACAGGAAGAGTCCCTGGGCCCAGGCCTGGCCCACGGTTGTCAAGGCACATCATTGCCAGCAAGCTGAAGCATACCAGCAGCCACAACCTAGATCTCATTCCCAACCCAAAGTTCTGACTTCTGTACAAACTCGTTTCCAG
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKFKKFKKKKF<AAKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKFKAFKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFKFFKKKKKKKKKKKKFKKKK

The tail command is used to look at lines at the bottom of a file. Again, it has -n option for users to specify how many lines.

Creating text files and tabular data

For this exercise, change back into the data directory.

cd /data/username

Make a directory called text_files_and_tabular_data.

mkdir text_files_and_tabular_data

cd text_files_and_tabular_data

The touch command can be used to create empty files.

touch example.txt

If we do cat example.txt, nothing will be printed to the terminal. To edit and add stuff to example.txt do the following. The -L option prevents the addition of a new line at the end of the file.

nano -L example.txt

Add the following to example.txt. Hit control-x and then "y" to save and exit nano and return to the prompt.

DNA sequencing
RNA sequencing
ChIp sequencing
ATAC sequencing

Now, doing cat example.txt will print the file contents to the terminal. To add a new line that says scRNA sequencing, do the follwing.

echo "scRNA sequencing" >> example.txt

cat example.txt

DNA sequencing
RNA sequencing
ChIp sequencing
ATAC sequencing
scRNA sequencing

Pattern searching

The command grep can be used to search for patterns in a file. For instance, to get the sequencing modalities for RNA in example.txt use

grep RNA example.txt

RNA sequencing
scRNA sequencing

The grep output can be saved using the redirect >.

grep RNA example.txt > rna_sequencing.txt

cat rna_sequencing.txt

RNA sequencing
scRNA sequencing

Including the -o option in grep will print only the pattern.

grep -o ATAC example.txt

ATAC

Note

grep is case senstive by defaul. Use the -i option to ignore case.

The -v option will return lines that do not contain the pattern.

grep -v RNA example.txt

DNA sequencing
ChIp sequencing
ATAC sequencing

Deleting and adding lines

The sed command can be used to perform various transformations on files. For instance, to delete the first line corresponding to DNA sequencing in example.txt use the following where the option d denotes delete and 1 is added to indicate delete line 1. To delete the second line, replace 1 with 2. Note the output of the sed command below was not saved but only printed to the terminal.

sed '1d' example.txt

RNA sequencing
ChIp sequencing
ATAC sequencing
scRNA sequencing

To insert a line with sed use the i option followed by the text to insert. To code below inserts "Spatial transcriptomics" into the first line of example.txt.

sed '1i Spatial transcriptomics' example.txt

Pattern subsitution

The sed command can also be used for substitution. For instance, to change all of the instance of sequencing to "seq" in example.txt the following can be used. The pattern, sequencing is given first, followed by the substitute (ie. seq). These are separated by "/".

sed 's/sequencing/seq/' example.txt

DNA seq
RNA seq
ChIp seq
ATAC seq
scRNA seq

Working with tabular data

Data analysis requires investigators to work with tabular data either in comma separated (csv) or tab separated format. Columns in a comma separated file are separated by commas. Those in tab separated files are separated by tabs.

Change into the /data/username/example_rna_sequencing folder.

cd /data/username/example_rna_sequencing

Then take look at the first few lines of hbr_uhr_counts.csv, which contains gene expression counts for the HBR/UHR study.

cat hbr_uhr_counts.csv | head

The columns Geneid, HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam are separated by commas. The way cat displays the file content is not nice.

Geneid,HBR_1,HBR_2,HBR_3,UHR_1,UHR_2,UHR_3
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0

Sending the output of head to column allows the columns in the file to be printed nicely aligned. In the column command, -t tells is to create a table, and -s is used specify the column separator (ie. comma).

head hbr_uhr_counts.csv | column -t -s ','

Geneid      HBR_1  HBR_2  HBR_3  UHR_1  UHR_2  UHR_3
U2          0      0      0      0      0      0
CU459211.1  0      0      0      0      0      0
CU104787.1  0      0      0      0      0      0
BAGE5       0      0      0      0      0      0
ACTR3BP6    0      0      0      0      0      0
5_8S_rRNA   0      0      0      0      0      0
AC137488.1  0      0      0      0      0      0
AC137488.2  0      0      0      0      0      0
CU013544.1  0      0      0      0      0      0

Subset tabular data by column

The cut command can be used to subset tabular data by column. To do this, specify the column number using -f option and the column separator using the -d option. For instance, to subset out the Geneid, HBR_1, and UHR_1 columns which corresponds to columns 1, 2, and 5 the following can be used. The column numbers will be separated by commas. The column separator -d will be set to ',' for comma.

cut -f1,2,5 -d ',' hbr_uhr_counts.csv | head

Geneid,HBR_1,UHR_1
U2,0,0
CU459211.1,0,0
CU104787.1,0,0
BAGE5,0,0
ACTR3BP6,0,0
5_8S_rRNA,0,0
AC137488.1,0,0
AC137488.2,0,0
CU013544.1,0,0

To subset a range of columns, for instance 1 thru 4 in hbr_uhr_counts.csv, set -f to 1-4.

cut -f1-4 -d ',' hbr_uhr_counts.csv | head

Geneid,HBR_1,HBR_2,HBR_3
U2,0,0,0
CU459211.1,0,0,0
CU104787.1,0,0,0
BAGE5,0,0,0
ACTR3BP6,0,0,0
5_8S_rRNA,0,0,0
AC137488.1,0,0,0
AC137488.2,0,0,0
CU013544.1,0,0,0

Subset tabular data by row

To subset the counts for gene RABL2B in hbr_uhr_counts.csv, the awk command can be used. awk is one of the many Unix commands that are suitable for data processing. In the construct below

-F is the awk option to specify the field separator (a comma in this clase)
The commands within awk are enclosed in ''
- FNR==1 tells awk to print the first row, which is the column heading
- || is the OR operator
- $1 is column 1 (the Geneid) and this is set using == to the gene RABL2B
- Essentially, this command tells awk to print the first row or the row where the Gene ID matches RABL2B in hbr_uhr_counts.csv

awk -F, 'FNR==1 || $1=="RABL2B"' hbr_uhr_counts.csv

Geneid,HBR_1,HBR_2,HBR_3,UHR_1,UHR_2,UHR_3
RABL2B,74,62,54,68,50,47

Sorting

The final exercise in this lesson is to sort. To do this create text file called sorting_example.txt in the data directory.

cd /data/username

nano -L sorting_example.csv

Then enter the following. Hit control-x and "y" to save to exit nano and return to the prompt.

HIF1a,35
EPAS1,40
ADA,25
AMPD3,15
ADSS,20
ADSL,50
HPRT,75

To sort column 2 of this file from the lowest value to highest value the following can be used. sort is the Unix command for sorting.

-t prompts for the column separator (comma in this case)
-k prompts for the column number to sort on (column 2, so the construct is -k2)
-n indicates to sort numerically

sort -t ',' -k2 -n sorting_example.csv

AMPD3,15
ADSS,20
ADA,25
HIF1a,35
EPAS1,40
ADSL,50
HPRT,75

The -r option will sort in reverse.

sort -t ',' -k2 -n -r sorting_example.csv

HPRT,75
ADSL,50
EPAS1,40
HIF1a,35
ADA,25
ADSS,20
AMPD3,15