This page uses content directly from the Biostars Handbook by Istvan Albert (https://www.biostarhandbook.com).
Always remember to load the bioinformatics environment.
conda activate bioinfo
SAM files
SAM format is a TAB-delimited, line-oriented, human-readable text format with two sections:
1. Header section - metadata, one record per line (header lines start with "@")
2. Alignment section - each line provides alignment information
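As a minimal sketch of the two sections, the commands below write a tiny, made-up SAM file and count its header and alignment lines. The file name example.sam, the read names, the sequences and the reference length are all hypothetical and only serve as an illustration.

```shell
# Write a tiny, hypothetical SAM file: two header lines and two alignment lines.
printf '@HD\tVN:1.6\tSO:coordinate\n' > example.sam
printf '@SQ\tSN:AF086833\tLN:18959\n' >> example.sam
printf 'read1\t0\tAF086833\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> example.sam
printf 'read2\t16\tAF086833\t200\t60\t10M\t*\t0\t0\tTTGACCGTAA\tIIIIIIIIII\n' >> example.sam

# Header lines start with "@"; everything else is an alignment record.
header_lines=$(grep -c '^@' example.sam)
alignment_lines=$(grep -vc '^@' example.sam)
echo "header: $header_lines alignments: $alignment_lines"

# Print read name, flag, reference name and position for each alignment.
awk -F'\t' '!/^@/ {print $1, $2, $3, $4}' example.sam
```

The second column of each alignment line is the SAM flag, which the filtering commands later on this page operate on.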
SAM format specification on Github
SAM files are used to store alignments in a standardized efficient format that allows quick access to the alignments based on coordinates.
Decoding SAM flags (Picard) - use this utility to identify the properties of a read based on its SAM flag value, or to find out which SAM flag value a given combination of properties produces.
BAM files
BAM files are a binary, compressed, machine-readable representation of the SAM format.
BAM files can be sorted by alignment coordinate (or by read name); a coordinate-sorted, indexed BAM file allows quick access to the alignments in a region.
BAM files are created from SAM files. You may be able to download them directly from some data sites or create them yourself.
Tools to manipulate BAM files include:
1. samtools
2. bamtools
3. picard
In BAM files, you may be looking for:
- alignments that match an attribute such as strand, mate or mapping quality
- alignments within a certain region of the genome
Creating SAM and BAM files.
#SAM files are created by alignment programs such as bowtie2 and bwa.
bwa mem reference_sequence sequence_1.fastq sequence_2.fastq > alignment.sam
#Convert SAM to sorted BAM with samtools.
samtools sort alignment.sam > alignment.bam
#Index the BAM file with samtools.
samtools index alignment.bam
How to extract a section of the BAM file?
Using data we downloaded previously:
bwa mem refs/AF086833.fa SRR1972739_1.fastq SRR1972739_2.fastq > SRR1972739.bwa.sam
samtools view -S -b SRR1972739.bwa.sam > SRR1972739.bwa.bam
samtools sort SRR1972739.bwa.bam -o sorted_SRR1972739.bwa.bam
samtools index sorted_SRR1972739.bwa.bam
samtools view -b sorted_SRR1972739.bwa.bam AF086833:3050-3199 > selected.bam
samtools index selected.bam
Load in IGV and view intervals.
Select from or filter data from BAM files
- Selecting means to keep alignments that match a condition.
- Filtering means to remove alignments that match a condition.
Filtering on flags can be done via samtools by passing the "-f" and "-F" parameters.
-f flag (include only alignments where all bits of the flag are set)
-F flag (exclude alignments where any bit of the flag is set)
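The bit logic behind the two parameters can be tried out with plain shell arithmetic, without samtools. This is only a sketch: keep_with_f and keep_with_F are hypothetical helper names, not samtools commands, and the two flag values are made-up examples.

```shell
# keep_with_f FLAG MASK: succeeds when every bit in MASK is set in FLAG (like -f).
keep_with_f() { [ $(( $1 & $2 )) -eq "$2" ]; }
# keep_with_F FLAG MASK: succeeds when no bit in MASK is set in FLAG (like -F).
keep_with_F() { [ $(( $1 & $2 )) -eq 0 ]; }

# Flag 83 = 1 + 2 + 16 + 64: paired, proper pair, reverse strand, first in pair.
# Flag 77 = 1 + 4 + 8 + 64: paired, unmapped, mate unmapped, first in pair.
for flag in 83 77; do
    if keep_with_f "$flag" 4; then echo "$flag: kept by -f 4"; fi
    if keep_with_F "$flag" 4; then echo "$flag: kept by -F 4"; fi
done
```

Every alignment is kept by exactly one of -f 4 and -F 4, which is why the two counts further down add up to the total number of alignments.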
samtools flags 4
#0x4 4 UNMAP
#therefore when flag 4 is set, the read is **unmapped/unaligned**
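To see how a flag value breaks down into individual bits, here is a rough shell imitation of the decoding that samtools flags performs. decode_flag is a hypothetical helper written for this page; the bit names follow the names samtools uses.

```shell
# decode_flag VALUE: print the names of the bits set in a SAM flag value.
decode_flag() {
    value=$1
    names=""
    for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP 16:REVERSE 32:MREVERSE \
                64:READ1 128:READ2 256:SECONDARY 512:QCFAIL 1024:DUP 2048:SUPPLEMENTARY; do
        bit=${pair%%:*}     # number before the colon
        name=${pair#*:}     # name after the colon
        if [ $(( value & bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    echo "$names"
}

decode_flag 4      # UNMAP
decode_flag 20     # UNMAP,REVERSE
decode_flag 2304   # SECONDARY,SUPPLEMENTARY
```

For real work the samtools flags command or the Picard page is the safer choice; the sketch only shows that a flag is a sum of powers of two.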
View alignments where read did not align. Then count them.
samtools view -f 4 SRR1972739.bwa.bam | head
samtools view -c -f 4 SRR1972739.bwa.bam
#5461
-c counts the alignments instead of printing them, and -f 4 selects alignments with the "read unmapped" (0x4) bit set, so -c -f 4 counts the unmapped (unaligned) reads.
Now we can reverse the flag (from -f to -F) to count the alignments that are mapped.
samtools view -c -F 4 SRR1972739.bwa.bam
#15279
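The two counts are complementary, so together they should account for every alignment in the file. A quick arithmetic check in the shell, using the numbers reported above (the variable names are only for illustration):

```shell
unmapped=5461   # from: samtools view -c -f 4
mapped=15279    # from: samtools view -c -F 4
total=$(( unmapped + mapped ))
echo "total alignments: $total"

# Percentage of alignments that are mapped, to two decimals.
mapped_pct=$(awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.2f", 100 * m / t }')
echo "mapped: $mapped_pct%"
```

The total of 20740 and the mapped percentage of 73.67% agree with the flagstat report for the same file.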
To select forward or reverse alignments.
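#filter out unmapped (4) and reverse (16) alignments: 4 + 16 = 20, which leaves mapped forward alignments
#the coordinate-sorted BAM is used as input so that the result can be indexed
samtools view -F 20 -b sorted_SRR1972739.bwa.bam > selected.bam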
samtools index selected.bam

To select reverse alignments.
samtools view -F 4 -f 16 -b sorted_SRR1972739.bwa.bam > reverse_selected.bam
samtools index reverse_selected.bam
To get an overview of alignments in a BAM file
samtools flagstat SRR1972739.bwa.bam
produces this
20740 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
740 + 0 supplementary
0 + 0 duplicates
15279 + 0 mapped (73.67% : N/A)
20000 + 0 paired in sequencing
10000 + 0 read1
10000 + 0 read2
14480 + 0 properly paired (72.40% : N/A)
14528 + 0 with itself and mate mapped
11 + 0 singletons (0.05% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
bamtools stats -in SRR1972739.bwa.bam
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 20740
Mapped reads: 15279 (73.6692%)
Forward strand: 14393 (69.3973%)
Reverse strand: 6347 (30.6027%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 20740 (100%)
'Proper-pairs': 15216 (73.3655%)
Both pairs mapped: 15268 (73.6162%)
Read 1: 10357
Read 2: 10383
Singletons: 11 (0.0530376%)
What is a proper-pair?
A proper (or concordant) pair is defined as "each segment properly aligned according to the aligner", meaning that the read pair aligns in an expected manner: the reads are oriented towards one another and the distance between their outer edges is within the expected range.
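The aligner records this decision in the "proper pair" bit (0x2) of the flag. As a small sketch, the loop below checks that bit for a handful of made-up flag values; the list of flags is hypothetical and not taken from the data set on this page.

```shell
# Count how many of these example flag values have the proper-pair bit (2) set.
proper=0
for flag in 99 147 83 163 73 133; do
    if [ $(( flag & 2 )) -ne 0 ]; then
        proper=$(( proper + 1 ))
    fi
done
echo "proper pairs: $proper of 6"

# With samtools, the corresponding count for a BAM file would be:
#   samtools view -c -f 2 SRR1972739.bwa.bam
```

Flags 99, 147, 83 and 163 are the four typical values for properly paired reads; 73 and 133 describe a pair in which one mate is unmapped.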
Types of Alignments
- Primary (representative) - represents the "best(?)" alignment.
- Secondary - an alternative alignment of a read that aligns to multiple locations in the genome. This is caused primarily by repeats.
- Supplementary, or chimeric alignment - an alignment where the read partially matches different regions of the genome without overlapping the same alignment.
Each read will have one primary alignment and may have additional secondary and supplementary alignments.
There is no flag for primary alignments, so to select them you must filter out the secondary and supplementary alignments.
Use "samtools flags" to find the flags for secondary and supplementary reads. Or check out "Decoding SAM flags" at https://broadinstitute.github.io/picard/explain-flags.html
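samtools flags SUPPLEMENTARY,SECONDARY
#0x900 2304 SECONDARY,SUPPLEMENTARY
#256 0x100 SECONDARY .. secondary alignment
#2048 0x800 SUPPLEMENTARY .. supplementary alignment
#-b writes BAM output (-c would only print a count); 4 + 2304 = 2308 also removes unmapped reads
samtools view -b -F 2308 SRR1972739.bwa.bam > output.bam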
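The bits to exclude can also be combined into one value and passed to a single -F. The sketch below derives that value and cross-checks the expected result against the flagstat numbers shown earlier, under the assumption that the "mapped" line of flagstat counts every mapped record, including supplementary ones.

```shell
# Combine the bits to exclude: unmapped (4), secondary (256), supplementary (2048).
exclude=$(( 4 | 256 | 2048 ))
echo "-F $exclude"

# Expected number of mapped primary alignments, from the flagstat report.
mapped=15279
secondary=0
supplementary=740
primary_mapped=$(( mapped - secondary - supplementary ))
echo "expected primary mapped alignments: $primary_mapped"

# The count can then be compared with:
#   samtools view -c -F 2308 SRR1972739.bwa.bam
```

If the count reported by samtools differs from the expected value, the assumption about how flagstat counts mapped records is the first thing to re-check.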