Assume for all instances below that a single FASTA reference
ref.fasta
and a single FASTQ read file
reads.fastq
are in the same directory as the program executable
Running
bowtie
:
- Build the index
-
bowtie-build ref.fasta reference
- Single-end alignments
-
bowtie -S reference reads.fastq output.sam
(-S
required for SAM output)
Running
bwa
- Build the index
- Single-end alignment
-
bwa aln ref.fasta reads.fastq > output.sai
- Convert to SAM
-
bwa samse ref.fasta output.sai reads.fastq > output.sam
Running
Mosaik
- Build the index
-
MosaikBuild -fr ref.fasta -oa ref.dat
- Convert the reads
-
MosaikBuild -q reads.fastq -out reads.dat -st illumina
(illumina
is the machine the reads were produced by)
- Single-end alignments
-
MosaikAligner -in reads.dat -out output.dat -ia ref.dat -hs 14 -act 17 -mm 2 unique
- Convert to SAM
-
MosaikText -in output.dat -sam output.sam
Note that the above was adapted from the
Build
and
Align
scripts from in the
data
directory of the Mosaik download.
Running
readaligner
- Build the index
- Single-end alignments
-
readaligner -k 2 -v -s -o output.sam ref.fasta -q reads.fastq
-
-k 2
for up to 2 mismatches
- you can change
-k
to -i
for 2 gaps (in my experience, runs much slower)
-
-v
for verbose
-
-s
for sam output (I don't think that readaligner output is tview friendly yet...)
-
-o
to specify the output file
-
-q
to specify that input is FASTQ
If all went well, you should obtain an
output.sam
file, using any of the above programs...
You can then display this using
SAMTools
:
- Index the reference
- Convert SAM to BAM
-
samtools import ref.fasta output.sam output.bam
- Sort the BAM (orders the reads by positions in the reference)
-
samtools sort output.bam sorted
(will create sorted.bam
)
- Index the sorted BAM (unsorted BAM files cannot be indexed)
-
samtools index sorted.bam
- Display
-
samtools tview sorted.bam ref.fasta
On a side note, you can convert a BAM file to a SAM file like so:
-
samtools view sorted.bam > sorted.sam
As typing these can be a real pain, I actually automate the process using batch files...
Some notes about output...
The below three are considered "identical" for our purposes
baligner
SLXA-EAS1_89:1:1:672:654/1 0 gi|170079663|ref|NC_010473.1| 4017473 65 35M * 0 0 GCTACGGAATAAAACCAGGAACAACAGACCCAGCA cccccccccccccccccccc]c``cVcZccbSYbY NM:i:0 MD:Z:35 AS:i:35
bwa
SLXA-EAS1_89:1:1:672:654 0 gi|170079663|ref|NC_010473.1| 4017473 37 35M * 0 0 GCTACGGAATAAAACCAGGAACAACAGACCCAGCA cccccccccccccccccccc]c``cVcZccbSYbY XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:35
bowtie
SLXA-EAS1_89:1:1:672:654/1 0 gi|170079663|ref|NC_010473.1| 4017473 255 35M * 0 0 GCTACGGAATAAAACCAGGAACAACAGACCCAGCA cccccccccccccccccccc]c``cVcZccbSYbY XA:i:0 MD:Z:35 NM:i:0
Some notes about the columns:
- name: may vary slightly only
- flag: should be identical
- reference: should be identical
- position: identical for unique reads only (not for multimap, to see if this is true, check the
XT
tag at the end bwa output
-
XT:A:U
means the read is a single Unique
read, which XT:A:R
means it's a Randomly
selected multimap read
- mapping quality: different (our numbers should be nearly identical to MOSAIK, another aligner)
- cigar: identical for unique reads only (an encoding of the aligned read)
- mate reference: * for single end
- mate position: 0 for single end
- insert size: 0 for single end
- sequence: should be identical
- quality: should be identical
- everything beyond is a tag, which vary greatly between programs but when present...
-
NM
is the number of mismatches, and should be identical
-
MD
is an encoding of the aligned reference, and should be identical
A more interesting example:
baligner
SLXA-EAS1_89:1:1:713:81/1 0 gi|170079663|ref|NC_010473.1| 4058875 5 17M1D18M * 0 0 AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA cccccccccccccccccccccccccccJcc^^bb^ NM:i:1 MD:Z:17^G18 AS:i:32
bwa
SLXA-EAS1_89:1:1:713:81 0 gi|170079663|ref|NC_010473.1| 4058875 37 15M1D20M * 0 0 AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA cccccccccccccccccccccccccccJcc^^bb^ XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:15^G20
bowtie
SLXA-EAS1_89:1:1:713:81/1 4 * 0 0 * * 0 0 AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA cccccccccccccccccccccccccccJcc^^bb^ XM:i:0
In this case, bowtie returns an unmapped hit (it doesn't allow indels by default), while bwa returns an identical unique hit. Note that the CIGAR and MD are slightly different, but upon further investigation, the different is due to where the gap is introduced: we place the gap on the last G in the triplet of G's, while bwa places the gap on the first.
And finally, a multimap example, when
allow_multimap
is
true
:
baligner
SLXA-EAS1_89:1:1:721:668/1.1 0 gi|170079663|ref|NC_010473.1| 521562 62 35M * 0 0 GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG ccccccccccccccccccccccccccccccbbbbb NM:i:0 MD:Z:35 AS:i:35
SLXA-EAS1_89:1:1:721:668/1.2 0 gi|170079663|ref|NC_010473.1| 634822 62 35M * 0 0 GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG ccccccccccccccccccccccccccccccbbbbb NM:i:0 MD:Z:35 AS:i:35
bwa
SLXA-EAS1_89:1:1:721:668 0 gi|170079663|ref|NC_010473.1| 521562 0 35M * 0 0 GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG ccccccccccccccccccccccccccccccbbbbb XT:A:R NM:i:0 X0:i:2 X1:i:2 XM:i:0 XO:i:0 XG:i:0 MD:Z:35 XA:Z:gi|170079663|ref|NC_010473.1|,+634822,35M,0;gi|170079663|ref|NC_010473.1|,+633775,35M,1;gi|170079663|ref|NC_010473.1|,+520515,35M,1;
bowtie
SLXA-EAS1_89:1:1:721:668/1 0 gi|170079663|ref|NC_010473.1| 521562 255 35M * 0 0 GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG ccccccccccccccccccccccccccccccbbbbb XA:i:0 MD:Z:35 NM:i:0
Note that our programs finds two exact match hits, and lists them both with unique names.
bwa
is slightly interesting:
- it randomly selects one of the two positions we found
-
XT:A:R
means that it's a Random
multimap
-
X0:i:2
means that it found 2 exact matches (0 mismatches)
-
X1:i:2
means that it found 2 matches with 1 mismatch
-
XA:Z:gi|170079663|ref|NC_010473.1|,+634822,35M,0;gi|170079663|ref|NC_010473.1|,+633775,35M,1;gi|170079663|ref|NC_010473.1|,+520515,35M,1;
lists the three other matches (the 1 other perfect match + the two 1 mismatch hits)
bowtie
just picks one, and doesn't tell you that it's a multimap at all... (consequently, bowtie runs twice as fast as bwa)