Manual Reference Pages - bwa (1)
NAME
bwa - Burrows-Wheeler Alignment Tool
CONTENTS
Synopsis
Description
Commands And Options
SAM Alignment Format
Notes On Short-read Alignment
Alignment Accuracy
Estimating Insert Size Distribution
Memory Requirement
Speed
Notes On Long-read Alignment
See Also
Author
License And Citation
History
SYNOPSIS
bwa index -a bwtsw
bwa aln short_ > aln_
bwa samse aln_ short_ >
bwa sampe aln_ aln_ >
bwa bwasw long_ >
DESCRIPTION
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database (target), such as the human reference genome. It implements two different algorithms, both based on the Burrows-Wheeler Transform (BWT). The first algorithm is designed for short queries up to ~200bp with a low error rate (<3%). It does gapped global alignment w.r.t. queries, supports paired-end reads, and is one of the fastest short read alignment algorithms to date while also visiting suboptimal hits. The second algorithm, BWA-SW, is designed for long reads with more errors. It performs heuristic Smith-Waterman-like alignment to find high-scoring local hits (and thus chimera). On low-error short queries, BWA-SW is slower and less accurate than the first algorithm, but on long queries, it is better.
For both algorithms, the database file in the FASTA format must first be indexed with the ‘index’ command, which typically takes a few hours. The first algorithm is implemented via the ‘aln’ command, which finds the suffix array (SA) coordinates of good hits of each individual read, and the ‘samse/sampe’ command, which converts SA coordinates to chromosomal coordinates and pairs reads (for ‘sampe’). The second algorithm is invoked by the ‘bwasw’ command. It works for single-end reads only.
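The three-step flow described above (index, then aln, then samse) can be sketched as a small command builder. This is only an illustration: the file names ref.fa, reads.fq and reads.sai and the function name are hypothetical, and the positional order (database, then SA-coordinate file, then reads) is assumed from the upstream bwa usage, since the synopsis as shown here has lost its file arguments. Nothing is executed; the function only returns argument lists.

```python
def single_end_commands(db, reads, sai, use_bwtsw=False):
    """Build argv lists for index -> aln -> samse without running anything."""
    index = ["bwa", "index"]
    if use_bwtsw:
        # bwtsw is the construction algorithm suited to large databases
        index += ["-a", "bwtsw"]
    index.append(db)
    # aln writes the binary SA-coordinate file to stdout (redirect it to `sai`)
    aln = ["bwa", "aln", db, reads]
    # samse reads the SA coordinates plus the reads and writes SAM to stdout
    samse = ["bwa", "samse", db, sai, reads]
    return [index, aln, samse]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    for argv in single_end_commands("ref.fa", "reads.fq", "reads.sai", use_bwtsw=True):
        print(" ".join(argv))
```

To actually run a step, each list could be passed to `subprocess.run(argv, stdout=fh, check=True)` with `fh` opened on the desired output file, since aln and samse write their results to standard output.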
COMMANDS AND OPTIONS
index
bwa index [-p prefix] [-a algoType] [-c] <>
Index database sequences in the FASTA format.
OPTIONS:
-c  Build color-space index. The input fasta should be in nucleotide space.
-p STR  Prefix of the output database [same as db filename]
-a STR  Algorithm for constructing BWT index. Available options are:
    is  IS linear-time algorithm for constructing suffix array. It requires 5.37N memory where N is the size of the database. IS is moderately fast, but does not work with databases larger than 2GB. IS is the default algorithm due to its simplicity. The current code for the IS algorithm was reimplemented by Yuta Mori.
    bwtsw  Algorithm implemented in BWT-SW. This method works with the whole human genome, but it does not work with databases smaller than 10MB and it is usually slower than IS.
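The size limits stated for the two construction algorithms can be turned into a small helper. The sketch below is hypothetical and not part of bwa: it assumes decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes), assumes that N and the 5.37N figure are both measured in bytes, and uses made-up function names.

```python
MB = 10**6   # assumption: decimal megabyte
GB = 10**9   # assumption: decimal gigabyte


def choose_index_algorithm(db_size):
    """Pick a value for 'bwa index -a' from the database size in bytes."""
    if db_size > 2 * GB:
        return "bwtsw"   # IS does not work above 2GB
    return "is"          # the default; also the only choice below 10MB


def bwtsw_supported(db_size):
    """bwtsw is documented as not working for databases smaller than 10MB."""
    return db_size >= 10 * MB


def is_memory_estimate(db_size):
    """Memory needed by IS, using the 5.37N figure (bytes assumed)."""
    return 5.37 * db_size
```

For example, `choose_index_algorithm(3 * GB)` returns `"bwtsw"`, while a 5MB database gets `"is"` because bwtsw is not supported at that size.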
aln
bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <> <> > <>
Find the SA coordinates of the input reads. At most maxSeedDiff differences are allowed in the first seedLen bases, and at most maxDiff differences are allowed in the whole sequence.
OPTIONS:
-n NUM  Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
-o INT  Maximum number of gap opens [1]
-e INT  Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
-d INT  Disallow a long deletion within INT bp towards the 3’-end [16]
-i INT  Disallow an indel within INT bp towards the ends [5]
-l INT  Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option typically ranges from 25 to 35 with ‘-k 2’. [inf]
-k INT  Maximum edit distance in the seed [2]
-t INT  Number of threads (multi-threading mode) [1]
-M INT  Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
-O INT  Gap open penalty [11]
-E INT  Gap extension penalty [4]
-R INT  Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
-c  Reverse query but not complement it, which is required for alignment in the color space.
-N  Disable iterative search. All hits with no more than maxDiff differences will be found. This mode is much slower than the default.
-q INT  Parameter for read trimming. BWA trims a read down to argmax_x{sum_{i=x+1}^l(INT-q_i)} if q_l < INT, where l is the original read length. [0]
-I  The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
-B INT  Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written to the BC SAM tag. For paired-end reads, the barcodes from both ends are concatenated. [0]
-b  Specify that the input read sequence file is in the BAM format. For paired-end data, two ends in a pair must be grouped together and options -1 or -2 are usually applied to specify which end should be mapped. Typical command lines for mapping paired-end data in the BAM format are:
    bwa aln -b1 >
    bwa aln -b2 >
    bwa sampe >
-0  When -b is specified, only use single-end reads in mapping.
-1  When -b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
-2  When -b is specified, only use the second read in a read pair in mapping.
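Two of the aln options above, -n with a FLOAT value and -q, reduce to small computations that are easier to follow in code. The sketch below is one possible reading, not BWA's own implementation. `max_diff` assumes the number of errors in a read follows a Poisson distribution with mean read_len × 0.02; the text only says the distance is chosen automatically, so the model is an assumption. `trimmed_length` evaluates the argmax expression directly on a list of integer base qualities and assumes trimming applies only when the last quality is below INT. All function and parameter names are hypothetical.

```python
import math


def max_diff(read_len, err=0.02, missing=0.04):
    """Smallest k such that P(more than k errors) < missing.

    Assumption: errors per read ~ Poisson(read_len * err).
    """
    lam = read_len * err
    term = math.exp(-lam)      # P(0 errors)
    cum = term
    for k in range(1, 1000):
        term *= lam / k        # P(k errors) from P(k-1 errors)
        cum += term
        if 1.0 - cum < missing:
            return k
    return 2                   # fallback if the loop never converges


def trimmed_length(quals, threshold):
    """Return x maximising sum_{i=x+1}^{l} (threshold - q_i), 1-based i.

    The read is left untouched when the last quality is >= threshold
    (assumed trimming condition) or when the read is empty.
    """
    l = len(quals)
    if l == 0 or quals[-1] >= threshold:
        return l
    best_x, best_sum, running = l, 0, 0
    # Walk from the 3' end; at 0-based index x we add the term for i = x + 1.
    for x in range(l - 1, -1, -1):
        running += threshold - quals[x]
        if running > best_sum:
            best_sum, best_x = running, x
    return best_x
```

With the default 0.04 and the Poisson assumption, `max_diff(32)` gives 2 and `max_diff(100)` gives 5. For qualities `[30, 30, 30, 10, 5]` and a threshold of 20, `trimmed_length` returns 3, so the two low-quality bases at the 3’-end are dropped.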
samse
bwa samse [-n maxOcc] <> <> <> > <>
Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
-n INT  Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-r STR  Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
sampe
bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <> <> <> <> <> > <>
Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
-a INT  Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignments to infer the distribution of insert sizes. [500]
-o INT  Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter speeds up pairing. [100000]
-P  Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
-n INT  Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-N INT  Maximum number of alignments to output in the XA tag for discordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
-r STR  Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
bwasw
bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N nHspRev] [-c thresCoef] <> <>
Align query sequences in the <> file.
OPTIONS:
-a INT  Score of a match [1]
-b INT  Mismatch penalty [3]
-q INT  Gap open penalty [5]
-r INT  Gap extension penalty. The penalty for a contiguous gap of size k is q+k*r. [2]
-t INT  Number of threads in the multi-threading mode [1]
-w INT  Band width in the banded alignment [33]
-T INT  Minimum score threshold divided by a [37]
-c FLOAT  Coefficient for threshold adjustment according to query length. Given an l-long query, the threshold for a hit to be retained is a*max{T,c*log(l)}. [5.5]
-z INT  Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
-s INT  Maximum SA interval size for initiating a seed. Higher -s increases accuracy at the cost of speed. [3]
-N INT  Minimum number of seeds supporting the resultant alignment to skip reverse alignment. [5]
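The two formulas given for the bwasw options, the gap penalty q+k*r and the hit threshold a*max{T,c*log(l)}, can be checked with a few lines of Python. This is a sketch using the defaults listed above; the function names are hypothetical, and the logarithm is assumed to be the natural logarithm, which the text does not state.

```python
import math


def gap_penalty(k, q=5, r=2):
    """Penalty for a contiguous gap of size k: q + k*r."""
    return q + k * r


def hit_threshold(query_len, a=1, T=37, c=5.5):
    """Score a hit must reach to be retained: a * max(T, c*log(l)).

    Assumption: log is the natural logarithm.
    """
    return a * max(T, c * math.log(query_len))
```

With these defaults a gap of size 3 costs 5 + 3*2 = 11. For a 100bp query the fixed part T dominates and the threshold is 37; the length-dependent term only takes over for queries of roughly 1000bp and longer.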
SAM ALIGNMENT FORMAT
The output of the ‘aln’ command is binary and designed for BWA use only. BWA outputs the final alignment in the SAM (Sequence Alignment/Map) format. Each line consists of:
Col  Field  Description