October 9, 2025
In this interactive article, we explore the borderline-magical algorithm known as the Burrows-Wheeler Transform (BWT). It
powers data compression in bzip2, and is used by sequence alignment tools like bowtie and bwa, both of which were named after the algorithm.
The BWT has 2 key properties:
coconut is tooccun.Before we dive in, you should know that the BWT has a third, unofficial property: it is not intuitive. Many of the steps in the algorithm will seem arbitrary and it might not be clear why you're even doing them. I'm hoping this article helps you build some intuition around the BWT.
To apply the BWT on a string like , there are 3 steps to follow:
Write down all rotations
banana$Sort strings (row-wise)
$bananaThe BWT is the last column
annb$aaThe $ marks the end of the string, and is needed to make the BWT reversible. Without that marker, you could still
regenerate the matrix in Step , but you wouldn't know which row contains the original string. If it's
an English word, you might guess it's banana and not nabana, but that's harder to do with DNA because most rotations will look reasonable.
In Step , the sorting causes rows that start the same to be more likely to be grouped together. As a result, the character that comes right before (i.e. the character in the last column) is also likely to be similar, based on repeated patterns in the English language, and also in DNA sequences!
For example, in the BWT of ,
the letter c is grouped because it's always followed by an o. Although o is followed by
either c or n, it still clusters in the BWT because its corresponding rows end up being sorted next
to each other.
If there was a row in Step that started with a letter in between c and n, the o's would no longer cluster. For example, what happens to the o's if you add an i to coconut: . Would the o's
cluster if you tried ?
Now it's your turn: try encoding your name or a repetitive string. Which characters can you add or remove to make the BWT cluster more or less?
Given the encoded string, we can reconstruct the matrix from Step as follows: Start with an empty matrix, prepend the BWT string, sort the strings, and repeat until the matrix is filled. Keep clicking Next below until you reconstruct the matrix; the matrix on the right shows the final answer we're working towards.
Current BWT matrix:
Step 0:
Expected matrix:
$bananaOnce the matrix is filled, we can read off the string from any of the rows since we have the $ marker.
Starting from an empty matrix, note that adding the BWT column and sorting gives you the first column of the expected BWT matrix. If you then prepend and sort a second time, you now have the first two columns of the BWT matrix. You can keep going to recreate the whole matrix.
To understand why this works, let's consider a scenario where I give you the first 2 columns of the BWT matrix and ask you to figure out the rest. Remember that the BWT is the last column of the matrix, i.e. the characters that come right before the first column. So by prepending the BWT to the first column, we're still preserving the relationships between the substrings we reconstructed so far (you can imagine the BWT matrix rotates on itself to connect the first and last column together). Then, sorting the current set of substrings gives us the first 3 columns of the BWT matrix.
So far, we've seen how to use the Burrows-Wheeler Transform to encode and decode strings. That's nice and all, but how can we use the BWT for sequence alignment, i.e. looking for a small string in a much larger string?
To do that, I first need to introduce yet another magical property of the BWT: Last-to-first Mapping.
This property states that the order in which you see a letter in the first column is the same order in which you see it in the last column!
Let's consider the word banana: if we annotate each letter with the number of times it occurs in the string
before creating the BWT matrix, the letter a appears in the same order in both the first and last column: a2, a1, a0!
With that in mind, let's find all occurences of the pattern an within banana, using only the first
and last columns. Let's begin by finding rows that start with the last character of the pattern (i.e. n)—you'll see why in a second:
Now that we have an n in the first column, we know that the last column is the character that comes right before n, so we can look for an a in that last column. We find two matches: a1 and a0, so we can visit rows that have those characters in the first column:
And voilà, the highlighted rows both start with an, representing the matches banana and banana.
A few sections ago, we decoded the BWT string by recreating the entire BWT matrix, which was a lot of work. Could we instead use this LF property to decode the BWT string? As a matter of fact, we can!
You can think of decoding the BWT string as a special case of searching, where we're looking for whichever string ends with $. So we can start by finding the $ character, then hop around between the first and column until we find the $ once more and we'll have recreated the reverse of the original string.
If you somehow made it all the way here (let me know at robert@omgenomics.com), and you can't get enough of the BWT, there's a lot more you can learn about:
n has n rotations, so sorting that list of strings has a time
complexity of O(n) rotations * O(n log n) comparisons = O(n2 log n). There's an interesting data structure
called a Suffix Array that you can use to more efficiently generate that matrix. You can learn more about that from Ben
Langmead's lecture notes about suffix arrays and BWT and the FM index.
Ben's lab maintains the bowtie2 sequence aligner,
so his slides are a great in-depth resource.✨ Thanks to Ben Langmead, Niema Moshiri, Maria Nattestad, and Zamin Iqbal for their insightful feedback on this article.