Burrows-Wheeler Transform (BWT) interactive introduction

The Burrows-Wheeler Transform

October 9, 2025

In this interactive article, we explore the borderline-magical algorithm known as the Burrows-Wheeler Transform (BWT). It powers data compression in bzip2, and is used by sequence alignment tools like bowtie and bwa, both of which were named after the algorithm.

The BWT has 2 key properties:

It shuffles strings such that identical letters tend to be grouped together, e.g. the BWT of coconut is tooccun.
The transform can be reversed to get the original string back, which is what makes it useful.

Before we dive in, you should know that the BWT has a third, unofficial property: it is not intuitive. Many of the steps in the algorithm will seem arbitrary and it might not be clear why you're even doing them. I'm hoping this article helps you build some intuition around the BWT.

The BWT algorithm

To apply the BWT on a string like , there are 3 steps to follow:

String to encode:

Write down all rotations

banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana

Sort strings (row-wise)

$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

The BWT is the last column

annb$aa

What's the dollar sign for?

The $ marks the end of the string, and is needed to make the BWT reversible. Without that marker, you could still regenerate the matrix in Step , but you wouldn't know which row contains the original string. If it's an English word, you might guess it's banana and not nabana, but that's harder to do with DNA because most rotations will look reasonable.

Intuition behind the BWT

In Step , the sorting causes rows that start the same to be more likely to be grouped together. As a result, the character that comes right before (i.e. the character in the last column) is also likely to be similar, based on repeated patterns in the English language, and also in DNA sequences!

For example, in the BWT of , the letter c is grouped because it's always followed by an o. Although o is followed by either c or n, it still clusters in the BWT because its corresponding rows end up being sorted next to each other.

If there was a row in Step that started with a letter in between c and n, the o's would no longer cluster. For example, what happens to the o's if you add an i to coconut: . Would the o's cluster if you tried ?

Now it's your turn: try encoding your name or a repetitive string. Which characters can you add or remove to make the BWT cluster more or less?

Decoding the BWT

Given the encoded string, we can reconstruct the matrix from Step as follows: Start with an empty matrix, prepend the BWT string, sort the strings, and repeat until the matrix is filled. Keep clicking Next below until you reconstruct the matrix; the matrix on the right shows the final answer we're working towards.

Current BWT matrix:

Step 0:

Slow Fast

Expected matrix:

$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

Once the matrix is filled, we can read off the string from any of the rows since we have the $ marker.

Intuition behind the decoding algorithm

Starting from an empty matrix, note that adding the BWT column and sorting gives you the first column of the expected BWT matrix. If you then prepend and sort a second time, you now have the first two columns of the BWT matrix. You can keep going to recreate the whole matrix.

To understand why this works, let's consider a scenario where I give you the first 2 columns of the BWT matrix and ask you to figure out the rest. Remember that the BWT is the last column of the matrix, i.e. the characters that come right before the first column. So by prepending the BWT to the first column, we're still preserving the relationships between the substrings we reconstructed so far (you can imagine the BWT matrix rotates on itself to connect the first and last column together). Then, sorting the current set of substrings gives us the first 3 columns of the BWT matrix.

How to use BWT for sequence alignment

So far, we've seen how to use the Burrows-Wheeler Transform to encode and decode strings. That's nice and all, but how can we use the BWT for sequence alignment, i.e. looking for a small string in a much larger string?

To do that, I first need to introduce yet another magical property of the BWT: Last-to-first Mapping.

Last-to-first Mapping property

This property states that the order in which you see a letter in the first column is the same order in which you see it in the last column!

Let's consider the word banana: if we annotate each letter with the number of times it occurs in the string before creating the BWT matrix, the letter a appears in the same order in both the first and last column: a₂, a₁, a₀!

$_undefinedb_undefineda_undefinedn_undefineda_undefinedn_undefineda_undefined
a_undefined$_undefinedb_undefineda_undefinedn_undefineda_undefinedn_undefined
a_undefinedn_undefineda_undefined$_undefinedb_undefineda_undefinedn_undefined
a_undefinedn_undefineda_undefinedn_undefineda_undefined$_undefinedb_undefined
b_undefineda_undefinedn_undefineda_undefinedn_undefineda_undefined$_undefined
n_undefineda_undefined$_undefinedb_undefineda_undefinedn_undefineda_undefined
n_undefineda_undefinedn_undefineda_undefined$_undefinedb_undefineda_undefined

Find a substring

With that in mind, let's find all occurences of the pattern an within banana, using only the first and last columns. Let's begin by finding rows that start with the last character of the pattern (i.e. n)—you'll see why in a second:

Now that we have an n in the first column, we know that the last column is the character that comes right before n, so we can look for an a in that last column. We find two matches: a₁ and a₀, so we can visit rows that have those characters in the first column:

And voilà, the highlighted rows both start with an, representing the matches banana and banana.

Wait a minute...

A few sections ago, we decoded the BWT string by recreating the entire BWT matrix, which was a lot of work. Could we instead use this LF property to decode the BWT string? As a matter of fact, we can!

You can think of decoding the BWT string as a special case of searching, where we're looking for whichever string ends with $. So we can start by finding the $ character, then hop around between the first and column until we find the $ once more and we'll have recreated the reverse of the original string.

What's next?

If you somehow made it all the way here (let me know at robert@omgenomics.com), and you can't get enough of the BWT, there's a lot more you can learn about:

Suffix Arrays: It turns out the way we generated the BWT transform above was quite inefficient. A string of length n has n rotations, so sorting that list of strings has a time complexity of O(n) rotations * O(n log n) comparisons = O(n² log n). There's an interesting data structure called a Suffix Array that you can use to more efficiently generate that matrix. You can learn more about that from Ben Langmead's lecture notes about suffix arrays and BWT and the FM index. Ben's lab maintains the bowtie2 sequence aligner, so his slides are a great in-depth resource.
FM Index: In the examples above, the searches were small enough that we hopped between the first and last column by eye by looking at every single row. But making millions of queries within a large string like the human genome with 3 billion basepairs would be too slow. If you want to learn about how to mitigate those issues in practice, check out Ben Langmead's lecture notes on the FM index.
Compression: As I mentioned above, BWT helps with compression because of how it tends to cluster characters together. You can learn more about compression from Carl Kingsford's lecture notes on BWT and compression.

✨ Thanks to Ben Langmead, Niema Moshiri, Maria Nattestad, and Zamin Iqbal for their insightful feedback on this article.