Genomic ranges

Understand the genome's coordinate system

By Robert Aboukhalil

June 10, 2025

Ranges are everywhere in genomics. Typically represented as chromosome:start-end, they track where interesting things are found (exons, protein binding sites, etc.), or how metrics change along the genome (GC content, copy-number, etc.). Here we explore how ranges work, and how to analyze them on the command line with bedtools.

Merge overlapping ranges

Let's consider a simple example with 3 ranges:

Chrom	Start	End
chr1	20	120
chr1	100	150
chr1	180	280

The three ranges above are stored in a BED file, a tab-separated file format for storing ranges. BED files must have the 3 columns shown (chromosome, start, end), but can also include extra columns. A popular tool for wrangling BED files is bedtools, which we use here.
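To make the format concrete, here is a minimal Python sketch that reads BED text into a list of ranges. It is only an illustration, not how bedtools parses files: the names Range and parse_bed are made up for this example, it keeps only the 3 required columns, and it assumes there are no header or comment lines.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """One BED record, reduced to its 3 required columns (hypothetical helper)."""
    chrom: str
    start: int
    end: int


def parse_bed(text: str) -> list[Range]:
    """Parse tab-separated BED text, ignoring any extra columns."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
        ranges.append(Range(fields[0], int(fields[1]), int(fields[2])))
    return ranges


# The three example ranges, written the way they would appear in chipseq.bed
chipseq = "chr1\t20\t120\nchr1\t100\t150\nchr1\t180\t280\n"
for r in parse_bed(chipseq):
    print(r.chrom, r.start, r.end)
```

Note that a line separated by spaces instead of tabs fails the column check, which is one of the pitfalls discussed later in this article.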

For example, these ranges could represent the locations of protein binding sites from a ChIP-seq experiment. To get a concise representation, let's merge overlapping ranges with bedtools merge:

Output of bedtools merge -i chipseq.bed:
Chrom	Start	End
chr1	20	150
chr1	180	280
What happens when all the ranges overlap?

Merging doesn't have to be all or nothing: you can also merge nearby ranges that might be biologically related. Let's merge ranges within 30 basepairs of each other using the parameter -d:

Output of bedtools merge -i chipseq.bed -d 30:
Tweak the value of -d to see how it impacts whether ranges are combined or not.
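If it helps to see the logic spelled out, here is a rough Python sketch of merging, with a max_gap parameter playing the role of -d. This is my own illustration, not the bedtools implementation: the function name is hypothetical, and treating a gap exactly equal to max_gap as mergeable is an assumption you should check against bedtools itself. Unlike bedtools merge, which expects sorted input, the sketch sorts the ranges first.

```python
def merge_ranges(ranges, max_gap=0):
    """Merge ranges on the same chromosome that overlap, touch, or lie
    within max_gap bases of each other (assumed analogue of -d)."""
    merged = []
    for chrom, start, end in sorted(ranges):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            # Close enough to the previous merged range: extend it
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            # Too far away (or a new chromosome): start a new merged range
            merged.append((chrom, start, end))
    return merged


chipseq = [("chr1", 20, 120), ("chr1", 100, 150), ("chr1", 180, 280)]
print(merge_ranges(chipseq))              # [('chr1', 20, 150), ('chr1', 180, 280)]
print(merge_ranges(chipseq, max_gap=30))  # [('chr1', 20, 280)]
```

The second call combines everything because the gap between position 150 and position 180 is exactly 30 under this sketch's rule.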

Intersect overlapping ranges

Another very common operation is to intersect two sets of ranges to get shared ranges. Here we have 2 BED files: exons.bed contains the locations of exons, and cpg.bed contains the locations of CpG islands.

To find where exons and CpG islands overlap, we intersect those ranges using bedtools intersect:

Output of bedtools intersect -a exons.bed -b cpg.bed (Input A: exons.bed, Input B: cpg.bed):
Try tweaking the ranges to get more intersections in the output.

As shown above, where Input A intersects Input B, bedtools intersect returns the portion of A that overlaps B. In some cases, you might instead want to filter down Input A by keeping only the ranges that intersect Input B. This is where the flags -wa and -v come in.

To find exons that overlap with CpG islands and output original ranges, use the flag -wa. And to return the ranges in Input A that don't intersect, use the flag -v:
Output of bedtools intersect -a exons.bed -b cpg.bed:
Try moving the ranges so that one range from exons.bed overlaps 2 ranges in cpg.bed. What do you notice about the output?
Can you think of a bedtools command that would remove duplicate ranges?
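To compare the three behaviors side by side, here is a small Python sketch of what intersect does by default, with -wa, and with -v. It is an illustration under stated assumptions rather than the bedtools implementation: the function name and its mode argument are hypothetical, and the exon and CpG coordinates are made up for the example.

```python
def intersect(a_ranges, b_ranges, mode="overlap"):
    """Sketch of three intersect behaviors:
    mode="overlap": the portion of each A range that overlaps a B range (default)
    mode="wa":      the original A range, once per overlapping B range
    mode="v":       the A ranges that overlap nothing in B
    """
    if mode not in ("overlap", "wa", "v"):
        raise ValueError(f"unknown mode: {mode}")
    out = []
    for a_chrom, a_start, a_end in a_ranges:
        # Collect the overlapping portion for every B range sharing >= 1 base
        hits = [
            (max(a_start, b_start), min(a_end, b_end))
            for b_chrom, b_start, b_end in b_ranges
            if b_chrom == a_chrom and max(a_start, b_start) < min(a_end, b_end)
        ]
        if mode == "overlap":
            out.extend((a_chrom, s, e) for s, e in hits)
        elif mode == "wa":
            out.extend((a_chrom, a_start, a_end) for _ in hits)
        elif not hits:  # mode == "v"
            out.append((a_chrom, a_start, a_end))
    return out


# Made-up coordinates, for illustration only
exons = [("chr1", 10, 50), ("chr1", 100, 200), ("chr1", 300, 350)]
cpg = [("chr1", 40, 120), ("chr1", 180, 220)]

print(intersect(exons, cpg))        # [('chr1', 40, 50), ('chr1', 100, 120), ('chr1', 180, 200)]
print(intersect(exons, cpg, "wa"))  # [('chr1', 10, 50), ('chr1', 100, 200), ('chr1', 100, 200)]
print(intersect(exons, cpg, "v"))   # [('chr1', 300, 350)]
```

In the "wa" output, the exon at 100-200 appears twice because it overlaps two CpG islands, which is the behavior the first question above asks you to notice.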

Calculate genome coverage

Another common operation is to calculate coverage, that is, at every position in the genome, how many ranges are found at that position? Say you ran a sequencing experiment, mapped the reads to the genome, and obtained the result shown below. We can use bedtools genomecov to create a histogram of the number of bases that are covered by 0, 1, 2, or 3 ranges:

Output of bedtools genomecov -i reads.bed -g genome.txt
How can you modify the ranges to hide the histogram bar for 0 coverage?

This command needs to know the size of each chromosome so it can count the positions with no coverage—we stored that information in genome.txt. For simplicity, we used a .bed file as the input, but bedtools also supports .bam files, in which case you don't need genome.txt because the .bam has that information already in the header.

Also, keep in mind that bedtools outputs a text file, not a pretty histogram. Use the embedded terminal to see what that output looks like and how you can interpret it as a histogram. Use the manual to understand what the last 2 columns of the output represent.
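To see where the histogram counts come from, here is a naive Python sketch that computes, for each chromosome, how many bases are covered by 0, 1, 2, ... ranges. It is not how bedtools genomecov works internally and it does not reproduce its output format: the function name, the reads, and the chromosome size are all made up, and the per-base array is only practical for toy examples.

```python
from collections import Counter


def coverage_histogram(ranges, chrom_sizes):
    """Return {chrom: {depth: number_of_bases}} for 0-based, half-open ranges."""
    hist = {}
    for chrom, size in chrom_sizes.items():
        depth = [0] * size  # one counter per base; fine for toy data only
        for r_chrom, start, end in ranges:
            if r_chrom == chrom:
                # Clip to the chromosome so out-of-bounds ranges do not crash
                for pos in range(max(start, 0), min(end, size)):
                    depth[pos] += 1
        hist[chrom] = dict(sorted(Counter(depth).items()))
    return hist


# Made-up reads and genome size, for illustration only
reads = [("chr1", 0, 40), ("chr1", 20, 60), ("chr1", 30, 80)]
genome = {"chr1": 100}  # the kind of information stored in genome.txt

print(coverage_histogram(reads, genome))  # {'chr1': {0: 20, 1: 40, 2: 30, 3: 10}}
```

The entry for depth 0 is the reason the chromosome sizes are needed: without them, there is no way to know how many bases have no coverage at all.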

Pitfalls

This article wouldn't be complete if I didn't mention the ways bioinformatics conspires to make you question your life choices. When it comes to genomic ranges specifically, keep in mind that:

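  • BED files use 0-based, half-open coordinates: the start position is counted from 0 and the end position is not included in the range, so the first 100 bases of a chromosome are written as start 0, end 100. Just to keep it interesting, some formats like VCF use 1-based indexing.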
  • Coordinates exist within the context of a reference genome. When you're running range operations on multiple BED files (like intersect), make sure they all use coordinates from the same reference genome! The changes between 2 genome versions might be so small that you might not notice.
  • If your BED files contain forward/reverse strand information (column 6) and you don't want to merge/intersect ranges on different strands, make sure to use the flag -s in your bedtools commands.
  • If you find yourself modifying a BED file manually (not that I've ever done that myself), make sure your IDE doesn't helpfully change your tabs into spaces on one of the lines, because that renders the file incorrect, and software like bedtools will give you an error. You can use tools like bedqc to validate your BED files.
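The first pitfall is the easiest one to get wrong by a single base, so here is a minimal Python sketch of the arithmetic for moving between the two conventions. The function names are hypothetical, and the sketch assumes the 1-based ranges are closed, meaning both start and end are included.

```python
def bed_length(start, end):
    """Number of bases in a BED range (0-based start, end not included)."""
    return end - start


def bed_to_one_based(start, end):
    """Convert a 0-based, half-open range to a 1-based, closed range."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed range to a 0-based, half-open range."""
    return start - 1, end


def vcf_pos_to_bed(pos):
    """BED range covering the single base at a 1-based VCF position."""
    return pos - 1, pos


print(bed_length(0, 100))        # 100
print(bed_to_one_based(0, 100))  # (1, 100)
print(one_based_to_bed(1, 100))  # (0, 100)
print(vcf_pos_to_bed(1))         # (0, 1)
```

Only the start coordinate shifts between the two conventions; the end stays the same because an excluded 0-based end and an included 1-based end point at the same base.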

What's next?

Here we covered commonly used operations on ranges, but there are many, many more. You don't need to know all of them, but it's good to briefly browse the list of bedtools commands, just so you know what's possible for future reference. You'll be surprised by how much bedtools can do.
To dig deeper, use a terminal with bedtools and the bedtools manual to work through the following questions:
  • Merging ranges is nice and all, but how would you go about also counting how many intervals were merged? BED files usually have more columns than we did here, but in this case, choosing any column number works.
  • When using bedtools intersect, how would you intersect ranges only if they overlap by a significant amount, say 50%?
  • What happens if you run bedtools intersect with the flag -wb? How does the output look different?
  • How would you use bedtools to get a list of regions along the genome where no ranges exist? Consult the manual to see which command could help.