June 10, 2025
Ranges are everywhere in genomics. Typically represented as chromosome:start-end, they track where interesting things are found (exons, protein binding sites, etc.), or how metrics change along the genome (GC content, copy-number, etc.). Here we explore how ranges work, and how
to analyze them on the command-line with bedtools.
Let's consider a simple example with 3 ranges:
To the right of the visualization Under the visualization is a BED file, a tab-separated file for storing ranges. BED files must have the 3 columns shown, but can also include extra columns. A popular tool for wrangling BED files is bedtools, which we use here.
For example, these ranges could represent the locations of protein binding sites from a ChIP-seq experiment. To get a concise
representation, let's merge overlapping ranges with bedtools merge:
Merging doesn't have to be all or nothing: you can merge nearby ranges that might be biologically related. Let's merge ranges
within basepairs of each other using the parameter -d:
-d to see how it impacts whether ranges are combined or not.Another very common operation is to intersect two sets of ranges to get shared ranges. Here we have 2 BED files: exons.bed contains the locations of exons, and cpg.bed contains the locations of CpG
islands.
To find the CpG islands that overlap exons, we intersect those ranges using bedtools intersect:
exons.bedcpg.bedAs shown above, where Input A intersects Input B, bedtools intersect returns the portion of A that overlaps B. In
some cases, you might instead want to filter down Input A by keeping only the ranges that intersect Input B. This is where the
flags -wa and -v come in.
exons.bed overlaps 2 ranges in cpg.bed. What do you notice about the output?bedtools command that would remove duplicate ranges?Another common operation is to calculate coverage, that is, at every position in the genome, how many ranges are found
at that position? Say you ran a sequencing experiment, mapped the reads to the genome, and obtained the result shown below. We
can use bedtools genomecov to create a histogram of the number of bases that are covered by 0, 1, 2, or 3 ranges:
bedtools genomecov -i reads.bed -g genome.txt This command needs to know the size of each chromosome so it can count the positions with no coverage—we stored that
information in genome.txt. For simplicity, we used a .bed file as the input, but bedtools also
supports .bam files, in which case you don't need genome.txt because the .bam has that information already in the header.
Also, keep in mind that bedtools outputs a text file, not a pretty histogram. Use the embedded terminal to see
what that output looks like and how you can interpret it as a histogram . Use the manual to understand what
the last 2 columns of the output represent.
This article wouldn't be complete if I didn't mention the ways bioinformatics conspires to make you question your life choices. When it comes to genomic ranges specifically, keep in mind that:
intersect), make sure they all use coordinates from the same reference genome! The
changes between 2 genome versions might be so small that you might not notice.-s in your bedtools commands.bedtools will give you an error. You can use tools like bedqc to validate your BED files.bedtools can do. bedtools intersect, how would you intersect ranges only if they overlap by a significant amount,
say 50%?bedtools intersect with the flag -wb? How does the output look
different?bedtools to get a list of regions along the genome where no ranges exist? Consult the manual
to see which command could help.