Wrangle FASTA and FASTQ with SeqKit

Getting Started

Downsample data

If you’re testing a new tool or writing a new algorithm, large data files can slow you down: long runtimes make it hard to iterate quickly.

This is where downsampling (or subsampling) comes in. In the previous step, we saw that hairpins.fa has ~3.1K sequences. Let’s sample 10% of that file and save it as its own FASTA file:

seqkit sample --proportion 0.1 hairpins.fa > sampled.fa

To sample a fixed number of sequences instead of a fraction, use the --number flag:

seqkit sample --number 10 hairpins.fa > sampled.fa
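After sampling, it helps to check how many sequences actually landed in the output. In the sandbox you could run seqkit stats on sampled.fa; the sketch below does the same check with plain grep by counting header lines. It builds a toy file, demo.fa (a hypothetical name used only for illustration), so that it runs on its own:

```shell
# Build a tiny 3-record FASTA file for illustration.
# demo.fa is a made-up name; in the sandbox, point grep at sampled.fa instead.
printf '>seq1\nACGU\n>seq2\nGGCA\n>seq3\nUUAG\n' > demo.fa

# Every FASTA record starts with a ">" header line, so counting those lines counts records.
count=$(grep -c '^>' demo.fa)
echo "demo.fa contains $count sequences"
```

Swap demo.fa for sampled.fa to confirm the size of your own sample.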

⚠️ With --number, the output may not contain exactly the number of sequences you requested, and the count can change with the random seed. For example, try setting the seed explicitly:

seqkit sample --number 10 --rand-seed 123 hairpins.fa > sampled.fa

See the SeqKit manual for details on why the count can differ from the number requested.
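One way to build intuition for this: if each record is kept independently with a fixed probability, the total kept varies from run to run. The sketch below is a toy simulation in awk, not SeqKit itself, and the assumption that SeqKit samples record by record in this way is ours; the manual remains the authority. The values 1000 and 0.1 are arbitrary.

```shell
# Toy simulation (assumption: per-record random sampling; this is not SeqKit's code).
# For each seed, "keep" each of 1000 imaginary records with probability 0.1
# and report how many were kept.
for seed in 1 2 3; do
  kept=$(awk -v seed="$seed" 'BEGIN {
    srand(seed)                      # seed the random number generator
    n = 0
    for (i = 0; i < 1000; i++)
      if (rand() < 0.1) n++          # keep this record with probability 0.1
    print n
  }')
  echo "seed $seed: kept $kept of 1000"
done
```

The counts should hover around 100 without necessarily hitting it. If you need an exact count, the SeqKit help suggests sampling by proportion and piping the result into seqkit head -n N.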
