Downsample data
If you’re testing a new tool or writing a new algorithm, working with large data files can slow you down because long runtimes make it difficult to iterate quickly.
This is where downsampling (or subsampling) comes in. In the previous step, we saw that hairpins.fa has ~3.1K sequences. Let’s sample 10% of that file and save it as its own FASTA file:
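A minimal sketch of that command, assuming SeqKit is on your PATH and hairpins.fa from the previous step is in the current directory. The output name hairpins.sample.fa is only an example:

```shell
# Keep roughly 10% of the records in hairpins.fa.
# hairpins.sample.fa is an arbitrary output name chosen for this example.
seqkit sample --proportion 0.1 hairpins.fa > hairpins.sample.fa

# Check how many sequences ended up in the sample.
seqkit stats hairpins.sample.fa
```

Sampling by proportion keeps approximately, not exactly, 10% of the records, so expect a count near 310 rather than a fixed number.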
To sample a fixed number of sequences instead of a fraction, use the --number flag:
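For instance, the following sketch requests 100 sequences. The target of 100 and the output name hairpins.n100.fa are illustrative choices, and the same assumptions apply (SeqKit installed, hairpins.fa in the current directory):

```shell
# Ask for about 100 sequences from hairpins.fa.
# Both the number and the output filename are example values.
seqkit sample --number 100 hairpins.fa > hairpins.n100.fa

# Count the FASTA header lines to see how many sequences were written.
grep -c '^>' hairpins.n100.fa
```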
⚠️ Depending on the random seed, you may not always obtain exactly the number of sequences requested. For example:
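One way to see this for yourself is to repeat the same request with a few different seeds and count the results. The sketch below assumes SeqKit is installed and hairpins.fa is in the current directory; the seeds 1, 2, 3 and the target of 100 are arbitrary example values, and the counts you get will depend on your data and SeqKit version:

```shell
# Request 100 sequences with several seeds and report how many come back.
for seed in 1 2 3; do
    # SeqKit logs progress to stderr; silence it so only the count is printed.
    n=$(seqkit sample --number 100 --rand-seed "$seed" hairpins.fa 2>/dev/null | grep -c '^>')
    echo "seed=$seed sequences=$n"
done
```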
See the SeqKit manual for details on why this happens.