So far, we counted k-mers on entire genomes. In this section, we will count k-mers of sequencing reads and see if we can classify them to one or the other genome.
Let’s simulate some sequencing reads from Dengue using wgsim:
Now let’s count k-mers in the sequencing reads, but only if they are also found in the Dengue genome (we’ll ignore R2 reads):
Note that we’re using -C, which means Jellyfish will only count canonical k-mers.
For example, the k-mers ACCT and AGGT are reverse complements of each other, so their counts are grouped under the so-called canonical k-mer ACCT, because it comes first alphabetically.
Note that we did not use -C earlier when counting k-mers on genomes. The difference is: when sequencing DNA, we can’t differentiate which of the 2 strands the read comes from.
Next, let’s count k-mers from Dengue sequencing reads that are also found in the Chikungunya genome:
Looking at the distribution of k-mers, there are thousands more k-mers from Chikungunya that don’t show up in the sequencing reads:
This is a simplistic example but illustrates how k-mers and their distribution can be a useful way to summarize sequencing reads for classification.