A SIMD-accelerated library to compute random minimizers.
It can compute all the minimizers of a human genome in 4 seconds using a single thread. It also provides a canonical version that ensures that a sequence and its reverse-complement always select the same positions, which takes 6 seconds on a human genome.
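To make the term concrete, here is a minimal scalar sketch of forward (non-canonical) random minimizers: for every window of w consecutive k-mers, pick the k-mer with the smallest hash, breaking ties to the left. It is a naive reference, not the SIMD algorithm this crate implements. The names `kmer_hash` and `naive_minimizer_positions` are hypothetical, and the FNV-1a hash is a stand-in for the library's hasher, so the selected positions will generally differ from the crate's output.

```rust
/// Hypothetical stand-in hash for a k-mer (FNV-1a over its bytes).
/// This is NOT the hash used by the library; it only provides a pseudo-random order.
fn kmer_hash(kmer: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in kmer {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Naive O(n * w) forward minimizers: for every window of `w` consecutive
/// k-mers, select the start position of the k-mer with the smallest hash
/// (leftmost on ties), and report each selected position once.
fn naive_minimizer_positions(seq: &[u8], k: usize, w: usize) -> Vec<usize> {
    let mut positions = Vec::new();
    if k == 0 || w == 0 || seq.len() < k + w - 1 {
        return positions;
    }
    // One hash per k-mer start position.
    let hashes: Vec<u64> = seq.windows(k).map(kmer_hash).collect();
    for (start, window) in hashes.windows(w).enumerate() {
        // `min_by_key` returns the first minimum, i.e. the leftmost one.
        let (offset, _) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, h)| *h)
            .unwrap();
        let pos = start + offset;
        // Selected positions never decrease, so comparing with the last one deduplicates.
        if positions.last() != Some(&pos) {
            positions.push(pos);
        }
    }
    positions
}
```

Calling `naive_minimizer_positions(b"ACGTGCTCAGAGACTCAGAGGA", 5, 7)` returns a strictly increasing list of k-mer start positions in which consecutive entries are at most w apart. The canonical variant additionally has to make the choice independent of strand, which this sketch does not attempt.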
This crate builds on packed_seq and seq-hash.
The underlying algorithm is described in the following paper:
- SimdMinimizers: Computing random minimizers, fast. Ragnar Groot Koerkamp, Igor Martayan. SEA 2025. doi.org/10.4230/LIPIcs.SEA.2025.20
This library requires the AVX2 or NEON instruction set. On x64, this means compiling with
either target-cpu=native or target-cpu=x86-64-v3.
See this README for details and this
blog for background.
The same restrictions apply when using simd-minimizers in a larger project.
RUSTFLAGS="-C target-cpu=native" cargo run --release
Full documentation can be found on docs.rs.
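As a quick sanity check on the instruction-set requirement, the following sketch uses only the Rust standard library to report whether the current build was compiled with AVX2 or NEON enabled, and whether the running CPU supports them. The function names `compiled_with_simd` and `simd_available` are hypothetical helpers for illustration; they are not part of simd-minimizers, and nothing here describes how the crate itself handles detection.

```rust
/// True if this binary was compiled with AVX2 or NEON enabled,
/// e.g. via RUSTFLAGS="-C target-cpu=native".
fn compiled_with_simd() -> bool {
    cfg!(any(target_feature = "avx2", target_feature = "neon"))
}

/// True if the CPU running this binary supports AVX2 (x86-64).
#[cfg(target_arch = "x86_64")]
fn simd_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// True if the CPU running this binary supports NEON (AArch64).
#[cfg(target_arch = "aarch64")]
fn simd_available() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

/// Other architectures: neither instruction set applies.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn simd_available() -> bool {
    false
}
```

If `simd_available()` is true but `compiled_with_simd()` is false, the CPU is capable but the build flags are missing, which is the situation the RUSTFLAGS setting above is meant to address.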
use packed_seq::{PackedSeqVec, SeqVec};
use simd_minimizers::{canonical_minimizer_positions, canonical_minimizers};
let seq = b"ACGTGCTCAGAGACTCAGAGGA";
let packed_seq = PackedSeqVec::from_ascii(seq);
let k = 5;
let w = 7;
let hasher = <seq_hash::NtHasher>::new(k);
// Simple usage with default hasher, returning only positions.
let minimizer_positions = canonical_minimizer_positions(packed_seq.as_slice(), k, w);
assert_eq!(minimizer_positions, vec![0, 7, 9, 15]);
// Advanced usage with custom hasher, super-kmer positions, and minimizer values as well.
let mut minimizer_positions = Vec::new();
let mut super_kmers = Vec::new();
let minimizer_vals: Vec<u64> = canonical_minimizers(k, w)
    .hasher(&hasher)
    .super_kmers(&mut super_kmers)
    .run(packed_seq.as_slice(), &mut minimizer_positions)
    .values_u64()
    .collect();
Benchmarks can be found in the bench directory in the GitHub repository.
bench/benches/bench.rs contains benchmarks used in this blogpost.
bench/src/bin/paper.rs contains benchmarks used in the paper.
Note that the benchmarks require some nightly features. You can install the latest nightly version with
rustup install nightly
To replicate results from the paper, go into bench and run
RUSTFLAGS="-C target-cpu=native" cargo +nightly run --release
python eval.py
The human genome we use is from the T2T consortium and is available by following the first link here.