A reproducible benchmark harness for sentence-embedding throughput across the popular Rust ML libraries, using sentence-transformers/all-MiniLM-L6-v2 as the common model.
Compares ort (ONNX Runtime), fastembed, candle, ollama (HTTP), and llama-cpp-2 (in-process llama.cpp). Sweeps sequence length, batch size, thread count, HTTP concurrency, and GPU offload. Every backend's output is validated against a Python sentence-transformers reference via cosine similarity before its latency numbers are trusted.
Full methodology and findings: REPORT.md.
On an Apple M4 Max, fp32-or-equivalent precision, single-machine benchmark:
| Backend | b=1 short (ms p50) | b=32 short (eps) | b=32 long (eps) |
|---|---|---|---|
| llama-cpp-2 (Metal) | 1.26 | 9676 | 948 |
| ort fp32 (t=8) | 1.10 | 3052 | 261 |
| fastembed | 1.71 | 2849 | 315 |
| candle (t=8) | 8.15 | 603 | 31 |
| ollama (HTTP, default) | 11.05 | 433 | 223 |
| ollama (b=8, concurrency=16) | 19.7 | 553 | 260 |
All backends validated at cosine ≥ 0.999 vs Python reference, except naive dynamic int8 quantization (cosine 0.961).
Headline observations:
- Single-query latency: ort fp32 or llama-cpp-2, both around 1–2 ms p50 across all sequence lengths.
- High-throughput indexing: llama-cpp-2 with the `metal` feature flag. ~9.7k embeddings per second on short text.
- Threading parity matters. A lot of "library X is much faster than library Y" comparisons resolve to "X auto-detected cores; Y honored `--threads 1`."
- Ollama's HTTP layer is most of the cost for single-query workloads. The same F16 GGUF loaded in-process via llama-cpp-2 is roughly 8x faster.
- Dynamic int8 quantization was a wash on Apple Silicon. Same speed or slower than fp32 ONNX, cosine drops to 0.96. May be different on x86 with VNNI.
Prerequisites: Rust toolchain, uv (or any way to make a Python 3.12 venv), CMake (build dep for llama-cpp-2). Optional: a local Ollama daemon if you want to benchmark it.
git clone https://github.com/jerrythomas/rust-embedding-bench.git
cd rust-embedding-bench
make bench
`make help` lists the other targets (build, sweep, aggregate, correctness, clean, nuke). The pipeline is idempotent. On a clean checkout it will:
- Create `.venv` and install Python deps (sentence-transformers, optimum, numpy)
- Pull `all-minilm` into Ollama (if the daemon is reachable; otherwise the ollama backend is skipped)
- Build all Rust runners in release mode
- Export the ONNX model and dynamic-int8-quantize it
- Generate Python reference embeddings
- Sweep every backend × length × batch × threads combination
- Run a correctness check (cosine vs reference) per backend
- Print an aggregated comparison table
Cold first run: ~10 minutes (model downloads + cargo build). Warm reruns: ~2–3 minutes.
Backends
| Backend | Engine | Model file |
|---|---|---|
| fastembed | ONNX Runtime via `ort` | Qdrant pre-optimized fp32 ONNX (downloaded) |
| ort | ONNX Runtime | Optimum-exported fp32 ONNX |
| ort-qdrant | ONNX Runtime | Same model fastembed uses, called from our wrapper |
| ort-int8 | ONNX Runtime | Dynamic int8 quantization of the fp32 ONNX |
| candle | candle (pure Rust) | fp32 safetensors from HF |
| ollama | llama.cpp via HTTP daemon | F16 GGUF |
| llama-cpp-2 | llama.cpp in-process | Same F16 GGUF; CPU build by default, Metal optional |
Sweep dimensions
- Sequence length: short (5–20 tokens) / medium (50–100) / long (200–256)
- Batch size: 1, 8, 32
- Thread count: 1, 4, 8 (controlled per-runner; passed via env to libraries that honor `RAYON_NUM_THREADS`/`OMP_NUM_THREADS`)
- Ollama HTTP concurrency: 1, 4, 8, 16 (parallel in-flight requests)
- llama-cpp-2 GPU offload: enable via `features = ["metal"]` in `runners/llama_runner/Cargo.toml`
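The thread-count dimension can be sketched as follows. This is a minimal illustration, not the harness's own sweep code: `runner_command` is a hypothetical helper, and exporting both variables alongside `--threads` is an assumption about how a driver might keep rayon-based and OpenMP-based libraries pinned to the same count.

```rust
use std::process::Command;

/// Build (but do not run) a command for one runner invocation with the
/// thread count pinned. `runner_command` is a hypothetical helper; the
/// `--threads` flag follows the README's direct-invocation example.
fn runner_command(binary: &str, threads: usize) -> Command {
    let mut cmd = Command::new(binary);
    // Flag for runners that take the thread count explicitly.
    cmd.arg("--threads").arg(threads.to_string());
    // Env vars for libraries that size their own pools (assumed to be
    // honored by the underlying library; not every backend reads them).
    cmd.env("RAYON_NUM_THREADS", threads.to_string());
    cmd.env("OMP_NUM_THREADS", threads.to_string());
    cmd
}
```

Setting the variables on the child process rather than on the driver keeps each sweep point isolated, so a `threads=8` run cannot leak its setting into a later `threads=1` run.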
The sweep is configured via environment variables or make VAR=.... Examples:
# Only fastembed and ort, short text, batch=32, all thread configs
make bench SKIP="candle ollama llama" LENGTHS=short BATCHES=32 THREADS="1 4 8"
# Fast smoke test
make bench WARMUP=20 MEASURE=100
# Remote Ollama
OLLAMA_HOST=http://other-host:11434 make bench

| Variable | Default | Effect |
|---|---|---|
| `SKIP` | empty | Space-separated backends to skip (e.g. `"ollama ort"`) |
| `LENGTHS` | `short medium long` | Length buckets to sweep |
| `BATCHES` | `1 8 32` | Batch sizes to sweep |
| `THREADS` | `1` | Thread counts to sweep |
| `WARMUP` | 50 | Warmup embeddings per run (discarded) |
| `MEASURE` | 500 | Measured embeddings per run |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama daemon URL |
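To make the variable semantics concrete, here is a rough Rust restatement of how space-separated overrides with defaults expand into a sweep grid. The real sweep is driven by the Makefile; `SweepPoint`, `list_or`, `numbers_or`, and `sweep_points` are hypothetical names used only for this sketch.

```rust
/// One point in the sweep grid (illustrative type, not the harness's schema).
#[derive(Debug, Clone, PartialEq)]
struct SweepPoint {
    backend: String,
    length: String,
    batch: usize,
    threads: usize,
}

/// Split a space-separated override, falling back to `default` when the
/// variable is unset or blank.
fn list_or(value: Option<&str>, default: &str) -> Vec<String> {
    let raw = match value {
        Some(v) if !v.trim().is_empty() => v,
        _ => default,
    };
    raw.split_whitespace().map(str::to_string).collect()
}

/// Same as `list_or`, keeping only entries that parse as integers.
fn numbers_or(value: Option<&str>, default: &str) -> Vec<usize> {
    list_or(value, default)
        .iter()
        .filter_map(|s| s.parse::<usize>().ok())
        .collect()
}

/// Expand backend x length x batch x threads, dropping skipped backends.
fn sweep_points(
    backends: &[&str],
    skip: &[String],
    lengths: &[String],
    batches: &[usize],
    threads: &[usize],
) -> Vec<SweepPoint> {
    let mut points = Vec::new();
    for &backend in backends {
        if skip.iter().any(|s| s.as_str() == backend) {
            continue;
        }
        for length in lengths {
            for &batch in batches {
                for &t in threads {
                    points.push(SweepPoint {
                        backend: backend.to_string(),
                        length: length.clone(),
                        batch,
                        threads: t,
                    });
                }
            }
        }
    }
    points
}
```

With the first example above (`SKIP="candle ollama llama" LENGTHS=short BATCHES=32 THREADS="1 4 8"`) and a backend list of fastembed, ort, candle, ollama, and llama, this expands to two backends times three thread counts, so six runs.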
Individual runners can also be invoked directly:
./target/release/ort_runner --length medium --batch 32 --threads 8 \
    --warmup 50 --measure 500 --out my_run.json
.
├── Makefile # entry point (`make bench`, `make help`, ...)
├── REPORT.md # methodology + full results
├── LICENSE # MIT
├── Cargo.toml # Rust workspace
├── corpus/sentences.json # 29 deterministic test sentences
├── reference/
│ ├── generate_reference.py # Python sentence-transformers reference
│ ├── export_onnx.py # ONNX export via optimum
│ ├── quantize.py # dynamic int8 quantization
│ ├── download_qdrant.py # Qdrant pre-optimized ONNX
│ └── download_gguf.py # F16 GGUF (Ollama cache or HF Hub)
├── runners/
│ ├── shared/ # common CLI args, result schema, IO helpers
│ ├── fastembed_runner/
│ ├── ort_runner/ # supports --model, --tokenizer, --backend-label
│ ├── candle_runner/
│ ├── ollama_runner/ # supports --concurrency
│ └── llama_runner/ # in-process llama-cpp-2
└── analyze/
├── compare.py # aggregate JSON results into a table
└── correctness.py # cosine similarity vs Python reference
Each runner is a small (~150 line) Rust binary that:
- Parses `shared::CommonArgs` (corpus, length, batch, threads, warmup, measure, out, save_vectors)
- Loads the model once at startup
- Runs a warmup loop
- Runs a measurement loop, tracking per-item latency
- Writes a `BenchResult` JSON via `shared::write_result`
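The warmup and measurement steps can be sketched with the standard library alone. This is an assumption-laden outline rather than the runners' actual code: `Summary` is not the real `BenchResult` schema, `embed_batch` stands in for a backend call, latency is recorded per batch call, and `warmup`/`measure` count calls here, whereas the harness's `WARMUP`/`MEASURE` count embeddings.

```rust
use std::time::{Duration, Instant};

/// Summary of one measured run (illustrative fields only).
#[derive(Debug)]
struct Summary {
    p50_ms: f64,
    embeddings_per_sec: f64,
}

/// Nearest-rank median: sorts in place and returns the lower middle element.
fn p50(latencies: &mut [Duration]) -> Duration {
    assert!(!latencies.is_empty(), "need at least one sample");
    latencies.sort();
    latencies[(latencies.len() - 1) / 2]
}

/// Run `warmup` untimed calls, then `measure` timed calls of `embed_batch`,
/// where each call is assumed to embed `batch` texts.
fn run_benchmark<F: FnMut()>(
    mut embed_batch: F,
    batch: usize,
    warmup: usize,
    measure: usize,
) -> Summary {
    assert!(batch > 0 && measure > 0);
    // Warmup results are discarded so one-time costs stay out of the numbers.
    for _ in 0..warmup {
        embed_batch();
    }
    let mut latencies = Vec::with_capacity(measure);
    let started = Instant::now();
    for _ in 0..measure {
        let t = Instant::now();
        embed_batch();
        latencies.push(t.elapsed());
    }
    let total = started.elapsed();
    Summary {
        p50_ms: p50(&mut latencies).as_secs_f64() * 1e3,
        // Throughput over wall-clock time of the whole measured loop.
        embeddings_per_sec: (batch * measure) as f64 / total.as_secs_f64(),
    }
}
```

Reporting the median rather than the mean keeps a single slow call (a page fault, a scheduler hiccup) from skewing the latency column, while throughput is taken over the whole loop so it reflects sustained rate.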
To add a new library foo:
mkdir -p runners/foo_runner/src
cp runners/fastembed_runner/Cargo.toml runners/foo_runner/Cargo.toml
cp runners/fastembed_runner/src/main.rs runners/foo_runner/src/main.rs
# Edit the dependencies and the embed loop. Add foo_runner to the workspace
# members in the root Cargo.toml. Add "foo" to the BACKENDS list in the Makefile.
make bench
Run with `--save-vectors vectors/foo_short.bin` once and let the harness compare them to the reference vectors. The correctness check catches most "the embedding pipeline is wired wrong" bugs before any latency claim becomes load-bearing.
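The idea behind that check, in a few lines of Rust. The harness's own check lives in `analyze/correctness.py`; this sketch assumes vectors are already loaded into memory (the on-disk format of the `.bin` file is not shown here) and assumes the pass criterion is the worst per-sentence cosine, which may differ from what the script actually computes. `cosine` and `min_cosine` are hypothetical helper names.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None on a length mismatch, empty input, or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0f64, 0f64, 0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (*x as f64, *y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Lowest per-sentence cosine between a backend's vectors and the
/// reference vectors, paired by index.
fn min_cosine(candidate: &[Vec<f32>], reference: &[Vec<f32>]) -> Option<f64> {
    if candidate.len() != reference.len() || candidate.is_empty() {
        return None;
    }
    let mut worst = f64::INFINITY;
    for (c, r) in candidate.iter().zip(reference) {
        worst = worst.min(cosine(c, r)?);
    }
    Some(worst)
}
```

A new backend would then be accepted when `min_cosine(...)` is at least 0.999, the threshold quoted for the results table. Typical wiring mistakes (wrong pooling, missing normalization, a mismatched tokenizer) tend to land well below that.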
The numbers in this README and in REPORT.md were collected on a MacBook Pro with an Apple M4 Max (12 performance + 4 efficiency cores), 48 GB unified memory, macOS 26.3.1. Results on different hardware will differ. In particular:
- x86 with AVX-512 VNNI is expected to do considerably better with int8 (VNNI provides a dedicated int8 dot-product fast path; the dynamic-int8 model showed no comparable benefit on this Apple Silicon machine).
- NVIDIA GPU support is available in `ort` and `llama-cpp-2` via feature flags, but is not exercised here.
- ARM NEON fp32 throughput is very competitive on Apple Silicon, narrowing the typical fp32-vs-int8 gap seen on x86.
If you re-run this on different hardware, the harness will produce JSON output you can drop into REPORT.md-style tables.
MIT.