B4/rhash#12359
Conversation
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide resizable hash map for BPF.
The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:
1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
under-provisioning hurts performance)
BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.
The implementation wraps the kernel's rhashtable with BPF map operations:
- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- In-place updates for improved performance.
- max_entries serves as a hard limit, not bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking,
similar to existing BPF hashmap's raw_spin_lock_irqsave(), insertions and
deletes may fail.
- Iterations are best-effort, if resize, insertions, deletions take place
concurrently, iterations may visit same elements multiple times or skip
elements.
- Lock out insertions, when running special fields destructor to guarantee
its completion.
The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Update implementation
---------------------
Current implementation of the BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on the values consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does memcpy in place with no lock held.
rhash trades consistency for speed, concurrent readers can observe
partially updated data. Two concurrent writers to the same key can
also interleave, producing mixed values. This is similar to arraymap
update implementation, including handling of the special fields.
As a solution, user may use BPF_F_LOCK to guarantee consistent reads
and write serialization.
Summary of the read consistency guarantees:
map type | write mechanism | read consistency
-------------+------------------+--------------------------
htab | alloc, swap ptr | always consistent (RCU)
htab F_LOCK | in-place + lock | consistent if reader locks
-------------+------------------+--------------------------
rhtab | in-place memcpy | torn reads
rhtab F_LOCK | in-place + lock | consistent if reader locks
Benchmarks
----------
1. LOOKUP (single producer, M events/sec)
key | max | nr | htab | rhtab | ratio | delta
----+-----+-------+---------+---------+-------+-------
8 | 1K | 750 | 99.85 | 81.92 | 0.82x | -18 %
8 | 1K | 1K | 100.71 | 80.19 | 0.80x | -20 %
8 | 1M | 750K | 23.37 | 72.09 | 3.08x | +208 %
8 | 1M | 1M | 13.39 | 53.72 | 4.01x | +301 %
32 | 1K | 750 | 51.57 | 42.78 | 0.83x | -17 %
32 | 1K | 1K | 50.81 | 45.83 | 0.90x | -10 %
32 | 1M | 750K | 11.27 | 15.29 | 1.36x | +36 %
32 | 1M | 1M | 7.32 | 8.75 | 1.19x | +19 %
256 | 1K | 750 | 7.58 | 7.88 | 1.04x | +4 %
256 | 1K | 1K | 7.43 | 7.81 | 1.05x | +5 %
256 | 1M | 750K | 3.69 | 4.27 | 1.16x | +16 %
256 | 1M | 1M | 2.60 | 3.12 | 1.20x | +20 %
Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %
because the preallocated bucket array fits in L1. Equalises
at 256 byte keys.
* Large map (1M): rhtab wins everywhere, up to 4x at high load
factor with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the
bucket array; htab stays at user-declared max.
2. FULL UPDATE (M events/sec per producer, -p 7)
htab per-producer:
20.33 22.02 19.27 23.61 24.18 23.17 21.07
mean 21.94 range 19.27 - 24.18
rhtab per-producer:
133.51 129.47 74.52 129.29 102.26 129.98 107.64
mean 115.24 range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)
In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
3. MEMORY (overwrite, -p 8, no --preallocated)
value_size | htab ops/s | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B | 122.87 k/s | 133.04 k/s | 2.47 MiB | 2.49 MiB
4096 B | 64.43 k/s | 65.38 k/s | 6.74 MiB | 6.44 MiB
rhtab/htab : +8 % ops, +0.8 % mem (32 B)
+1 % ops, -4 % mem (4096 B)
SUMMARY
* Small / well-fitting map: htab is faster (cache-friendly
fixed bucket array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster
(1.2x to 4x) because rhashtable resizes to keep the load
factor sane while htab stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer
via in-place memcpy.
* Memory benchmark: effectively on par
---
Changes in v7:
- rhashtable_next_key: move into lib/rhashtable.c, drop params argument
(Herbert).
- rhashtable_next_key: kdoc clarifies that behavior on tables with
duplicate keys is undefined (sashiko).
- rhashtable: include Herbert's "Use irq work for shrinking" patch so
__rhashtable_remove_fast_one() can fire the shrink path from NMI
context (Herbert).
- hashtab: fix u32 multiply overflow in __rhtab_map_lookup_and_delete_batch
copy_to_user; cast total to size_t before multiplying by key_size /
value_size (sashiko, bot+bpf-ci).
- hashtab: allow kptr/refcount fields in rhtab values (same model as
array map).
- Link to v6: https://patch.msgid.link/20260602-rhash-v6-0-1bfd35a4184f@meta.com
Changes in v6:
- rhashtable_next_key: advance past duplicate keys in the main bucket
chain to avoid an infinite loop when there are duplicate keys
(sashiko).
- rhashtable_next_key: return ERR_PTR(-EOPNOTSUPP) on rhltable (sashiko).
- rhashtable: selftest pre-sizes the table to avoid concurrent rehash
triggering spurious failures (sashiko).
- hashtab: real rhtab_map_mem_usage in the basic commit; move
bpf_map_free_internal_structs from rhtab_free_elem into the
special-fields commit where it does meaningful work (bot+bpf-ci).
- bpf_iter (seq_file): switch to rhashtable_walk_* for stronger
coverage under concurrent rehash; get_next_key and batch keep
rhashtable_next_key (sashiko).
- iter ops: rhtab_map_get_next_key adds IS_ERR check
before dereferencing the element pointer (sashiko).
- iter ops: bpf_each_rhash_elem removes cond_resched() (sashiko).
- iter ops: batch returns -EAGAIN (not -ENOENT) on cursor delete,
so userspace can distinguish lost cursor from end-of-iteration
and restart from NULL (sashiko).
- Link to v5: https://patch.msgid.link/20260528-rhash-v5-0-7205191b6c57@meta.com
Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison
on the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations
from the basic commit; iteration commit adds bodies, structs, and
registrations together. .map_get_next_key keeps a stub registration
in the basic commit because the syscall dispatcher does not
NULL-check it; iteration commit replaces the stub body with the
real implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead
element key but then resumed via next_key(cursor), skipping that
element across batch boundaries and orphaning it on
lookup_and_delete_batch. v5 stashes the lookahead key and looks
it up directly on the next batch entry (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing,
matching arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com
Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
iteration for BPF (also drops earlier rhashtable_walk_enter_from()
proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
bucket-lock paths: closes same-CPU NMI tracing recursion without
rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-
fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com
Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
fail on resize where we fully iterate on table inside kernel, dont fail on
resize where iteration goes through userspace. Exception -
rhtab_map_free_internal_structs() should be just safe to iterate fully
in kernel, no risk of infinite loop, because no user holding reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/
Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com
--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
"series": {
"revision": 7,
"change-id": "20251103-rhash-7b70069923d8",
"prefixes": [
"bpf-next"
],
"history": {
"v1": [
"20260205-rhash-v1-0-30dd6d63c462@meta.com"
],
"v2": [
"20260408-rhash-v2-0-3b3675da1f6e@meta.com"
],
"v3": [
"20260424-rhash-v3-0-d0fa0ce4379b@meta.com"
],
"v4": [
"20260513-rhash-v4-0-dd3d541ccb0b@meta.com"
],
"v5": [
"20260528-rhash-v5-0-7205191b6c57@meta.com"
],
"v6": [
"20260602-rhash-v6-0-1bfd35a4184f@meta.com"
]
}
}
}
Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.
void *rhashtable_next_key(struct rhashtable *ht,
const void *prev_key);
Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.
Best-effort: in case of concurrent resize, provides no guarantees:
- may produce duplicate elements
- may skip any amount of elements
- termination of the loop is not guaranteed in case of
sustained rehash. Callers are advised to bound loop externally
or avoid inserting new elements during such loop.
Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported — returns ERR_PTR(-EOPNOTSUPP).
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Insert n elements, then verify: - NULL prev_key walks from the beginning, visiting all n - non-existing prev_key returns ERR_PTR(-ENOENT) Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Use irq work for automatic shrinking so that this may be called in NMI context. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast() for deletes, and rhashtable_lookup_get_insert_fast() for inserts. Updates modify values in place under RCU rather than allocating a new element and swapping the pointer (as regular htab does). This trades read consistency for performance: concurrent readers may see partial updates. BPF_F_LOCK support and special-field handling (timers, kptrs, etc.) follow in a later commit. Initialize rhashtable with bpf_mem_alloc element cache. Require BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via rhashtable_free_and_destroy(). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Implement get_next_key, batch lookup/lookup-and-delete, for_each_map_elem callback, and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH. get_next_key() and batch use rhashtable_next_key() — stateless, matches the syscall UAPI shape (no kernel-side iterator state). get_next_key falls back to the first key when prev_key was concurrently deleted (matches htab semantics). Batch reports cursor loss as -EAGAIN so userspace can distinguish it from end-of-iteration (-ENOENT) and restart from NULL. The seq_file BPF iterator uses rhashtable_walk_* instead. It runs only from read() syscall context, so the walker's spin_lock is safe, and seq_file's per-fd state lets the walker handle rehash correctly (retry on -EAGAIN) for stronger coverage than the stateless API can provide. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add support for timers, workqueues, task work, spin locks and kptrs. Without this, users needing deferred callbacks, BPF_F_LOCK, or refcounted kernel pointers in a dynamically-sized map have no option - fixed-size htab is the only map supporting these field types. Resizable hashtab should offer the same capability. kptr semantics under in-place updates are identical to array map. Properly clean up BTF record fields on element delete and map teardown by wiring up bpf_obj_free_fields through a memory allocator destructor, matching the pattern used by htab for non-prealloc maps. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Specialize the lookup/update/delete paths for keys whose size matches sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const rhashtable_params lets the compiler inline a custom XOR-fold hashfn and a single-word equality cmpfn, eliminating the indirect jhash dispatch. The same hashfn and cmpfn are installed into rhashtable's stored params at rhashtable_init time, so the rehash worker, slow-path inserts, and rhashtable_next_key() all agree with the inlined fast paths. The seq_file BPF iterator uses rhashtable_walk_* and is unaffected. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add BPF_MAP_TYPE_RHASH to libbpf's map type name table and feature probing so that libbpf-based tools can create and identify resizable hash maps. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Test basic map operations (lookup, update, delete) for BPF_MAP_TYPE_RHASH including boundary conditions like duplicate key insertion and deletion of nonexistent keys. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Test basic BPF iterator functionality for BPF_MAP_TYPE_RHASH, verifying all elements are visited. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Make bpftool documentation aware of the resizable hash map. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Support resizable hashmap in BPF map benchmarks.
1. LOOKUP (single producer, M events/sec)
key | max | nr | htab | rhtab | ratio | delta
----+-----+-------+---------+---------+-------+-------
8 | 1K | 750 | 99.85 | 81.92 | 0.82x | -18 %
8 | 1K | 1K | 100.71 | 80.19 | 0.80x | -20 %
8 | 1M | 750K | 23.37 | 72.09 | 3.08x | +208 %
8 | 1M | 1M | 13.39 | 53.72 | 4.01x | +301 %
32 | 1K | 750 | 51.57 | 42.78 | 0.83x | -17 %
32 | 1K | 1K | 50.81 | 45.83 | 0.90x | -10 %
32 | 1M | 750K | 11.27 | 15.29 | 1.36x | +36 %
32 | 1M | 1M | 7.32 | 8.75 | 1.19x | +19 %
256 | 1K | 750 | 7.58 | 7.88 | 1.04x | +4 %
256 | 1K | 1K | 7.43 | 7.81 | 1.05x | +5 %
256 | 1M | 750K | 3.69 | 4.27 | 1.16x | +16 %
256 | 1M | 1M | 2.60 | 3.12 | 1.20x | +20 %
Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20%
* Large map (1M): rhtab wins everywhere, up to 4x at high load
factor with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the
bucket array; htab stays at user-declared max.
2. FULL UPDATE (M events/sec per producer)
htab per-producer:
20.33 22.02 19.27 23.61 24.18 23.17 21.07
mean 21.94 range 19.27 - 24.18
rhtab per-producer:
133.51 129.47 74.52 129.29 102.26 129.98 107.64
mean 115.24 range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)
In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
3. MEMORY
value_size | htab ops/s | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B | 122.87 k/s | 133.04 k/s | 2.47 MiB | 2.49 MiB
4096 B | 64.43 k/s | 65.38 k/s | 6.74 MiB | 6.44 MiB
rhtab/htab : +8 % ops, +0.8 % mem (32 B)
+1 % ops, -4 % mem (4096 B)
Throughput effectively tied
SUMMARY
* Small / well-fitting map: htab is faster (cache-friendly
fixed bucket array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster
(1.2x to 4x) because rhashtable resizes to keep the load
factor sane while htab stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer
via in-place memcpy.
* Memory benchmark: effectively on par.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
4f5632b to
970af1b
Compare
3a26044 to
818f7b1
Compare
No description provided.