B4/rhash by mykyta5 · Pull Request #12359 · kernel-patches/bpf

mykyta5 · 2026-06-05T11:15:43Z

No description provided.

This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
   under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete,
  get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires the BPF_F_NO_PREALLOC flag (elements are allocated on demand)
- Updates values in place for improved performance
- max_entries serves as a hard limit, not a bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking, similar to
  the existing BPF hashmap's raw_spin_lock_irqsave(); insertions and
  deletes may fail
- Iterations are best-effort: if resizes, insertions, or deletions take
  place concurrently, an iteration may visit the same element multiple
  times or skip elements
- Locks out insertions while the special-fields destructor runs, to
  guarantee its completion
The series includes comprehensive tests:

- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

---

Update implementation
---------------------

The current implementation of BPF_MAP_TYPE_RHASH does not provide the same
strong guarantees on value consistency under concurrent reads/writes as
BPF_MAP_TYPE_HASH. BPF_MAP_TYPE_HASH allocates a new element and atomically
swaps the pointer. BPF_MAP_TYPE_RHASH does a memcpy in place with no lock
held. rhash trades consistency for speed: concurrent readers can observe
partially updated data. Two concurrent writers to the same key can also
interleave, producing mixed values. This is similar to the arraymap update
implementation, including the handling of special fields. As a solution,
users may use BPF_F_LOCK to guarantee consistent reads and write
serialization.

Summary of the read consistency guarantees:

map type     | write mechanism  | read consistency
-------------+------------------+---------------------------
htab         | alloc, swap ptr  | always consistent (RCU)
htab F_LOCK  | in-place + lock  | consistent if reader locks
-------------+------------------+---------------------------
rhtab        | in-place memcpy  | torn reads
rhtab F_LOCK | in-place + lock  | consistent if reader locks

Benchmarks
----------

1. LOOKUP (single producer, M events/sec)

key | max | nr    | htab    | rhtab   | ratio | delta
----+-----+-------+---------+---------+-------+--------
  8 | 1K  | 750   |  99.85  |  81.92  | 0.82x |  -18 %
  8 | 1K  | 1K    | 100.71  |  80.19  | 0.80x |  -20 %
  8 | 1M  | 750K  |  23.37  |  72.09  | 3.08x | +208 %
  8 | 1M  | 1M    |  13.39  |  53.72  | 4.01x | +301 %
 32 | 1K  | 750   |  51.57  |  42.78  | 0.83x |  -17 %
 32 | 1K  | 1K    |  50.81  |  45.83  | 0.90x |  -10 %
 32 | 1M  | 750K  |  11.27  |  15.29  | 1.36x |  +36 %
 32 | 1M  | 1M    |   7.32  |   8.75  | 1.19x |  +19 %
256 | 1K  | 750   |   7.58  |   7.88  | 1.04x |   +4 %
256 | 1K  | 1K    |   7.43  |   7.81  | 1.05x |   +5 %
256 | 1M  | 750K  |   3.69  |   4.27  | 1.16x |  +16 %
256 | 1M  | 1M    |   2.60  |   3.12  | 1.20x |  +20 %

Pattern:

* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 % because the
  preallocated bucket array fits in L1. Equalises at 256 byte keys.
* Large map (1M): rhtab wins everywhere, up to 4x at high load factor with
  8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the bucket array;
  htab stays at user-declared max.

2. FULL UPDATE (M events/sec per producer, -p 7)

htab per-producer:
  20.33 22.02 19.27 23.61 24.18 23.17 21.07
  mean 21.94, range 19.27 - 24.18
rhtab per-producer:
  133.51 129.47 74.52 129.29 102.26 129.98 107.64
  mean 115.24, range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap that htab
pays.

3. MEMORY (overwrite, -p 8, no --preallocated)

value_size | htab ops/s  | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B       | 122.87 k/s  | 133.04 k/s  | 2.47 MiB | 2.49 MiB
4096 B     |  64.43 k/s  |  65.38 k/s  | 6.74 MiB | 6.44 MiB

rhtab/htab: +8 % ops, +0.8 % mem (32 B)
            +1 % ops, -4 % mem (4096 B)

SUMMARY

* Small / well-fitting map: htab is faster (cache-friendly fixed bucket
  array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster (1.2x to 4x)
  because rhashtable resizes to keep the load factor sane while htab stays
  stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer via in-place
  memcpy.
* Memory benchmark: effectively on par.

---

Changes in v7:
- rhashtable_next_key: move into lib/rhashtable.c, drop params argument
  (Herbert).
- rhashtable_next_key: kdoc clarifies that behavior on tables with
  duplicate keys is undefined (sashiko).
- rhashtable: include Herbert's "Use irq work for shrinking" patch so
  __rhashtable_remove_fast_one() can fire the shrink path from NMI context
  (Herbert).
- hashtab: fix u32 multiply overflow in
  __rhtab_map_lookup_and_delete_batch copy_to_user; cast total to size_t
  before multiplying by key_size / value_size (sashiko, bot+bpf-ci).
- hashtab: allow kptr/refcount fields in rhtab values (same model as array
  map).
- Link to v6: https://patch.msgid.link/20260602-rhash-v6-0-1bfd35a4184f@meta.com

Changes in v6:
- rhashtable_next_key: advance past duplicate keys in the main bucket
  chain to avoid an infinite loop when there are duplicate keys (sashiko).
- rhashtable_next_key: return ERR_PTR(-EOPNOTSUPP) on rhltable (sashiko).
- rhashtable: selftest pre-sizes the table to avoid concurrent rehash
  triggering spurious failures (sashiko).
- hashtab: real rhtab_map_mem_usage in the basic commit; move
  bpf_map_free_internal_structs from rhtab_free_elem into the
  special-fields commit where it does meaningful work (bot+bpf-ci).
- bpf_iter (seq_file): switch to rhashtable_walk_* for stronger coverage
  under concurrent rehash; get_next_key and batch keep
  rhashtable_next_key (sashiko).
- iter ops: rhtab_map_get_next_key adds IS_ERR check before dereferencing
  the element pointer (sashiko).
- iter ops: bpf_each_rhash_elem removes cond_resched() (sashiko).
- iter ops: batch returns -EAGAIN (not -ENOENT) on cursor delete, so
  userspace can distinguish lost cursor from end-of-iteration and restart
  from NULL (sashiko).
- Link to v5: https://patch.msgid.link/20260528-rhash-v5-0-7205191b6c57@meta.com

Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
  detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison on
  the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations from the
  basic commit; iteration commit adds bodies, structs, and registrations
  together. .map_get_next_key keeps a stub registration in the basic
  commit because the syscall dispatcher does not NULL-check it; iteration
  commit replaces the stub body with the real implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead element
  key but then resumed via next_key(cursor), skipping that element across
  batch boundaries and orphaning it on lookup_and_delete_batch. v5 stashes
  the lookahead key and looks it up directly on the next batch entry
  (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing, matching
  arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com

Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based iteration
  for BPF (also drops earlier rhashtable_walk_enter_from() proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
  U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
  bucket-lock paths: closes same-CPU NMI tracing recursion without
  rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating user
  buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn via
  static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
  recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
  garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-fields
  commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com

Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize
  (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
  fail on resize where we fully iterate on the table inside the kernel,
  don't fail on resize where iteration goes through userspace. Exception:
  rhtab_map_free_internal_structs() should be safe to iterate fully in
  kernel, with no risk of an infinite loop, because no user holds a
  reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/

Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
  of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
  iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{ "series": { "revision": 7, "change-id": "20251103-rhash-7b70069923d8", "prefixes": [ "bpf-next" ], "history": { "v1": [ "20260205-rhash-v1-0-30dd6d63c462@meta.com" ], "v2": [ "20260408-rhash-v2-0-3b3675da1f6e@meta.com" ], "v3": [ "20260424-rhash-v3-0-d0fa0ce4379b@meta.com" ], "v4": [ "20260513-rhash-v4-0-dd3d541ccb0b@meta.com" ], "v5": [ "20260528-rhash-v5-0-7205191b6c57@meta.com" ], "v6": [ "20260602-rhash-v6-0-1bfd35a4184f@meta.com" ] } } }

Introduce a simpler iteration mechanism for rhashtable that lets the caller
continue from an arbitrary position by supplying the previous key, without
the per-iterator state of the rhashtable_walk_* API.

    void *rhashtable_next_key(struct rhashtable *ht, const void *prev_key);

Caller holds RCU; passes NULL prev_key for the first element or the
previously returned key to advance. Walks the tbl->future_tbl chain so
in-flight rehashes are observed.

Best-effort: in case of a concurrent resize, it provides no guarantees:
- may produce duplicate elements
- may skip any number of elements
- termination of the loop is not guaranteed in case of sustained rehash.

Callers are advised to bound the loop externally or avoid inserting new
elements during such a loop.

Returns ERR_PTR(-ENOENT) if prev_key is not found. Behavior on tables with
duplicate keys is undefined. rhltable is not supported; the call returns
ERR_PTR(-EOPNOTSUPP).

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Insert n elements, then verify:
- NULL prev_key walks from the beginning, visiting all n
- non-existing prev_key returns ERR_PTR(-ENOENT)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Use irq work for automatic shrinking so that this may be called in NMI context. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast() for deletes, and rhashtable_lookup_get_insert_fast() for inserts. Updates modify values in place under RCU rather than allocating a new element and swapping the pointer (as regular htab does). This trades read consistency for performance: concurrent readers may see partial updates. BPF_F_LOCK support and special-field handling (timers, kptrs, etc.) follow in a later commit. Initialize rhashtable with bpf_mem_alloc element cache. Require BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via rhashtable_free_and_destroy(). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
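The two update strategies contrasted above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct elem`, `update_swap()` and `update_in_place()` are hypothetical names rather than code from the patch, and a C11 atomic pointer stands in for RCU pointer publication.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define VALUE_SIZE 16

/* Hypothetical element layout, for illustration only. */
struct elem {
	unsigned char value[VALUE_SIZE];
};

/*
 * htab-style update: build a complete new element, then publish it with a
 * single atomic pointer exchange. A reader that loads the slot sees either
 * the old element or the new one, never a mix of both.
 * The previous element is handed back through *old; kernel code would free
 * it only after an RCU grace period.
 */
static int update_swap(_Atomic(struct elem *) *slot, const void *value,
		       struct elem **old)
{
	struct elem *fresh = malloc(sizeof(*fresh));

	assert(slot && value && old);
	if (!fresh)
		return -1;
	memcpy(fresh->value, value, VALUE_SIZE);
	*old = atomic_exchange(slot, fresh);
	return 0;
}

/*
 * rhtab-style update: overwrite the bytes of the element readers already
 * point at. No allocation and no pointer swap, but a reader running
 * concurrently with this memcpy may observe a partially updated value.
 */
static void update_in_place(struct elem *e, const void *value)
{
	assert(e && value);
	memcpy(e->value, value, VALUE_SIZE);
}
```

The in-place variant skips the allocation and the exchange, which is the cost the benchmarks attribute to htab; the price is that nothing orders the byte copy against readers unless the value is protected, for example by BPF_F_LOCK.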

Implement get_next_key, batch lookup/lookup-and-delete, for_each_map_elem callback, and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH. get_next_key() and batch use rhashtable_next_key() — stateless, matches the syscall UAPI shape (no kernel-side iterator state). get_next_key falls back to the first key when prev_key was concurrently deleted (matches htab semantics). Batch reports cursor loss as -EAGAIN so userspace can distinguish it from end-of-iteration (-ENOENT) and restart from NULL. The seq_file BPF iterator uses rhashtable_walk_* instead. It runs only from read() syscall context, so the walker's spin_lock is safe, and seq_file's per-fd state lets the walker handle rehash correctly (retry on -EAGAIN) for stronger coverage than the stateless API can provide. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
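The -EAGAIN / -ENOENT convention for batch cursors can be illustrated with a small userspace model. Everything here is a toy under stated assumptions: `struct toy_map`, `toy_lookup_batch()` and `drain()` are hypothetical stand-ins, not the kernel or libbpf API, and the toy map simply holds keys 0..nr_keys-1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the map: keys 0..nr_keys-1, visited in order. */
struct toy_map {
	unsigned int nr_keys;
	int lose_cursor_once;	/* non-zero: next resumed call reports -EAGAIN */
};

/*
 * Hypothetical batch call following the convention described above:
 *   0       - batch filled, *out_cursor names the next key to process
 *   -ENOENT - this batch reached the end of the map
 *   -EAGAIN - the cursor key no longer exists, restart from a NULL cursor
 */
static int toy_lookup_batch(struct toy_map *m, const unsigned int *in_cursor,
			    unsigned int *out_cursor, unsigned int *keys,
			    unsigned int *count)
{
	unsigned int next = in_cursor ? *in_cursor : 0;
	unsigned int n = 0;

	if (in_cursor && m->lose_cursor_once) {
		m->lose_cursor_once = 0;
		return -EAGAIN;
	}
	while (next < m->nr_keys && n < *count)
		keys[n++] = next++;
	*count = n;
	if (next >= m->nr_keys)
		return -ENOENT;
	*out_cursor = next;
	return 0;
}

/* Drain the map; on a lost cursor, restart from NULL up to max_restarts times. */
static int drain(struct toy_map *m, unsigned int batch_size,
		 unsigned int *keys, int max_restarts, unsigned int *seen)
{
	unsigned int cursor = 0;
	bool resume = false;

	assert(batch_size > 0);
	*seen = 0;
	for (;;) {
		unsigned int count = batch_size;
		int err = toy_lookup_batch(m, resume ? &cursor : NULL,
					   &cursor, keys, &count);

		if (err == -EAGAIN) {
			if (max_restarts-- <= 0)
				return -EAGAIN;
			resume = false;	/* lost cursor: start over */
			continue;
		}
		if (err && err != -ENOENT)
			return err;
		*seen += count;	/* may include repeats after a restart */
		if (err == -ENOENT)
			return 0;	/* end of iteration */
		resume = true;
	}
}
```

A restart re-reads elements that were already returned, so a consumer following this pattern has to tolerate duplicates; that matches the best-effort iteration contract rather than an exactly-once one.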

Add support for timers, workqueues, task work, spin locks and kptrs. Without this, users needing deferred callbacks, BPF_F_LOCK, or refcounted kernel pointers in a dynamically-sized map have no option - fixed-size htab is the only map supporting these field types. Resizable hashtab should offer the same capability. kptr semantics under in-place updates are identical to array map. Properly clean up BTF record fields on element delete and map teardown by wiring up bpf_obj_free_fields through a memory allocator destructor, matching the pattern used by htab for non-prealloc maps. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Specialize the lookup/update/delete paths for keys whose size matches sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const rhashtable_params lets the compiler inline a custom XOR-fold hashfn and a single-word equality cmpfn, eliminating the indirect jhash dispatch. The same hashfn and cmpfn are installed into rhashtable's stored params at rhashtable_init time, so the rehash worker, slow-path inserts, and rhashtable_next_key() all agree with the inlined fast paths. The seq_file BPF iterator uses rhashtable_walk_* and is unaffected. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Add BPF_MAP_TYPE_RHASH to libbpf's map type name table and feature probing so that libbpf-based tools can create and identify resizable hash maps. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

Test basic map operations (lookup, update, delete) for BPF_MAP_TYPE_RHASH including boundary conditions like duplicate key insertion and deletion of nonexistent keys. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Test basic BPF iterator functionality for BPF_MAP_TYPE_RHASH, verifying all elements are visited. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Make bpftool documentation aware of the resizable hash map. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

Support resizable hashmap in BPF map benchmarks.

1. LOOKUP (single producer, M events/sec)

key | max | nr    | htab    | rhtab   | ratio | delta
----+-----+-------+---------+---------+-------+--------
  8 | 1K  | 750   |  99.85  |  81.92  | 0.82x |  -18 %
  8 | 1K  | 1K    | 100.71  |  80.19  | 0.80x |  -20 %
  8 | 1M  | 750K  |  23.37  |  72.09  | 3.08x | +208 %
  8 | 1M  | 1M    |  13.39  |  53.72  | 4.01x | +301 %
 32 | 1K  | 750   |  51.57  |  42.78  | 0.83x |  -17 %
 32 | 1K  | 1K    |  50.81  |  45.83  | 0.90x |  -10 %
 32 | 1M  | 750K  |  11.27  |  15.29  | 1.36x |  +36 %
 32 | 1M  | 1M    |   7.32  |   8.75  | 1.19x |  +19 %
256 | 1K  | 750   |   7.58  |   7.88  | 1.04x |   +4 %
256 | 1K  | 1K    |   7.43  |   7.81  | 1.05x |   +5 %
256 | 1M  | 750K  |   3.69  |   4.27  | 1.16x |  +16 %
256 | 1M  | 1M    |   2.60  |   3.12  | 1.20x |  +20 %

Pattern:

* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %.
* Large map (1M): rhtab wins everywhere, up to 4x at high load factor with
  8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the bucket array;
  htab stays at user-declared max.

2. FULL UPDATE (M events/sec per producer)

htab per-producer:
  20.33 22.02 19.27 23.61 24.18 23.17 21.07
  mean 21.94, range 19.27 - 24.18
rhtab per-producer:
  133.51 129.47 74.52 129.29 102.26 129.98 107.64
  mean 115.24, range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap that htab
pays.

3. MEMORY

value_size | htab ops/s  | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B       | 122.87 k/s  | 133.04 k/s  | 2.47 MiB | 2.49 MiB
4096 B     |  64.43 k/s  |  65.38 k/s  | 6.74 MiB | 6.44 MiB

rhtab/htab: +8 % ops, +0.8 % mem (32 B)
            +1 % ops, -4 % mem (4096 B)

Throughput effectively tied.

SUMMARY

* Small / well-fitting map: htab is faster (cache-friendly fixed bucket
  array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster (1.2x to 4x)
  because rhashtable resizes to keep the load factor sane while htab stays
  stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer via in-place
  memcpy.
* Memory benchmark: effectively on par.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

kernel-patches-review-bot · 2026-06-05T11:30:00Z

commit 980718cb1a37440e4463f18ce386501112cd82dc
Author: Herbert Xu <herbert@gondor.apana.org.au>

rhashtable: Use irq work for shrinking

This converts the automatic-shrinking path of __rhashtable_remove_fast_one()
from schedule_work(&ht->run_work) to irq_work_queue(&ht->run_irq_work) so
that shrinking can be triggered from NMI or irq-disabled context.

>     rhashtable: Use irq work for shrinking
>
>     Use irq work for automatic shrinking so that this may be called in NMI
>     context.
>
>     Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>     Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

This looks like a fix for a context that was made irq-safe earlier.

The schedule_work() call in the shrink path is not safe from the
irq-disabled regions introduced when rhashtable's bucket locks were
switched from local_bh_disable()/local_bh_enable() to
local_irq_save()/local_irq_restore().

Should this include:

  Fixes: e47877c7aa82 ("rhashtable: Allow rhashtable to be used from irq-safe contexts")

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: rhashtable: Use irq work for shrinking
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

kernel-patches-review-bot · 2026-06-05T11:32:06Z

commit 09d8943141dea1b5df41d508f0e877750ae1aa63
Author: Mykyta Yatsenko <yatsenko@meta.com>

rhashtable: Add rhashtable_next_key() API

This adds rhashtable_next_key(), a best-effort iteration helper that lets
a caller advance from an arbitrary position by passing the previously
returned key, without the per-iterator state of the rhashtable_walk_*
API. The caller holds RCU and the walk follows the tbl->future_tbl chain
so in-flight rehashes are observed.

> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index ef5230cece36..6f3aea498515 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -650,6 +650,46 @@ static __always_inline struct rhash_head *__rhashtable_lookup(

[ ... ]

> + *   - Behavior on tables that contain duplicate keys is undefined:
> + *     duplicates may be skipped, repeated, or trap the walk in a
> + *     cycle. Callers requiring duplicate-key iteration must use
> + *     rhashtable_walk_*() instead.

[ ... ]

> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 04b3a808fca9..dd6eaa09c55d 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -687,6 +687,75 @@ void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> +static struct rhash_head *__rhashtable_next_in_table(
> +	struct rhashtable *ht, struct bucket_table *tbl,
> +	const void *prev_key)
> +{
> +	struct rhashtable_compare_arg arg = { .ht = ht, .key = prev_key };
> +	const struct rhashtable_params params = ht->p;
> +	struct rhash_head *he;
> +	unsigned int b = 0;
> +	bool found = false;
> +
> +	if (prev_key) {
> +		b = rht_key_hashfn(ht, tbl, prev_key, params);
> +		rht_for_each_rcu(he, tbl, b) {
> +			bool match = params.obj_cmpfn
> +				     ? !params.obj_cmpfn(&arg, rht_obj(ht, he))
> +				     : !rhashtable_compare(&arg, rht_obj(ht, he));
> +			if (found) {
> +				if (match)
> +					continue;
> +				return he;
> +			}
> +			if (match)
> +				found = true;
> +		}

Can this walk fail to make forward progress and trap an RCU-read-locked
caller in a cycle when duplicate keys are interleaved with a colliding
key in the same bucket?

This was raised as a concern in v6 and does not appear to have a code
change in response. Consider a bucket chain like:

	A2 -> B -> A1

where A1 and A2 share key A and B collides into the same bucket. With
prev_key=A, the walk matches the head A2 (found=true), then returns B.
With prev_key=B, it returns A1. With prev_key=A again, it matches the
head A2 and returns B, so the iterator oscillates between B and A
without advancing.

The duplicate-skip branch:

	if (found) {
		if (match)
			continue;
		return he;
	}

only collapses adjacent matches in the chain, so it does not cover the
interleaved A2 -> B -> A1 ordering.

The same chain can also arise from a concurrent delete and insert of key
A producing A_new -> B -> A_old, which would trap an iterator holding the
RCU read lock.
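The oscillation can be reproduced with a single-bucket userspace model of the walk quoted above. This is a simplified sketch, not the kernel function: `struct node`, `next_in_chain()` and `count_visits()` are hypothetical names, keys are plain ints, and NULL stands for "end of this bucket" in place of moving on to the next bucket or returning an ERR_PTR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bucket-chain node: key A is modelled as 1, key B as 2. */
struct node {
	int key;
	const struct node *next;
};

/*
 * Same control flow as the quoted loop: find the first node matching
 * prev_key, skip matches that directly follow it, return the next
 * non-matching node.
 */
static const struct node *next_in_chain(const struct node *head,
					const int *prev_key)
{
	bool found = false;

	if (!prev_key)
		return head;
	for (const struct node *n = head; n; n = n->next) {
		bool match = n->key == *prev_key;

		if (found) {
			if (match)
				continue;
			return n;
		}
		if (match)
			found = true;
	}
	return NULL;
}

/* Count visited elements, giving up after max_steps to bound the walk. */
static int count_visits(const struct node *head, int max_steps)
{
	const struct node *n = next_in_chain(head, NULL);
	int steps = 0;

	assert(max_steps > 0);
	while (n && steps < max_steps) {
		int prev = n->key;

		steps++;
		n = next_in_chain(head, &prev);
	}
	return steps;
}
```

With the chain A2 -> B -> A1 the model only stops because of `max_steps`, while the adjacent ordering A2 -> A1 -> B finishes after two visits, which is the difference between the interleaved and the adjacent duplicate cases.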

The kdoc in rhashtable.h now documents duplicate-key behavior as
undefined:

> + *   - Behavior on tables that contain duplicate keys is undefined:
> + *     duplicates may be skipped, repeated, or trap the walk in a
> + *     cycle. Callers requiring duplicate-key iteration must use
> + *     rhashtable_walk_*() instead.

and the commit message notes BPF codepaths cannot produce duplicate
keys. The code-level cycle is still reachable for any caller on a table
that does contain duplicate keys. Was a code-level guard considered for
the interleaved case, or is the documented "undefined" contract intended
to be the only mitigation?

This was discussed upstream:
https://lore.kernel.org/bpf/20260602205716.BCDBE1F00893@smtp.kernel.org/

In-Reply-To-Subject: rhashtable: Add rhashtable_next_key() API
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

kernel-patches-review-bot · 2026-06-05T11:34:54Z

commit 2bb32731d59830059ed6a2bd730eeb36574cbd3c
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Implement iteration ops for resizable hashtab

This commit implements get_next_key, batch lookup/lookup-and-delete,
for_each_map_elem callback, and the seq_file BPF iterator for
BPF_MAP_TYPE_RHASH.

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 10f3a058747b..a149713d0953 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> @@ -3020,8 +3020,79 @@ static void rhtab_map_free_internal_structs(struct bpf_map *map)
>  }
>
> +static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
> +				void *callback_ctx, u64 flags)
> +{
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
> +	void *prev_key = NULL;
> +	struct rhtab_elem *elem;
> +	int num_elems = 0;
> +	u64 ret = 0;
> +
> +	cant_migrate();
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	/*
> +	 * Best-effort iteration: if rhashtable is concurrently resized or
> +	 * elements are deleted/inserted, there may be missed or duplicate
> +	 * elements visited.
> +	 */
> +	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key))) {
> +		if (IS_ERR(elem))
> +			break;

The rhashtable_next_key documentation notes that a full iteration is not
guaranteed to terminate under adversarial or sustained rehashing. Since
this loop lacks an external bound and runs under rcu_read_lock with
migration disabled, can continuous concurrent resizes cause an infinite
loop, resulting in a soft lockup or RCU stall?

If prev_key is deleted concurrently or by the callback itself,
rhashtable_next_key returns ERR_PTR(-ENOENT). Does breaking the loop here
cause the iteration to immediately abort and skip the rest of the map?

> +		num_elems++;
> +		ret = callback_fn((u64)(long)map,
> +				  (u64)(long)elem->data,
> +				  (u64)(long)rhtab_elem_value(elem, map->key_size),
> +				  (u64)(long)callback_ctx, 0);
> +		if (ret)
> +			break;
> +
> +		prev_key = elem->data;	/* valid while RCU held */
> +	}
> +	rcu_read_unlock();
> +
> +	return num_elems;
> +}

[ ... ]

> @@ -3034,6 +3105,275 @@ static u64 rhtab_map_mem_usage(const struct bpf_map *map)
>  	return sizeof(struct bpf_rhtab) + rhtab->elem_size * num_entries;
>  }
>
> +static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
> +					       const union bpf_attr *attr,
> +					       union bpf_attr __user *uattr,
> +					       bool do_delete)
> +{

[ ... ]

> +	rcu_read_lock();
> +
> +	/*
> +	 * Cursor stores the key of the next-to-process element (stashed by
> +	 * the previous batch). Look it up directly so the element is included
> +	 * here rather than skipped by next_key(). If the cursor was deleted
> +	 * concurrently (or by the previous do_delete batch), return -EAGAIN
> +	 * so userspace can distinguish a lost cursor from end-of-iteration
> +	 * (-ENOENT) and restart from a NULL cursor.
> +	 */
> +	if (ubatch) {
> +		elem = rhtab_lookup_elem(map, cursor);
> +		if (!elem) {
> +			rcu_read_unlock();
> +			ret = -EAGAIN;
> +			goto free;
> +		}
> +	} else {
> +		elem = rhashtable_next_key(&rhtab->ht, NULL);
> +	}
> +
> +	while (elem && !IS_ERR(elem) && total < max_count) {
> +		memcpy(dst_key, elem->data, key_size);
> +		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
> +		check_and_init_map_value(map, dst_val);
> +
> +		if (do_delete)
> +			del_elems[total] = elem;
> +
> +		elem = rhashtable_next_key(&rhtab->ht, dst_key);
> +		dst_key += key_size;
> +		dst_val += value_size;
> +		total++;
> +
> +		/* Bail to userspace to avoid stalls. */
> +		if (need_resched())
> +			break;
> +	}

Similar to the callback iteration, if a concurrent deletion occurs and
elem receives -ENOENT, this loop breaks. If total is greater than 0 when
this happens, doesn't this falsely signal to userspace that the map is
fully iterated, leading to missed elements?

> +
> +	if (elem && !IS_ERR(elem)) {
> +		/* Stash next-to-process key as cursor for the next batch. */
> +		memcpy(cursor, elem->data, key_size);
> +		has_next_cursor = true;
> +	}
> +
> +	if (do_delete) {
> +		for (i = 0; i < total; i++)
> +			rhtab_delete_elem(rhtab, del_elems[i], NULL, 0);
> +	}
> +
> +	rcu_read_unlock();

Deferring the deletion until after the iteration leaves elements visible
in the hash table while they are being copied into user buffers. Does
this create a race window where concurrent batch operations can traverse,
copy, and yield the same elements to userspace, breaking the exactly-once
consumption semantics expected for work queues?

> +
> +	if (total == 0) {
> +		ret = -ENOENT;
> +		goto free;
> +	}
> +
> +	/* No more elements after this batch. */
> +	if (!has_next_cursor)
> +		ret = -ENOENT;
> +
> +	if (copy_to_user(ukeys, keys, (size_t)total * key_size) ||
> +	    copy_to_user(uvalues, values, (size_t)total * value_size) ||
> +	    put_user(total, &uattr->batch.count) ||
> +	    (has_next_cursor &&
> +	     copy_to_user(u64_to_user_ptr(attr->batch.out_batch),
> +			  cursor, key_size))) {
> +		ret = -EFAULT;
> +		goto free;
> +	}
> +
> +free:
> +	kfree(cursor);
> +	kvfree(keys);
> +	kvfree(values);
> +	kvfree(del_elems);
> +	return ret;
> +}

[ ... ]

These concerns were raised by sashiko-bot@kernel.org in v6 review:
https://lore.kernel.org/bpf/20260602210416.2E0631F00893@smtp.kernel.org/

In-Reply-To-Subject: bpf: Implement iteration ops for resizable hashtab
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

kernel-patches-review-bot · 2026-06-05T11:36:05Z

commit 19dff21d2fed6076d1ca196fa711f085ceafe756
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Optimize word-sized keys for resizable hashtable

Specialize the lookup/update/delete paths for keys whose size matches
sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const
rhashtable_params lets the compiler inline a custom XOR-fold hashfn and
a single-word equality cmpfn, eliminating the indirect jhash dispatch.
The same hashfn and cmpfn are installed into rhashtable's stored params
at rhashtable_init time, so the rehash worker, slow-path inserts, and
rhashtable_next_key() all agree with the inlined fast paths.

The seq_file BPF iterator uses rhashtable_walk_* and is unaffected.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 7b9408b8320c..b4366cad3cfa 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -2763,6 +2763,31 @@ static inline void *rhtab_elem_value(struct rhtab_elem *l, u32 key_size)
>  	return l->data + round_up(key_size, 8);
>  }
>
> +/* Specialize hash function and objcmp for long sized key */
> +static __always_inline int rhtab_key_cmp_long(struct rhashtable_compare_arg *arg,
> +					      const void *ptr)
> +{
> +	const unsigned long key1 = *(const unsigned long *)arg->key;
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Could dereferencing this pointer directly cause alignment faults on
architectures that require strict alignment? Since BPF programs can pass
unaligned pointers such as from packet data in skb->data directly to map
helpers, using get_unaligned() might be necessary here to prevent kernel
panics on ARM32, SPARC, and other strict-alignment architectures.

This concern was raised by sashiko-bot@kernel.org in v5 and v6:
https://lore.kernel.org/bpf/20260528182522.23D121F00A3C@smtp.kernel.org/

The author acknowledged in v5 that "the same problem exists for the htab
jhash2" but did not address it in subsequent versions. The issue remains
in this version.
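For illustration, an alignment-safe word load can be written portably in userspace C with memcpy, which is the usual substitute for the kernel's get_unaligned(). This is a sketch under assumptions: `load_long_unaligned()` and `key_differs()` are hypothetical helpers, not the functions from the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Read one unsigned long from a possibly unaligned address. memcpy makes no
 * alignment assumption, and compilers typically lower it to a plain load on
 * architectures where unaligned access is cheap.
 */
static unsigned long load_long_unaligned(const void *p)
{
	unsigned long v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));
	return v;
}

/* Word-sized key comparison in the cmpfn style: 0 means the keys are equal. */
static int key_differs(const void *key, const void *stored)
{
	return load_long_unaligned(key) != load_long_unaligned(stored);
}
```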

> +	const struct rhtab_elem *key2 = ptr;
> +
> +	return key1 != *(const unsigned long *)key2->data;
> +}
> +
> +static __always_inline u32 rhtab_hashfn_long(const void *data, u32 len, u32 seed)
> +{
> +	u64 k = *(const unsigned long *)data;
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +	return (u32)(k ^ (k >> 32)) ^ seed;
                                    ^^^^^^

Does applying the seed after folding the key make hash collisions entirely
independent of the seed?

If a BPF program or an attacker inserts keys constructed as:

    k = (Y << 32) | (X ^ Y)

These keys will all fold to exactly X before the XOR with the seed, meaning:

    f(key1) ^ seed == f(key2) ^ seed

for any random seed.

If rhashtable detects a long bucket chain and triggers a background rehash
with a new seed, won't all these colliding keys move together to the same new
bucket? This might trap the rhashtable worker in an infinite loop of
rehashing, leading to a 100 percent CPU livelock.

This concern was raised by sashiko-bot@kernel.org in v5 and v6:
https://lore.kernel.org/bpf/20260528182522.23D121F00A3C@smtp.kernel.org/

The author acknowledged in v5 that "the attack is possible in theory" but
decided to "leave for performance" without addressing the vulnerability. The
issue remains in this version.
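
The collision construction above can be checked with a small userspace sketch. fold_hash() copies the folding expression from the quoted rhtab_hashfn_long() for a 64-bit key; colliding_key() is a hypothetical helper that builds k = (Y << 32) | (X ^ Y).

```c
#include <stdint.h>

/* Same folding as the quoted rhtab_hashfn_long(), assuming a 64-bit key. */
static inline uint32_t fold_hash(uint64_t k, uint32_t seed)
{
	return (uint32_t)(k ^ (k >> 32)) ^ seed;
}

/*
 * Hypothetical helper: build a key whose high word is y and whose low
 * word is x ^ y. Folding XORs the two words, so y cancels and x remains.
 */
static inline uint64_t colliding_key(uint32_t x, uint32_t y)
{
	return ((uint64_t)y << 32) | (uint64_t)(x ^ y);
}
```

For any fixed x, fold_hash(colliding_key(x, y), seed) evaluates to x ^ seed regardless of y, so two such keys hash identically under every seed.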

> +}
> +
> +static const struct rhashtable_params rhtab_params_long = {
> +	.head_offset = offsetof(struct rhtab_elem, node),
> +	.key_offset  = offsetof(struct rhtab_elem, data),
> +	.key_len     = sizeof(long),
> +	.hashfn      = rhtab_hashfn_long,
> +	.obj_cmpfn   = rhtab_key_cmp_long,
> +};
> +
>  static struct bpf_map *rhtab_map_alloc(union bpf_attr *attr)
>  {
>  	struct rhashtable_params params;

[ ... ]

AI reviewed your patch. Please fix the bug or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Optimize word-sized keys for resizable hashtable
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

Kernel Patches Daemon and others added 14 commits June 4, 2026 09:59

adding ci files

73f574b

rhashtable: Add selftest for rhashtable_next_key()

7834cc4

Insert n elements, then verify:
- NULL prev_key walks from the beginning, visiting all n
- non-existing prev_key returns ERR_PTR(-ENOENT)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

rhashtable: Use irq work for shrinking

980718c

Use irq work for automatic shrinking so that this may be called in NMI context.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

libbpf: Support resizable hashtable

8317af0

Add BPF_MAP_TYPE_RHASH to libbpf's map type name table and feature probing so that libbpf-based tools can create and identify resizable hash maps.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

selftests/bpf: Add basic tests for resizable hash map

9d44331

Test basic map operations (lookup, update, delete) for BPF_MAP_TYPE_RHASH, including boundary conditions like duplicate key insertion and deletion of nonexistent keys.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

selftests/bpf: Add BPF iterator tests for resizable hash map

aa2eefe

Test basic BPF iterator functionality for BPF_MAP_TYPE_RHASH, verifying all elements are visited.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

bpftool: Add rhash map documentation

057c710

Make bpftool documentation aware of the resizable hash map.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

kernel-patches-review-bot Bot added the ai-review label Jun 5, 2026

kernel-patches-daemon-bpf Bot force-pushed the bpf-next_base branch 11 times, most recently from 4f5632b to 970af1b (June 7, 2026 19:43)

kernel-patches-daemon-bpf Bot force-pushed the bpf-next_base branch 8 times, most recently from 3a26044 to 818f7b1 (June 10, 2026 04:29)


mykyta5 wants to merge 14 commits into bpf-next_base from b4/rhash
