
B4/rhash#12359

Draft
mykyta5 wants to merge 14 commits into bpf-next_base from b4/rhash

Conversation

@mykyta5 commented Jun 5, 2026

Collaborator

No description provided.

Kernel Patches Daemon and others added 14 commits June 4, 2026 09:59
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
 under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- Updates values in place for improved performance
- max_entries serves as a hard limit, not bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking, similar to
  the existing BPF hashmap's raw_spin_lock_irqsave(). Insertions and
  deletions may fail.
- Iterations are best-effort: if resizes, insertions, or deletions take
  place concurrently, an iteration may visit the same element multiple
  times or skip elements.
- Locks out insertions while the special-fields destructor runs, to
  guarantee that it completes.

The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

---

Update implementation
---------------------
The current implementation of BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on value consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does a memcpy in place with no lock held.
rhash trades consistency for speed: concurrent readers can observe
partially updated data. Two concurrent writers to the same key can
also interleave, producing mixed values. This is similar to the arraymap
update implementation, including the handling of special fields.
As a solution, users may use BPF_F_LOCK to guarantee consistent reads
and write serialization.

Summary of the read consistency guarantees:

  map type     |  write mechanism |  read consistency
  -------------+------------------+--------------------------
  htab         |  alloc, swap ptr |  always consistent (RCU)
  htab  F_LOCK |  in-place + lock |  consistent if reader locks
  -------------+------------------+--------------------------
  rhtab        |  in-place memcpy |  torn reads
  rhtab F_LOCK |  in-place + lock |  consistent if reader locks
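
The two write mechanisms in the table can be contrasted in plain userspace C. The following is a minimal, single-threaded sketch that models the moment a reader observes a half-finished in-place copy. `struct value` and all helper names are hypothetical and are not taken from the patch, and the real htab path publishes the pointer with RCU primitives rather than a plain store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-word map value; a consistent value has a == b. */
struct value {
	unsigned long a;
	unsigned long b;
};

/*
 * Copy only the first `bytes` bytes of `src` over `dst`. This models the
 * state a concurrent reader could observe partway through an in-place
 * memcpy (assumption: the reader runs between the two word stores).
 */
static void partial_update_in_place(struct value *dst, const struct value *src,
				    size_t bytes)
{
	memcpy(dst, src, bytes);
}

/* Returns 1 if a reader looking at `v` would see a mixed (torn) value. */
static int is_torn(const struct value *v)
{
	return v->a != v->b;
}

/*
 * Swap-style update: the new value is built off to the side and published
 * with a single pointer store, so a reader sees either the old or the new
 * value, never a mix.
 */
static void update_by_swap(struct value **slot, struct value *fresh)
{
	*slot = fresh;
}
```

Starting from `{1, 1}` and copying only the first word of `{2, 2}` leaves `{2, 1}`, which `is_torn()` reports as mixed. The swap-style update never exposes that intermediate state.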

Benchmarks
----------
1. LOOKUP  (single producer, M events/sec)
  key | max | nr    |    htab |   rhtab | ratio | delta
  ----+-----+-------+---------+---------+-------+-------
    8 |  1K |   750 |   99.85 |   81.92 | 0.82x |  -18 %
    8 |  1K |    1K |  100.71 |   80.19 | 0.80x |  -20 %
    8 |  1M |  750K |   23.37 |   72.09 | 3.08x | +208 %
    8 |  1M |    1M |   13.39 |   53.72 | 4.01x | +301 %
   32 |  1K |   750 |   51.57 |   42.78 | 0.83x |  -17 %
   32 |  1K |    1K |   50.81 |   45.83 | 0.90x |  -10 %
   32 |  1M |  750K |   11.27 |   15.29 | 1.36x |  +36 %
   32 |  1M |    1M |    7.32 |    8.75 | 1.19x |  +19 %
  256 |  1K |   750 |    7.58 |    7.88 | 1.04x |   +4 %
  256 |  1K |    1K |    7.43 |    7.81 | 1.05x |   +5 %
  256 |  1M |  750K |    3.69 |    4.27 | 1.16x |  +16 %
  256 |  1M |    1M |    2.60 |    3.12 | 1.20x |  +20 %

Pattern:
  * Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %
    because the preallocated bucket array fits in L1.  Equalises
    at 256 byte keys.
  * Large map (1M): rhtab wins everywhere, up to 4x at high load
    factor with 8 byte keys.
  * Higher load factor amplifies rhtab's lead: rhtab grows the
    bucket array; htab stays at user-declared max.

2. FULL UPDATE  (M events/sec per producer, -p 7)

  htab  per-producer:
    20.33   22.02   19.27   23.61   24.18   23.17   21.07
    mean  21.94   range  19.27 - 24.18

  rhtab per-producer:
   133.51  129.47   74.52  129.29  102.26  129.98  107.64
    mean 115.24   range  74.52 - 133.51

  speedup (mean): 5.25x   (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
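
The summary figures for the update benchmark can be re-derived from the per-producer samples listed above. The helper below is a small sketch for doing so. The sample values are copied from this message, and the function and array names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Per-producer throughput samples (M events/sec) copied from the text. */
static const double htab_samples[] = {
	20.33, 22.02, 19.27, 23.61, 24.18, 23.17, 21.07,
};
static const double rhtab_samples[] = {
	133.51, 129.47, 74.52, 129.29, 102.26, 129.98, 107.64,
};

#define N_PRODUCERS (sizeof(htab_samples) / sizeof(htab_samples[0]))

/* Arithmetic mean of `n` samples; returns 0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/* Ratio of the rhtab mean to the htab mean. */
static double mean_speedup(void)
{
	return mean(rhtab_samples, N_PRODUCERS) /
	       mean(htab_samples, N_PRODUCERS);
}
```

With the printed samples, the htab mean comes out at 21.95, which is 0.01 above the quoted 21.94, presumably because the listed samples are themselves rounded. The rhtab mean (115.24) and the 5.25x ratio match.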

3. MEMORY  (overwrite, -p 8, no --preallocated)

  value_size |  htab ops/s | rhtab ops/s | htab mem | rhtab mem
  -----------+-------------+-------------+----------+----------
       32 B  |  122.87 k/s |  133.04 k/s | 2.47 MiB | 2.49 MiB
     4096 B  |   64.43 k/s |   65.38 k/s | 6.74 MiB | 6.44 MiB
  rhtab/htab :  +8 % ops, +0.8 % mem   (32 B)
                +1 % ops,  -4  % mem (4096 B)

SUMMARY

  * Small / well-fitting map: htab is faster (cache-friendly
    fixed bucket array), but only by ~10-20 %.
  * Large / high-load-factor map: rhtab is dramatically faster
    (1.2x to 4x) because rhashtable resizes to keep the load
    factor sane while htab stays stuck at user-declared max.
  * Update-heavy workloads: rhtab is ~5x faster per producer
    via in-place memcpy.
  * Memory benchmark: effectively on par

---
Changes in v7:
- rhashtable_next_key: move into lib/rhashtable.c, drop params argument
  (Herbert).
- rhashtable_next_key: kdoc clarifies that behavior on tables with
  duplicate keys is undefined (sashiko).
- rhashtable: include Herbert's "Use irq work for shrinking" patch so
  __rhashtable_remove_fast_one() can fire the shrink path from NMI
  context (Herbert).
- hashtab: fix u32 multiply overflow in __rhtab_map_lookup_and_delete_batch
  copy_to_user; cast total to size_t before multiplying by key_size /
  value_size (sashiko, bot+bpf-ci).
- hashtab: allow kptr/refcount fields in rhtab values (same model as
  array map).
- Link to v6: https://patch.msgid.link/20260602-rhash-v6-0-1bfd35a4184f@meta.com
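
The u32 multiply overflow mentioned in the v7 list is easy to reproduce outside the kernel. This sketch compares a 32-bit product with one widened to size_t before multiplying. The function names are hypothetical and the operand values are chosen only to trigger the wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length computed entirely in 32 bits: wraps modulo 2^32. */
static uint32_t copy_len_u32(uint32_t total, uint32_t elem_size)
{
	return total * elem_size;
}

/*
 * Length computed after widening `total` to size_t, in the style of the
 * fix. On a 64-bit build the product no longer wraps for u32 operands.
 */
static size_t copy_len_size_t(uint32_t total, uint32_t elem_size)
{
	return (size_t)total * elem_size;
}
```

With `total = 65536` and `elem_size = 65536`, the 32-bit product wraps to 0 while the widened product is 2^32 on a 64-bit build. On a 32-bit build size_t is itself 32 bits wide, so the cast alone does not help there.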

Changes in v6:
- rhashtable_next_key: advance past duplicate keys in the main bucket
  chain to avoid an infinite loop when there are duplicate keys
  (sashiko).
- rhashtable_next_key: return ERR_PTR(-EOPNOTSUPP) on rhltable (sashiko).
- rhashtable: selftest pre-sizes the table to avoid concurrent rehash
  triggering spurious failures (sashiko).
- hashtab: real rhtab_map_mem_usage in the basic commit; move
  bpf_map_free_internal_structs from rhtab_free_elem into the
  special-fields commit where it does meaningful work (bot+bpf-ci).
- bpf_iter (seq_file): switch to rhashtable_walk_* for stronger
  coverage under concurrent rehash; get_next_key and batch keep
  rhashtable_next_key (sashiko).
- iter ops: rhtab_map_get_next_key adds IS_ERR check
  before dereferencing the element pointer (sashiko).
- iter ops: bpf_each_rhash_elem removes cond_resched() (sashiko).
- iter ops: batch returns -EAGAIN (not -ENOENT) on cursor delete,
  so userspace can distinguish lost cursor from end-of-iteration
  and restart from NULL (sashiko).

- Link to v5: https://patch.msgid.link/20260528-rhash-v5-0-7205191b6c57@meta.com

Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
  detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison
  on the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations
  from the basic commit; iteration commit adds bodies, structs, and
  registrations together. .map_get_next_key keeps a stub registration
  in the basic commit because the syscall dispatcher does not
  NULL-check it; iteration commit replaces the stub body with the
  real implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead
  element key but then resumed via next_key(cursor), skipping that
  element across batch boundaries and orphaning it on
  lookup_and_delete_batch. v5 stashes the lookahead key and looks
  it up directly on the next batch entry (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing,
  matching arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com

Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
  iteration for BPF (also drops earlier rhashtable_walk_enter_from()
  proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
  U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
  bucket-lock paths: closes same-CPU NMI tracing recursion without
  rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
  user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
  via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
  recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
  garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-
  fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com

Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
  fail on resize where we fully iterate over the table inside the kernel,
  don't fail on resize where iteration goes through userspace. Exception:
  rhtab_map_free_internal_structs() should be safe to iterate fully
  in kernel, with no risk of an infinite loop, because no user holds a
  reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/

Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable to use the walk API
  instead of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() to rhashtable to support O(1)
  iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 7,
    "change-id": "20251103-rhash-7b70069923d8",
    "prefixes": [
      "bpf-next"
    ],
    "history": {
      "v1": [
        "20260205-rhash-v1-0-30dd6d63c462@meta.com"
      ],
      "v2": [
        "20260408-rhash-v2-0-3b3675da1f6e@meta.com"
      ],
      "v3": [
        "20260424-rhash-v3-0-d0fa0ce4379b@meta.com"
      ],
      "v4": [
        "20260513-rhash-v4-0-dd3d541ccb0b@meta.com"
      ],
      "v5": [
        "20260528-rhash-v5-0-7205191b6c57@meta.com"
      ],
      "v6": [
        "20260602-rhash-v6-0-1bfd35a4184f@meta.com"
      ]
    }
  }
}
Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.

  void *rhashtable_next_key(struct rhashtable *ht,
                            const void *prev_key);

Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.

Best-effort: in case of concurrent resize, provides no guarantees:
 - may produce duplicate elements
 - may skip any number of elements
 - termination of the loop is not guaranteed in case of
   sustained rehash. Callers are advised to bound the loop externally
   or avoid inserting new elements during such a loop.

Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported — returns ERR_PTR(-EOPNOTSUPP).
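
The key-based continuation described above can be modelled in userspace on a fixed-size chained table. This is only a sketch of the calling convention: there is no RCU, no rehash and no future_tbl chain, and every name in it (`toy_table`, `toy_next_key`, the `toy_enoent` sentinel standing in for ERR_PTR(-ENOENT)) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUCKETS 4

struct toy_node {
	unsigned int key;
	struct toy_node *next;
};

struct toy_table {
	struct toy_node *buckets[TOY_BUCKETS];
};

/* Sentinel standing in for ERR_PTR(-ENOENT) in this model. */
static struct toy_node toy_enoent;

static unsigned int toy_hash(unsigned int key)
{
	return key % TOY_BUCKETS;
}

/* Insert at the head of the key's bucket chain. */
static void toy_insert(struct toy_table *t, struct toy_node *n)
{
	unsigned int b = toy_hash(n->key);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* First node in bucket `b` or any later bucket, NULL if there is none. */
static struct toy_node *toy_first_from(struct toy_table *t, unsigned int b)
{
	for (; b < TOY_BUCKETS; b++)
		if (t->buckets[b])
			return t->buckets[b];
	return NULL;
}

/*
 * NULL prev_key yields the first node. Otherwise locate prev_key in its
 * bucket and return the node after it, moving on to later buckets when
 * the chain ends. A prev_key that is not present yields the sentinel.
 */
static struct toy_node *toy_next_key(struct toy_table *t,
				     const unsigned int *prev_key)
{
	struct toy_node *n;
	unsigned int b;

	if (!prev_key)
		return toy_first_from(t, 0);

	b = toy_hash(*prev_key);
	for (n = t->buckets[b]; n; n = n->next)
		if (n->key == *prev_key)
			return n->next ? n->next : toy_first_from(t, b + 1);
	return &toy_enoent;
}

/* Count nodes seen by a full walk; returns -1 if the cursor is lost. */
static int toy_count(struct toy_table *t)
{
	const unsigned int *prev = NULL;
	struct toy_node *n;
	int count = 0;

	while ((n = toy_next_key(t, prev))) {
		if (n == &toy_enoent)
			return -1;
		count++;
		prev = &n->key;
	}
	return count;
}
```

Because the only state carried between calls is the previous key, the walk in `toy_count()` needs no iterator object, which is the property the commit message highlights.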

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Insert n elements, then verify:
  - NULL prev_key walks from the beginning, visiting all n
  - non-existing prev_key returns ERR_PTR(-ENOENT)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Use irq work for automatic shrinking so that this may be called in NMI
context.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast()
for deletes, and rhashtable_lookup_get_insert_fast() for inserts.

Updates modify values in place under RCU rather than allocating a
new element and swapping the pointer (as regular htab does). This
trades read consistency for performance: concurrent readers may
see partial updates. BPF_F_LOCK support and special-field
handling (timers, kptrs, etc.) follow in a later commit.

Initialize rhashtable with bpf_mem_alloc element cache. Require
BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via
rhashtable_free_and_destroy().

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Implement get_next_key, batch lookup/lookup-and-delete, for_each_map_elem
callback, and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH.

get_next_key() and batch use rhashtable_next_key() — stateless,
matches the syscall UAPI shape (no kernel-side iterator state).
get_next_key falls back to the first key when prev_key was
concurrently deleted (matches htab semantics). Batch reports
cursor loss as -EAGAIN so userspace can distinguish it from
end-of-iteration (-ENOENT) and restart from NULL.
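
A userspace consumer might react to the two error codes along the following lines. This is a sketch under stated assumptions: `batch_fn` is a hypothetical callback standing in for one batch call (it is not a libbpf signature), and `scripted_batch` is a mock that only exists to exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one batch call. `in` is NULL on the first
 * call; `out` receives the cursor for the next call. Returns 0, -ENOENT
 * (end of iteration, *count may still be non-zero) or -EAGAIN (cursor
 * lost).
 */
typedef int (*batch_fn)(void *ctx, const void *in, void *out,
			unsigned int *count);

/*
 * Drain a map through `fn`, restarting from a NULL cursor on -EAGAIN.
 * Restarts are bounded by `max_restarts`, and a restart may return
 * elements that were already seen. Returns the number of elements
 * returned across all calls, or a negative errno.
 */
static long drain_batches(batch_fn fn, void *ctx, void *cursor,
			  int max_restarts)
{
	const void *in = NULL;
	long total = 0;
	int restarts = 0;

	for (;;) {
		unsigned int count = 0;
		int err = fn(ctx, in, cursor, &count);

		if (err == -EAGAIN) {
			if (++restarts > max_restarts)
				return -EAGAIN;
			in = NULL;	/* lost cursor: start over */
			continue;
		}
		if (err && err != -ENOENT)
			return err;

		total += count;
		if (err == -ENOENT)
			return total;	/* end of iteration */
		in = cursor;
	}
}

/* Mock: replays a fixed list of (errno, count) results, then -ENOENT. */
struct script {
	const int *errs;
	const unsigned int *counts;
	size_t pos, len;
};

static int scripted_batch(void *ctx, const void *in, void *out,
			  unsigned int *count)
{
	struct script *s = ctx;

	(void)in;
	(void)out;
	if (s->pos >= s->len)
		return -ENOENT;
	*count = s->counts[s->pos];
	return s->errs[s->pos++];
}
```

Bounding the number of restarts is a design choice of this sketch, made because a cursor can be lost repeatedly under sustained concurrent deletion.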

The seq_file BPF iterator uses rhashtable_walk_* instead. It runs
only from read() syscall context, so the walker's spin_lock is
safe, and seq_file's per-fd state lets the walker handle rehash
correctly (retry on -EAGAIN) for stronger coverage than the
stateless API can provide.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add support for timers, workqueues, task work, spin locks and kptrs.
Without this, users needing deferred callbacks, BPF_F_LOCK, or
refcounted kernel pointers in a dynamically-sized map have no option -
fixed-size htab is the only map supporting these field types.
Resizable hashtab should offer the same capability.

kptr semantics under in-place updates are identical to array map.

Properly clean up BTF record fields on element delete and map
teardown by wiring up bpf_obj_free_fields through a memory allocator
destructor, matching the pattern used by htab for non-prealloc maps.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Specialize the lookup/update/delete paths for keys whose size matches
sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const
rhashtable_params lets the compiler inline a custom XOR-fold hashfn and
a single-word equality cmpfn, eliminating the indirect jhash dispatch.
The same hashfn and cmpfn are installed into rhashtable's stored params
at rhashtable_init time, so the rehash worker, slow-path inserts, and
rhashtable_next_key() all agree with the inlined fast paths.

The seq_file BPF iterator uses rhashtable_walk_* and is unaffected.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Add BPF_MAP_TYPE_RHASH to libbpf's map type name table and feature
probing so that libbpf-based tools can create and identify resizable
hash maps.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Test basic BPF iterator functionality for BPF_MAP_TYPE_RHASH,
verifying all elements are visited.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Make bpftool documentation aware of the resizable hash map.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Support resizable hashmap in BPF map benchmarks.

1. LOOKUP  (single producer, M events/sec)

  key | max | nr    |    htab |   rhtab | ratio | delta
  ----+-----+-------+---------+---------+-------+-------
    8 |  1K |   750 |   99.85 |   81.92 | 0.82x |  -18 %
    8 |  1K |    1K |  100.71 |   80.19 | 0.80x |  -20 %
    8 |  1M |  750K |   23.37 |   72.09 | 3.08x | +208 %
    8 |  1M |    1M |   13.39 |   53.72 | 4.01x | +301 %
   32 |  1K |   750 |   51.57 |   42.78 | 0.83x |  -17 %
   32 |  1K |    1K |   50.81 |   45.83 | 0.90x |  -10 %
   32 |  1M |  750K |   11.27 |   15.29 | 1.36x |  +36 %
   32 |  1M |    1M |    7.32 |    8.75 | 1.19x |  +19 %
  256 |  1K |   750 |    7.58 |    7.88 | 1.04x |   +4 %
  256 |  1K |    1K |    7.43 |    7.81 | 1.05x |   +5 %
  256 |  1M |  750K |    3.69 |    4.27 | 1.16x |  +16 %
  256 |  1M |    1M |    2.60 |    3.12 | 1.20x |  +20 %

Pattern:
  * Small map (1K): htab wins for 8 / 32 byte keys by 10-20%
  * Large map (1M): rhtab wins everywhere, up to 4x at high load
    factor with 8 byte keys.
  * Higher load factor amplifies rhtab's lead: rhtab grows the
    bucket array; htab stays at user-declared max.

2. FULL UPDATE  (M events/sec per producer)

  htab  per-producer:
    20.33   22.02   19.27   23.61   24.18   23.17   21.07
    mean  21.94   range  19.27 - 24.18

  rhtab per-producer:
   133.51  129.47   74.52  129.29  102.26  129.98  107.64
    mean 115.24   range  74.52 - 133.51

  speedup (mean): 5.25x   (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.

3. MEMORY

  value_size |  htab ops/s | rhtab ops/s | htab mem | rhtab mem
  -----------+-------------+-------------+----------+----------
       32 B  |  122.87 k/s |  133.04 k/s | 2.47 MiB | 2.49 MiB
     4096 B  |   64.43 k/s |   65.38 k/s | 6.74 MiB | 6.44 MiB
  rhtab/htab :  +8 % ops, +0.8 % mem   (32 B)
                +1 % ops,  -4  % mem (4096 B)

Throughput effectively tied

SUMMARY

  * Small / well-fitting map: htab is faster (cache-friendly
    fixed bucket array), but only by ~10-20 %.
  * Large / high-load-factor map: rhtab is dramatically faster
    (1.2x to 4x) because rhashtable resizes to keep the load
    factor sane while htab stays stuck at user-declared max.
  * Update-heavy workloads: rhtab is ~5x faster per producer
    via in-place memcpy.
  * Memory benchmark: effectively on par.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
@kernel-patches-review-bot

commit 980718cb1a37440e4463f18ce386501112cd82dc
Author: Herbert Xu <herbert@gondor.apana.org.au>

rhashtable: Use irq work for shrinking

This converts the automatic-shrinking path of __rhashtable_remove_fast_one()
from schedule_work(&ht->run_work) to irq_work_queue(&ht->run_irq_work) so
that shrinking can be triggered from NMI or irq-disabled context.

>     rhashtable: Use irq work for shrinking
>
>     Use irq work for automatic shrinking so that this may be called in NMI
>     context.
>
>     Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>     Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

This looks like a fix for a context that was made irq-safe earlier.

The schedule_work() call in the shrink path is not safe from the
irq-disabled regions introduced when rhashtable's bucket locks were
switched from local_bh_disable()/local_bh_enable() to
local_irq_save()/local_irq_restore().

Should this include:

  Fixes: e47877c7aa82 ("rhashtable: Allow rhashtable to be used from irq-safe contexts")



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: rhashtable: Use irq work for shrinking
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

@kernel-patches-review-bot

commit 09d8943141dea1b5df41d508f0e877750ae1aa63
Author: Mykyta Yatsenko <yatsenko@meta.com>

rhashtable: Add rhashtable_next_key() API

This adds rhashtable_next_key(), a best-effort iteration helper that lets
a caller advance from an arbitrary position by passing the previously
returned key, without the per-iterator state of the rhashtable_walk_*
API. The caller holds RCU and the walk follows the tbl->future_tbl chain
so in-flight rehashes are observed.

> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index ef5230cece36..6f3aea498515 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -650,6 +650,46 @@ static __always_inline struct rhash_head *__rhashtable_lookup(

[ ... ]

> + *   - Behavior on tables that contain duplicate keys is undefined:
> + *     duplicates may be skipped, repeated, or trap the walk in a
> + *     cycle. Callers requiring duplicate-key iteration must use
> + *     rhashtable_walk_*() instead.

[ ... ]

> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 04b3a808fca9..dd6eaa09c55d 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -687,6 +687,75 @@ void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> +static struct rhash_head *__rhashtable_next_in_table(
> +	struct rhashtable *ht, struct bucket_table *tbl,
> +	const void *prev_key)
> +{
> +	struct rhashtable_compare_arg arg = { .ht = ht, .key = prev_key };
> +	const struct rhashtable_params params = ht->p;
> +	struct rhash_head *he;
> +	unsigned int b = 0;
> +	bool found = false;
> +
> +	if (prev_key) {
> +		b = rht_key_hashfn(ht, tbl, prev_key, params);
> +		rht_for_each_rcu(he, tbl, b) {
> +			bool match = params.obj_cmpfn
> +				     ? !params.obj_cmpfn(&arg, rht_obj(ht, he))
> +				     : !rhashtable_compare(&arg, rht_obj(ht, he));
> +			if (found) {
> +				if (match)
> +					continue;
> +				return he;
> +			}
> +			if (match)
> +				found = true;
> +		}

Can this walk fail to make forward progress and trap an RCU-read-locked
caller in a cycle when duplicate keys are interleaved with a colliding
key in the same bucket?

This was raised as a concern in v6 and does not appear to have a code
change in response. Consider a bucket chain like:

	A2 -> B -> A1

where A1 and A2 share key A and B collides into the same bucket. With
prev_key=A, the walk matches the head A2 (found=true), then returns B.
With prev_key=B, it returns A1. With prev_key=A again, it matches the
head A2 and returns B, so the iterator oscillates between B and A
without advancing.

The duplicate-skip branch:

	if (found) {
		if (match)
			continue;
		return he;
	}

only collapses adjacent matches in the chain, so it does not cover the
interleaved A2 -> B -> A1 ordering.

The same chain can also arise from a concurrent delete and insert of key
A producing A_new -> B -> A_old, which would trap an iterator holding the
RCU read lock.

The kdoc in rhashtable.h now documents duplicate-key behavior as
undefined:

> + *   - Behavior on tables that contain duplicate keys is undefined:
> + *     duplicates may be skipped, repeated, or trap the walk in a
> + *     cycle. Callers requiring duplicate-key iteration must use
> + *     rhashtable_walk_*() instead.

and the commit message notes BPF codepaths cannot produce duplicate
keys. The code-level cycle is still reachable for any caller on a table
that does contain duplicate keys. Was a code-level guard considered for
the interleaved case, or is the documented "undefined" contract intended
to be the only mitigation?

This was discussed upstream:
https://lore.kernel.org/bpf/20260602205716.BCDBE1F00893@smtp.kernel.org/



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: rhashtable: Add rhashtable_next_key() API
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

@kernel-patches-review-bot

commit 2bb32731d59830059ed6a2bd730eeb36574cbd3c
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Implement iteration ops for resizable hashtab

This commit implements get_next_key, batch lookup/lookup-and-delete,
for_each_map_elem callback, and the seq_file BPF iterator for
BPF_MAP_TYPE_RHASH.

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 10f3a058747b..a149713d0953 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> @@ -3020,8 +3020,79 @@ static void rhtab_map_free_internal_structs(struct bpf_map *map)
>  }
>
> +static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
> +				void *callback_ctx, u64 flags)
> +{
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
> +	void *prev_key = NULL;
> +	struct rhtab_elem *elem;
> +	int num_elems = 0;
> +	u64 ret = 0;
> +
> +	cant_migrate();
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	/*
> +	 * Best-effort iteration: if rhashtable is concurrently resized or
> +	 * elements are deleted/inserted, there may be missed or duplicate
> +	 * elements visited.
> +	 */
> +	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key))) {
> +		if (IS_ERR(elem))
> +			break;

The rhashtable_next_key documentation notes that a full iteration is not
guaranteed to terminate under adversarial or sustained rehashing. Since
this loop lacks an external bound and runs under rcu_read_lock with
migration disabled, can continuous concurrent resizes cause an infinite
loop, resulting in a soft lockup or RCU stall?

If prev_key is deleted concurrently or by the callback itself,
rhashtable_next_key returns ERR_PTR(-ENOENT). Does breaking the loop here
cause the iteration to immediately abort and skip the rest of the map?

> +		num_elems++;
> +		ret = callback_fn((u64)(long)map,
> +				  (u64)(long)elem->data,
> +				  (u64)(long)rhtab_elem_value(elem, map->key_size),
> +				  (u64)(long)callback_ctx, 0);
> +		if (ret)
> +			break;
> +
> +		prev_key = elem->data;	/* valid while RCU held */
> +	}
> +	rcu_read_unlock();
> +
> +	return num_elems;
> +}

[ ... ]

> @@ -3034,6 +3105,275 @@ static u64 rhtab_map_mem_usage(const struct bpf_map *map)
>  	return sizeof(struct bpf_rhtab) + rhtab->elem_size * num_entries;
>  }
>
> +static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
> +					       const union bpf_attr *attr,
> +					       union bpf_attr __user *uattr,
> +					       bool do_delete)
> +{

[ ... ]

> +	rcu_read_lock();
> +
> +	/*
> +	 * Cursor stores the key of the next-to-process element (stashed by
> +	 * the previous batch). Look it up directly so the element is included
> +	 * here rather than skipped by next_key(). If the cursor was deleted
> +	 * concurrently (or by the previous do_delete batch), return -EAGAIN
> +	 * so userspace can distinguish a lost cursor from end-of-iteration
> +	 * (-ENOENT) and restart from a NULL cursor.
> +	 */
> +	if (ubatch) {
> +		elem = rhtab_lookup_elem(map, cursor);
> +		if (!elem) {
> +			rcu_read_unlock();
> +			ret = -EAGAIN;
> +			goto free;
> +		}
> +	} else {
> +		elem = rhashtable_next_key(&rhtab->ht, NULL);
> +	}
> +
> +	while (elem && !IS_ERR(elem) && total < max_count) {
> +		memcpy(dst_key, elem->data, key_size);
> +		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
> +		check_and_init_map_value(map, dst_val);
> +
> +		if (do_delete)
> +			del_elems[total] = elem;
> +
> +		elem = rhashtable_next_key(&rhtab->ht, dst_key);
> +		dst_key += key_size;
> +		dst_val += value_size;
> +		total++;
> +
> +		/* Bail to userspace to avoid stalls. */
> +		if (need_resched())
> +			break;
> +	}

Similar to the callback iteration, if a concurrent deletion occurs and
elem receives -ENOENT, this loop breaks. If total is greater than 0 when
this happens, doesn't this falsely signal to userspace that the map is
fully iterated, leading to missed elements?

> +
> +	if (elem && !IS_ERR(elem)) {
> +		/* Stash next-to-process key as cursor for the next batch. */
> +		memcpy(cursor, elem->data, key_size);
> +		has_next_cursor = true;
> +	}
> +
> +	if (do_delete) {
> +		for (i = 0; i < total; i++)
> +			rhtab_delete_elem(rhtab, del_elems[i], NULL, 0);
> +	}
> +
> +	rcu_read_unlock();

Deferring the deletion until after the iteration leaves elements visible
in the hash table while they are being copied into user buffers. Does
this create a race window where concurrent batch operations can traverse,
copy, and yield the same elements to userspace, breaking the exactly-once
consumption semantics expected for work queues?

> +
> +	if (total == 0) {
> +		ret = -ENOENT;
> +		goto free;
> +	}
> +
> +	/* No more elements after this batch. */
> +	if (!has_next_cursor)
> +		ret = -ENOENT;
> +
> +	if (copy_to_user(ukeys, keys, (size_t)total * key_size) ||
> +	    copy_to_user(uvalues, values, (size_t)total * value_size) ||
> +	    put_user(total, &uattr->batch.count) ||
> +	    (has_next_cursor &&
> +	     copy_to_user(u64_to_user_ptr(attr->batch.out_batch),
> +			  cursor, key_size))) {
> +		ret = -EFAULT;
> +		goto free;
> +	}
> +
> +free:
> +	kfree(cursor);
> +	kvfree(keys);
> +	kvfree(values);
> +	kvfree(del_elems);
> +	return ret;
> +}

[ ... ]

These concerns were raised by sashiko-bot@kernel.org in v6 review:
https://lore.kernel.org/bpf/20260602210416.2E0631F00893@smtp.kernel.org/



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Implement iteration ops for resizable hashtab
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

@kernel-patches-review-bot

commit 19dff21d2fed6076d1ca196fa711f085ceafe756
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Optimize word-sized keys for resizable hashtable

Specialize the lookup/update/delete paths for keys whose size matches
sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const
rhashtable_params lets the compiler inline a custom XOR-fold hashfn and
a single-word equality cmpfn, eliminating the indirect jhash dispatch.
The same hashfn and cmpfn are installed into rhashtable's stored params
at rhashtable_init time, so the rehash worker, slow-path inserts, and
rhashtable_next_key() all agree with the inlined fast paths.

The seq_file BPF iterator uses rhashtable_walk_* and is unaffected.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 7b9408b8320c..b4366cad3cfa 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -2763,6 +2763,31 @@ static inline void *rhtab_elem_value(struct rhtab_elem *l, u32 key_size)
>  	return l->data + round_up(key_size, 8);
>  }
>
> +/* Specialize hash function and objcmp for long sized key */
> +static __always_inline int rhtab_key_cmp_long(struct rhashtable_compare_arg *arg,
> +					      const void *ptr)
> +{
> +	const unsigned long key1 = *(const unsigned long *)arg->key;
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Could dereferencing this pointer directly cause alignment faults on
architectures that require strict alignment? Since BPF programs can pass
unaligned pointers such as from packet data in skb->data directly to map
helpers, using get_unaligned() might be necessary here to prevent kernel
panics on ARM32, SPARC, and other strict-alignment architectures.

This concern was raised by sashiko-bot@kernel.org in v5 and v6:
https://lore.kernel.org/bpf/20260528182522.23D121F00A3C@smtp.kernel.org/

The author acknowledged in v5 that "the same problem exists for the htab
jhash2" but did not address it in subsequent versions. The issue remains
in this version.
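
For illustration, an alignment-safe load can be written portably with memcpy, which is the usual userspace analogue of the kernel's get_unaligned(). The sketch below is not the patch's code, and both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Portable unaligned load: copying into a local lets the compiler pick
 * an access sequence that is safe for the target architecture.
 */
static unsigned long load_ulong_unaligned(const void *p)
{
	unsigned long v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Key comparison in the style of the quoted cmpfn: 0 means equal. */
static int key_cmp_long_unaligned(const void *key, const void *elem_key)
{
	return load_ulong_unaligned(key) != load_ulong_unaligned(elem_key);
}
```

On architectures that allow unaligned access, compilers typically lower this memcpy to a single load, so the cost of the safe form is usually confined to strict-alignment targets.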

> +	const struct rhtab_elem *key2 = ptr;
> +
> +	return key1 != *(const unsigned long *)key2->data;
> +}
> +
> +static __always_inline u32 rhtab_hashfn_long(const void *data, u32 len, u32 seed)
> +{
> +	u64 k = *(const unsigned long *)data;
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +	return (u32)(k ^ (k >> 32)) ^ seed;
                                    ^^^^^^

Does applying the seed after folding the key make hash collisions entirely
independent of the seed?

If a BPF program or an attacker inserts keys constructed as:

    k = (Y << 32) | (X ^ Y)

These keys will all fold to exactly X before the XOR with the seed, meaning:

    f(key1) ^ seed == f(key2) ^ seed

for any random seed.

If rhashtable detects a long bucket chain and triggers a background rehash
with a new seed, won't all these colliding keys move together to the same new
bucket? This might trap the rhashtable worker in an infinite loop of
rehashing, leading to a 100 percent CPU livelock.

This concern was raised by sashiko-bot@kernel.org in v5 and v6:
https://lore.kernel.org/bpf/20260528182522.23D121F00A3C@smtp.kernel.org/

The author acknowledged in v5 that "the attack is possible in theory" but
decided to "leave for performance" without addressing the vulnerability. The
issue remains in this version.
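
The seed independence can be checked with a few lines of userspace C. The hash below has the same fold-then-seed shape as the quoted rhtab_hashfn_long() for 64-bit keys. The helper names are hypothetical, and the sketch only demonstrates equal hash values. It does not model rhashtable's bucket selection or rehash behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the two 32-bit halves together, then XOR in the seed. */
static uint32_t fold_hash(uint64_t k, uint32_t seed)
{
	return (uint32_t)(k ^ (k >> 32)) ^ seed;
}

/* Build a key whose halves fold to `x`: k = (y << 32) | (x ^ y). */
static uint64_t colliding_key(uint32_t x, uint32_t y)
{
	return ((uint64_t)y << 32) | (uint64_t)(x ^ y);
}

/* Returns 1 if keys built from y = 0..n-1 all hash alike under `seed`. */
static int all_collide(uint32_t x, uint32_t n, uint32_t seed)
{
	uint32_t want = fold_hash(colliding_key(x, 0), seed);

	for (uint32_t y = 1; y < n; y++)
		if (fold_hash(colliding_key(x, y), seed) != want)
			return 0;
	return 1;
}
```

Because the seed is applied after the fold, every generated key hashes to `x ^ seed`, so changing the seed moves the whole set together rather than spreading it out.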

> +}
> +
> +static const struct rhashtable_params rhtab_params_long = {
> +	.head_offset = offsetof(struct rhtab_elem, node),
> +	.key_offset  = offsetof(struct rhtab_elem, data),
> +	.key_len     = sizeof(long),
> +	.hashfn      = rhtab_hashfn_long,
> +	.obj_cmpfn   = rhtab_key_cmp_long,
> +};
> +
>  static struct bpf_map *rhtab_map_alloc(union bpf_attr *attr)
>  {
>  	struct rhashtable_params params;

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Optimize word-sized keys for resizable hashtable
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27011687425

@kernel-patches-daemon-bpf (bot) force-pushed the bpf-next_base branch 11 times, most recently from 4f5632b to 970af1b, June 7, 2026 19:43
@kernel-patches-daemon-bpf (bot) force-pushed the bpf-next_base branch 8 times, most recently from 3a26044 to 818f7b1, June 10, 2026 04:29