
Fix kptr dtor deadlock #12417

Open
kernel-patches-daemon-bpf[bot] wants to merge 4 commits into bpf-next_base from series/1108815=>bpf-next

Conversation

@kernel-patches-daemon-bpf


Pull request for series with
subject: Fix kptr dtor deadlock
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1108815

@kernel-patches-daemon-bpf


Upstream branch: c15261b
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1108815
version: 3

@kernel-patches-review-bot

commit 99dd649358244d70fab2b545944a19a59e9a169a
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Reject bpf_obj_drop() from tracing progs

This patch adds NMI safety checks for bpf_obj_drop() and
bpf_percpu_obj_drop() when called from tracing programs. It introduces a
whitelist approach for NMI-safe BTF field types and rejects drops of
objects with NMI-unsafe fields from tracing contexts. The patch also
converts a test case from tracing to syscall to comply with the new
restriction.

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 62bba7a4876f..0654d2ffadc1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
>  	return rec->field_mask & type;
>  }
>
> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
> +{
> +	switch (type) {
> +	case BPF_SPIN_LOCK:
> +	case BPF_RES_SPIN_LOCK:
> +	case BPF_TIMER:
> +	case BPF_WORKQUEUE:
> +	case BPF_TASK_WORK:
> +	case BPF_KPTR_UNREF:
> +	case BPF_REFCOUNT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

Is it safe to include BPF_TIMER and BPF_WORKQUEUE in this NMI-safe
whitelist? Looking at the cleanup paths, dropping an object with these
fields calls bpf_timer_cancel_and_free() or bpf_wq_cancel_and_free(),
which synchronously invoke bpf_prog_put() via
bpf_async_update_prog_callback():

kernel/bpf/helpers.c:bpf_async_update_prog_callback()
    bpf_prog_put()
        __bpf_prog_put()
            schedule_work()

If this drops the last reference to the BPF program while in NMI context,
__bpf_prog_put() falls back to schedule_work() since irqs are disabled.
Since schedule_work() takes a raw spin lock on the workqueue pool, couldn't
this hard deadlock if the NMI interrupted another thread that was already
holding the workqueue pool lock?

BPF_TASK_WORK safely defers via irq_work_queue(), but these fields do not
appear to have the same protection.
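For reference, the deferral pattern that irq_work-style paths rely on can be modelled in userspace. The following is only a sketch with C11 atomics, assuming hypothetical names (`deferred_node`, `defer_push`, `defer_take_all`) that are not kernel APIs: the unsafe context does nothing but a compare-and-swap push, and a later, safe context detaches the list and processes it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a deferred cleanup request; not a kernel type. */
struct deferred_node {
	struct deferred_node *next;
	int id;
};

/* Head of a lock-free LIFO list, similar in spirit to the kernel's llist. */
static _Atomic(struct deferred_node *) deferred_head;

/*
 * Intended for a context that must not take locks: it only retries a
 * compare-and-swap and never waits on a lock another context may hold.
 */
static void defer_push(struct deferred_node *node)
{
	struct deferred_node *old;

	assert(node != NULL);
	old = atomic_load(&deferred_head);
	do {
		node->next = old;	/* 'old' is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak(&deferred_head, &old, node));
}

/*
 * Intended for a later, safe context: detach the whole list with a single
 * exchange. The caller then walks the returned list (newest entry first)
 * without contending with concurrent pushers.
 */
static struct deferred_node *defer_take_all(void)
{
	return atomic_exchange(&deferred_head, NULL);
}
```

The point of the model is that `defer_push()` has no lock to deadlock on, which is the property the schedule_work() path discussed above does not appear to have.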

This concern was raised by sashiko-bot in v1:
https://lore.kernel.org/bpf/20260608154008.27E011F00893@smtp.kernel.org/

Would it be safer to exclude BPF_TIMER and BPF_WORKQUEUE from the
NMI-safe whitelist?
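
If the answer is yes, the stricter variant could look roughly like the sketch below. It is a userspace model, not kernel code: `enum field_type` and both function names are hypothetical stand-ins for `enum btf_field_type` and the helpers in the patch, and the enum values are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for the kernel's enum btf_field_type. The members and their
 * values are illustrative and do not match include/linux/bpf.h.
 */
enum field_type {
	FIELD_SPIN_LOCK,
	FIELD_RES_SPIN_LOCK,
	FIELD_TIMER,
	FIELD_WORKQUEUE,
	FIELD_TASK_WORK,
	FIELD_KPTR_UNREF,
	FIELD_KPTR_REF,
	FIELD_REFCOUNT,
};

/* Whitelist with timer and workqueue fields left out. */
static bool field_is_nmi_safe(enum field_type type)
{
	switch (type) {
	case FIELD_SPIN_LOCK:
	case FIELD_RES_SPIN_LOCK:
	case FIELD_TASK_WORK:
	case FIELD_KPTR_UNREF:
	case FIELD_REFCOUNT:
		return true;
	default:
		/* FIELD_TIMER, FIELD_WORKQUEUE, and anything not listed above */
		return false;
	}
}

/* True if at least one field in the array is not on the whitelist. */
static bool record_has_nmi_unsafe_fields(const enum field_type *fields,
					 size_t cnt)
{
	assert(fields != NULL || cnt == 0);
	for (size_t i = 0; i < cnt; i++) {
		if (!field_is_nmi_safe(fields[i]))
			return true;
	}
	return false;
}
```

With this shape, an object holding only a lock and a refcount would still be droppable, while one that also holds a timer would be rejected.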

> +
> +static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
> +{
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return false;
> +	for (i = 0; i < rec->cnt; i++) {
> +		if (!btf_field_is_nmi_safe(rec->fields[i].type))
> +			return true;
> +	}
> +	return false;
> +}

This isn't a bug, but an earlier review suggested moving these helper
functions from the header into verifier.c, since they are only used there:
https://lore.kernel.org/bpf/6461e40c-d5a9-41a5-aaba-bf063de67d3f@gmail.com/

Is there a reason to keep them in the header?

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Reject bpf_obj_drop() from tracing progs
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27234422852

@kernel-patches-daemon-bpf


@kernel-patches-review-bot

commit 45e7d1059deddcbecdad88fa065d982fd50210a0
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>

selftests/bpf: Exercise unsafe obj drops from tracing progs

This adds task_kfunc failure cases for bpf_obj_drop() on local objects
with referenced kptr fields from tracing and NMI tracing programs, which
must be rejected because dropping the object would run full special-field
destruction synchronously in an unsafe context.

> diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
> index 8e947d445f8e..8942b5478129 100644
> --- a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
> +++ b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c

[ ... ]

> @@ -233,6 +234,45 @@ int BPF_PROG(task_kfunc_release_unacquired, struct task_struct *task, u64 clone_
>  	return 0;
>  }
>
> +SEC("tp_btf/task_newtask")
> +__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
> +int BPF_PROG(task_kfunc_obj_drop_with_kptr, struct task_struct *task, u64 clone_flags)
> +{

[ ... ]

> +SEC("tp_btf/task_newtask")
> +__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
> +int BPF_PROG(task_kfunc_obj_drop_nmi_with_kptr, struct task_struct *task,
> +	     u64 clone_flags)
> +{

Does task_kfunc_obj_drop_nmi_with_kptr actually exercise the NMI tracing
context its name and the commit message refer to?

It uses SEC("tp_btf/task_newtask"), the same standard tracing section as
task_kfunc_obj_drop_with_kptr just above it, so it would fail at load time
for the same verifier reason rather than from an NMI-specific path.

Would an NMI context such as BPF_PROG_TYPE_PERF_EVENT / SEC("perf_event")
be needed here to actually cover the NMI tracing case?

As written the two tests appear to provide redundant coverage and leave the
intended NMI context untested.

For context, an earlier version used SEC("tp_btf/nmi_handler"), which was
x86-only; switching to SEC("tp_btf/task_newtask") fixed that portability
issue but no longer exercises any NMI context.  This was raised in review
on the list and appears unaddressed in the current version:

  https://lore.kernel.org/bpf/20260609202548.3571690-4-memxor@gmail.com

> +	struct __tasks_kfunc_map_value *local;
> +	struct task_struct *acquired, *old;
> +
> +	(void)clone_flags;
> +
> +	local = bpf_obj_new(typeof(*local));
> +	if (!local)
> +		return 0;
> +
> +	acquired = bpf_task_acquire(task);
> +	if (acquired) {
> +		old = bpf_kptr_xchg(&local->task, acquired);
> +		if (old)
> +			bpf_task_release(old);
> +	}
> +
> +	bpf_obj_drop(local);
> +	return 0;
> +}



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftests/bpf: Exercise unsafe obj drops from tracing progs
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27234422852

@kernel-patches-review-bot

commit 863c40dec4ed733033e59d026643bab30ee49170
Author: Justin Suess <utilityemal77@gmail.com>

    bpf: Cancel special fields on map value recycle

    Map update and delete paths currently call bpf_obj_free_fields() when a
    value is being replaced or recycled. That makes field destruction depend
    on the context of the update/delete operation. For tracing programs this
    can include NMI context, where referenced kptr destructors, uptr
    unpinning, and graph root destruction are not generally safe.

    Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
    performs NMI-safe cleanup for timer, workqueue, and task_work fields.
    Fields that need full destruction are left attached to the recycled value
    and are destroyed by the final cleanup path instead.

    Switch array and hashtab update/delete/recycle paths to this cancel
    helper. Keep bpf_obj_free_fields() for final map destruction and for
    bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
    destructors, so teardown continues to walk the normal and extra elements
    and fully destroy their fields.

    This deliberately relaxes the eager-free semantics of map update/delete
    for special fields. Programs that relied on a recycled map slot becoming
    empty immediately after update/delete were relying on behavior that
    cannot be implemented safely from every BPF execution context without
    offloading arbitrary destructors.

    There is a chance this change breaks programs making assumptions
    regarding the eager freeing of fields. If so, we can relax semantics to
    cancellation only when irqs_disabled() is true in the future. However,
    theoretically, map values that get reused eagerly already have weaker
    guarantees as parallel users can recreate freed fields before the new
    element becomes visible again.

    Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
    Signed-off-by: Justin Suess <utilityemal77@gmail.com>
    Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

In the v1 discussion on lore, there was a suggestion about adding a
WARN_ON_ONCE for in_nmi in bpf_obj_free_fields to help detect NMI-related
bugs early:

> Would a WARN_ON_ONCE for in_nmi in bpf_obj_free_fields help spot these kind
> of bugs in the future?
>
> Usually calling bpf_obj_free_fields in_nmi won't result in a deadlock unless
> specific conditions are met (ie last reference to a refcounted object, using
> a workqueue, etc).
>
> Adding a check here would quickly surface bugs of this class.

Reference: https://lore.kernel.org/bpf/aicB9SZfVnUV1rz-@zenbox

The follow-up discussion argued against this approach, citing panic-on-warn
concerns and existing guardrails, and the suggestion was not implemented.
Would it be worth revisiting this as a future enhancement, perhaps with a
rate-limited warning or a config-gated check, to help catch this class of
bugs during development without impacting production systems?
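
One possible shape for such a config-gated, warn-once check, modelled in userspace C. `DEBUG_NMI_FREE_FIELDS`, `fake_in_nmi`, and the function names are hypothetical; a kernel version would presumably use in_nmi() and a real Kconfig symbol, neither of which is shown here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical build-time switch; not an existing Kconfig option. */
#ifndef DEBUG_NMI_FREE_FIELDS
#define DEBUG_NMI_FREE_FIELDS 1
#endif

/* Stand-in for in_nmi(); a caller sets this to simulate NMI context. */
static bool fake_in_nmi;

/* Counts emitted warnings so the warn-once behaviour can be observed. */
static int warnings_emitted;

/* Warns at most once, and only when the debug switch is compiled in. */
static void check_free_fields_context(void)
{
#if DEBUG_NMI_FREE_FIELDS
	static bool warned;

	if (fake_in_nmi && !warned) {
		warned = true;
		warnings_emitted++;
		fprintf(stderr, "free_fields called in simulated NMI context\n");
	}
#endif
}

/* Model of the destruction entry point with the check at the top. */
static void debug_checked_free_fields(void)
{
	check_free_fields_context();
	/* Field destruction would happen here. */
}
```

Because the check compiles to nothing when the switch is off, production builds in this model would not see the warning at all.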



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Cancel special fields on map value recycle
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27234422852

@kernel-patches-daemon-bpf


@kernel-patches-daemon-bpf


bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
program-allocated objects. When such an object contains NMI unsafe
fields, tracing programs that can run from arbitrary instrumented
context can reach that destruction from unsafe contexts, including NMI.

NMI is likely one instance of this problem, and other instances would
include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
appealing either: it would add delayed-free machinery to a release
operation that otherwise has straightforward synchronous ownership
semantics.

Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
that may run from unsafe contexts unless every field in the object's BTF
record is explicitly NMI safe. Do not reject sleepable
BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI
contexts that motivate the restriction.

Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
own to free, the objects recursively held by them may not be; be
conservative and just mark them as not NMI safe for now.

Use a whitelist for the NMI-safe field set instead of listing only known
NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
refcounts are known to be NMI safe because their destruction is either a
no-op, simple state reset, or async cancellation. Referenced kptrs,
percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
field type are rejected until audited for arbitrary tracing and NMI
contexts. This is less susceptible to future changes in fields that were
previously safe by exclusion, and to new fields being added without
updating this check.

Convert the existing recursive local-object drop success case to a
syscall program in the same commit, since this verifier change makes the
old tracing program form invalid. The test still exercises
bpf_obj_drop() releasing a referenced task kptr from a safe program
type.

Fixes: ac9f060 ("bpf: Introduce bpf_obj_drop")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

@kernel-patches-daemon-bpf

Upstream branch: 140fa23
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1108815
version: 3

RazeLighter777 and others added 3 commits June 9, 2026 21:31

Map update and delete paths currently call bpf_obj_free_fields() when a
value is being replaced or recycled. That makes field destruction depend
on the context of the update/delete operation. For tracing programs this
can include NMI context, where referenced kptr destructors, uptr
unpinning, and graph root destruction are not generally safe.

Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
performs NMI-safe cleanup for timer, workqueue, and task_work fields.
Fields that need full destruction are left attached to the recycled value
and are destroyed by the final cleanup path instead.

Switch array and hashtab update/delete/recycle paths to this cancel
helper. Keep bpf_obj_free_fields() for final map destruction and for
bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
destructors, so teardown continues to walk the normal and extra elements
and fully destroy their fields.

This deliberately relaxes the eager-free semantics of map update/delete
for special fields. Programs that relied on a recycled map slot becoming
empty immediately after update/delete were relying on behavior that
cannot be implemented safely from every BPF execution context without
offloading arbitrary destructors.

There is a chance this change breaks programs making assumptions
regarding the eager freeing of fields. If so, we can relax semantics to
cancellation only when irqs_disabled() is true in the future. However,
theoretically, map values that get reused eagerly already have weaker
guarantees as parallel users can recreate freed fields before the new
element becomes visible again.

Fixes: 14a324f ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Add task_kfunc failure cases for bpf_obj_drop() on local objects with
referenced kptr fields from tracing and NMI tracing programs. These programs
must be rejected because dropping the object would run full special-field
destruction synchronously in an unsafe context.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Add focused map_kptr coverage for BPF-side map updates that touch values
containing referenced kptrs.

The new syscall programs stash the testmod refcounted object in an array
map, a preallocated hash map, and a no-prealloc hash map, then update the
same map from BPF. The refcount must remain elevated after the update,
while the userspace runner destroys the skeleton and reuses the existing
refcount wait to confirm map teardown releases the kptr.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
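
The semantics this last test checks can be modelled in a few lines of userspace C. This is only a sketch under assumptions: `ref_obj`, `map_value`, `sketch_cancel_fields()`, and `sketch_free_fields()` are hypothetical stand-ins, not the kernel's bpf_obj_cancel_fields() or bpf_obj_free_fields().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for the testmod kptr target. */
struct ref_obj {
	int refcount;
};

/* Hypothetical map value with one async field and one referenced kptr. */
struct map_value {
	bool timer_armed;
	struct ref_obj *kptr;
};

/* Models the update/delete/recycle path: only the async field is reset. */
static void sketch_cancel_fields(struct map_value *v)
{
	v->timer_armed = false;
	/* v->kptr is deliberately left attached to the recycled value. */
}

/* Models final teardown: every field is destroyed, including the kptr. */
static void sketch_free_fields(struct map_value *v)
{
	v->timer_armed = false;
	if (v->kptr) {
		assert(v->kptr->refcount > 0);
		v->kptr->refcount--;
		v->kptr = NULL;
	}
}
```

In this model the reference count stays elevated across the cancel step and only drops at teardown, which mirrors what the commit message says the selftest expects.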