fix: defer KG rebuild during batch document deletion#2819
Conversation
|
Deferring the refactoring of entity relationships is a sound approach; however, there are several issues in the current PR that need to be addressed:
|
|
Addressed the feedback — latest push includes:
|
|
This PR conflicts with PR #2826 ("make document deletion retry-safe"). Please resolve the conflicts before we proceed with the review. |
7e74326 to
7c7f84b
Compare
|
Rebased onto main, conflicts with #2826 resolved. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c7f84b884
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not skip_rebuild: | ||
| await self.doc_status.delete([doc_id]) | ||
| await self.full_docs.delete([doc_id]) |
There was a problem hiding this comment.
Preserve rebuild metadata when deferring cleanup
When skip_rebuild=True, this path skips doc_status deletion but still proceeds after stage 10, where full_entities/full_relations are removed before returning success. If the later deferred rebuild fails in background_delete_documents, a retry cannot reconstruct entities_to_rebuild/relationships_to_rebuild because the per-doc graph metadata is already gone, so the promised “re-trigger deletion to retry” flow cannot actually repair the KG and can leave stale summaries indefinitely.
Useful? React with 👍 / 👎.
| elif not rebuild_ok: | ||
| keep_msg = ( | ||
| f"Keeping doc_status for {len(successful_deletions)} docs " | ||
| f"because rebuild failed — re-trigger deletion to retry" | ||
| ) |
There was a problem hiding this comment.
Mark deferred rebuild failures as failed deletions
In the rebuild-failure branch, the code logs a warning but never reclassifies the previously successful_deletions as failed. The finally block then emits a completion summary from these counters, so users can see a fully successful batch even though doc cleanup was intentionally withheld due to rebuild failure; this makes operational status inaccurate and can hide that manual retry is still required.
Useful? React with 👍 / 👎.
7c7f84b to
18eaa8a
Compare
|
@danielaskdd Thanks for the review — both points were solid catches. Here's what I did: 1. Redundant tasks in overlapping pools: Before the deferred rebuild runs, I now check whether each entity/relation still exists in the graph. If doc A tagged an entity for "rebuild" but doc B's deletion later removed it entirely, the rebuild for that node gets skipped. The pruning loop sits right before 2. Error handling / idempotency: When 3. Conflict with #2826: Rebased on latest main (which includes #2826). Resolved cleanly, no merge conflicts. The |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 415e12970c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not skip_rebuild: | ||
| await self.doc_status.delete([doc_id]) | ||
| await self.full_docs.delete([doc_id]) |
There was a problem hiding this comment.
Preserve graph metadata until deferred rebuild succeeds
When skip_rebuild=True, this branch keeps doc_status for retry, but the same call has already deleted full_entities, full_relations, and full_docs earlier in adelete_by_doc_id. If the later batch-level rebuild_knowledge_from_chunks fails, re-running deletion for those doc IDs cannot reconstruct entities_to_rebuild / relationships_to_rebuild because the per-document graph metadata is gone, so the promised retry path can leave stale KG summaries indefinitely.
Useful? React with 👍 / 👎.
| keep_msg = ( | ||
| f"Keeping doc_status for {len(successful_deletions)} docs " | ||
| f"because rebuild failed — re-trigger deletion to retry" | ||
| ) |
There was a problem hiding this comment.
Mark batch as failed when deferred rebuild errors
If the deferred rebuild raises, rebuild_ok is set to False but no documents are moved from successful_deletions to failed_deletions; the function only logs a "keep doc_status" message. The final completion summary still reports those documents as successful, which can mislead operators/automation into treating a partially failed batch as complete even though cleanup was intentionally deferred for retry.
Useful? React with 👍 / 👎.
The finally block of `adelete_by_doc_id` calls `_insert_done()` whenever `deletion_operations_started=True`, which the no-chunks early-return path sets even though it only mutates `doc_status` and `full_docs`. `_insert_done()` then triggers `index_done_callback()` on every storage, including the ~62 MB GraphML serialization and the ~500 MB-each NanoVectorDB JSON dumps — for a graph that wasn't touched at all. For the canonical use case of bulk-deleting FAILED ingest documents (WebUI multi-select cleanup), this turns a 200-doc batch into ~300 GB of pure no-op disk writes. This patch branches the finally on `chunk_ids`: * `chunk_ids` non-empty (chunks were removed) → keep `_insert_done()`. The graph + VDBs were genuinely mutated and must be persisted. * `chunk_ids` empty (no-chunks path) → persist only the storages this path actually touched: `doc_status`, `full_docs`, and (when `delete_llm_cache=True` with persisted cache IDs) `llm_response_cache`. Composes cleanly with the existing W2 patch (PR HKUDS#2819 deferred KG rebuild). Both reduce the per-document cost of batch deletion; W5 helps the no-chunks subset specifically while W2 helps the chunk-bearing subset by deferring the LLM-driven graph re-extraction. Tests: tests/test_no_chunks_skip_persist.py exercises both the basic no-chunks path and the no-chunks + delete_llm_cache path, asserting exactly which storages get awaited. All four existing W2 tests still pass — the chunk-bearing path is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi, checking in on this PR. I addressed the earlier feedback around overlapping rebuild/delete pools and deferred rebuild error handling, and GitHub still shows the branch clean with lint/offline tests green. Happy to adjust if there is anything else blocking review. |
When deleting N documents via background_delete_documents(), each call to adelete_by_doc_id() was triggering a full rebuild_knowledge_from_chunks(). For shared entities spanning multiple docs, this means the same summaries get recomputed N times -- 75x wasted LLM calls on an 85-doc batch. Added skip_rebuild parameter to adelete_by_doc_id(). When set, the per-doc rebuild is skipped and the affected entities/relationships are returned in DeletionResult so the caller can aggregate them. The batch delete loop now collects all targets and does a single rebuild pass at the end. Backwards compatible: skip_rebuild defaults to False, so single-doc deletion and direct API callers behave exactly as before. Closes HKUDS#2795
…failure Address review feedback on two issues: 1. When multiple documents share entities, a later deletion may fully remove an entity that an earlier deletion tagged for rebuild. Before the deferred rebuild, verify each entity/relation still exists in the graph and drop any that were already deleted — avoids wasted LLM calls and potential errors. 2. adelete_by_doc_id(skip_rebuild=True) now keeps doc_status alive. The batch caller only removes doc_status after a successful rebuild. If rebuild fails, docs remain visible so the user can re-trigger deletion (idempotent).
415e129 to
06b708f
Compare
|
Rebased this onto current upstream/main and resolved the conflict in Kept the PR's Validated locally on Windows:
|
06b708f to
bb1cfd2
Compare
|
Follow-up after the lint job: CI's pinned pre-commit ruff-format wanted one additional assert wrapping change in Revalidated locally on Windows:
|
bb1cfd2 to
aa81afb
Compare
|
Fixed the failing offline test after the rebase follow-up. The failure was from the document-route test double still using the old adelete_by_doc_id signature; it now records the skip_rebuild flag and asserts the batch/background path passes skip_rebuild=True. Validation: pytest tests/api/routes/test_document_routes_docx_archive.py::test_background_delete_removes_parser_hint_file_variants tests/test_batch_delete_deferred_rebuild.py -q passed (5 passed); pre-commit ruff passed on touched files; pre-commit ruff-format passed on touched files; py_compile on touched files passed; git diff --check passed. |
Summary
When deleting N documents via
background_delete_documents(), eachadelete_by_doc_id()call triggers a fullrebuild_knowledge_from_chunks(). For entities shared across documents, this means the same summaries get recomputed N times. On an 85-document batch, the issue author measured ~75x redundant LLM calls.This PR adds a
skip_rebuildparameter toadelete_by_doc_id()(defaults toFalsefor backwards compat). The batch deletion loop passesskip_rebuild=True, collects rebuild targets from each result, and does a single combined rebuild at the end.Changes:
lightrag/base.py—DeletionResultgains optionalentities_to_rebuild/relationships_to_rebuildfieldslightrag/lightrag.py—adelete_by_doc_id()conditionally skips rebuild whenskip_rebuild=True, populates rebuild targets on the resultlightrag/api/routers/document_routes.py—background_delete_documents()aggregates targets across loop, singlerebuild_knowledge_from_chunks()call at the endtests/test_batch_delete_deferred_rebuild.py— 4 tests covering both modesSingle-doc deletion (direct
adelete_by_doc_id()calls) is completely unaffected — same code path as before.Closes #2795