Skip to content

CA-428620: fix occasional vm suspend error#7134

Open
chunjiez wants to merge 1 commit into
masterfrom
private/chunjiez/bugfix
Open

CA-428620: fix occasional vm suspend error#7134
chunjiez wants to merge 1 commit into
masterfrom
private/chunjiez/bugfix

Conversation

@chunjiez

Copy link
Copy Markdown
Collaborator

data-cant-suspend-reason is not a reliable basis to decide
whether a VM can be suspended, in some corner case, vm suspend
operation runs into error,

State blocked by non-migratable device 0000:00:07.0/nvme

@chunjiez chunjiez force-pushed the private/chunjiez/bugfix branch from 9a46060 to 74337af Compare June 18, 2026 07:08
  data-cant-suspend-reason is not a reliable basis to decide
  whether a VM can be suspended, in some corner case, vm suspend
  operation runs into error,

  State blocked by non-migratable device 0000:00:07.0/nvme

Signed-off-by: Chunjie Zhu <chunjie.zhu@citrix.com>
@chunjiez chunjiez force-pushed the private/chunjiez/bugfix branch from 74337af to a1564d4 Compare June 18, 2026 07:19

@gthvn1 gthvn1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. It looks good to me, just few questions.

let restore _ _cb vm vbds vifs data vgpu_data extras =
with_lock m (restore_nolock vm vbds vifs data vgpu_data extras)

let resume _ _vm = ()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you add this?

let lack_feature feature = not (has_feature ~vmgmr ~feature) in
match op with
| (`suspend | `checkpoint | `pool_migrate | `migrate_send) when is_live -> (
| (`pool_migrate | `migrate_send) when is_live -> (

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the split but don't we have the same problem of false negatives here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the migration will do the suspend also.

let lack_feature feature = not (has_feature ~vmgmr ~feature) in
match op with
| (`suspend | `checkpoint | `pool_migrate | `migrate_send) when is_live -> (
| (`pool_migrate | `migrate_send) when is_live -> (

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this, migrations are still blocked in xapi, while suspend is allowed.

There's already an assert_can_save in VM_migrate in xenopsd, should it not be moved together with suspend and checkpoint here?

Comment on lines +219 to +225
(* Cannot gate suspend/checkpoint on the cached data-cant-suspend-reason:
that xenstore key is set by the QMP event thread whenever a QMP event
arrives and Query_migratable returns an error (e.g. NVMe has transient
in-flight I/O). By the time the actual suspend executes, the device may
be idle. xenopsd performs a fresh Query_migratable check via
assert_can_save before committing to the save sequence, which is the
authoritative check. *)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that originally the xapi uses the data-cant-suspend-reason in guest metrics to gate for suspend, checkpoint, pool_migrate and migrate_send. Meanwhile, the migration (both pool_migrate and migrate_send) implementation in xenopsd uses B.VM.assert_can_save to gate again.

This change works only for the value in xenstore being cleaned up between xapi gate and xenopsd gate? If this is true, I would think a retry on client side is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants