CA-428620: fix occasional vm suspend error#7134
Conversation
9a46060 to
74337af
Compare
data-cant-suspend-reason is not a reliable basis to decide whether a VM can be suspended, in some corner case, vm suspend operation runs into error, State blocked by non-migratable device 0000:00:07.0/nvme Signed-off-by: Chunjie Zhu <chunjie.zhu@citrix.com>
74337af to
a1564d4
Compare
gthvn1
left a comment
There was a problem hiding this comment.
Interesting. It looks good to me, just few questions.
| let restore _ _cb vm vbds vifs data vgpu_data extras = | ||
| with_lock m (restore_nolock vm vbds vifs data vgpu_data extras) | ||
|
|
||
| let resume _ _vm = () |
| let lack_feature feature = not (has_feature ~vmgmr ~feature) in | ||
| match op with | ||
| | (`suspend | `checkpoint | `pool_migrate | `migrate_send) when is_live -> ( | ||
| | (`pool_migrate | `migrate_send) when is_live -> ( |
There was a problem hiding this comment.
I think I understand the split but don't we have the same problem of false negatives here?
There was a problem hiding this comment.
Actually the migration will do the suspend also.
| let lack_feature feature = not (has_feature ~vmgmr ~feature) in | ||
| match op with | ||
| | (`suspend | `checkpoint | `pool_migrate | `migrate_send) when is_live -> ( | ||
| | (`pool_migrate | `migrate_send) when is_live -> ( |
There was a problem hiding this comment.
With this, migrations are still blocked in xapi, while suspend is allowed.
There's already an assert_can_save in VM_migrate in xenopsd, should it not be moved together with suspend and checkpoint here?
| (* Cannot gate suspend/checkpoint on the cached data-cant-suspend-reason: | ||
| that xenstore key is set by the QMP event thread whenever a QMP event | ||
| arrives and Query_migratable returns an error (e.g. NVMe has transient | ||
| in-flight I/O). By the time the actual suspend executes, the device may | ||
| be idle. xenopsd performs a fresh Query_migratable check via | ||
| assert_can_save before committing to the save sequence, which is the | ||
| authoritative check. *) |
There was a problem hiding this comment.
It seems to me that originally the xapi uses the data-cant-suspend-reason in guest metrics to gate for suspend, checkpoint, pool_migrate and migrate_send. Meanwhile, the migration (both pool_migrate and migrate_send) implementation in xenopsd uses B.VM.assert_can_save to gate again.
This change works only for the value in xenstore being cleaned up between xapi gate and xenopsd gate? If this is true, I would think a retry on client side is fine.
data-cant-suspend-reason is not a reliable basis to decide
whether a VM can be suspended, in some corner case, vm suspend
operation runs into error,
State blocked by non-migratable device 0000:00:07.0/nvme