Skip to content

azure_kubernetes_service: add management plane operations#6749

Open
ashishsuneja wants to merge 6 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_aks
Open

azure_kubernetes_service: add management plane operations#6749
ashishsuneja wants to merge 6 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_aks

Conversation

@ashishsuneja

Copy link
Copy Markdown
Collaborator

Summary
Adds AKS-specific implementations of management plane async methods.

Main changes

  • GetNodePoolNames, CreateNodePoolAsync, DeleteNodePoolAsync,
    UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
    ResolveNodePoolVersions
  • Retry logic via vm_util.Retry (max 5, 30s interval)

How tested

  • Existing AKS test suite passing
  • pyink + lint-diffs clean

- GetNodePoolNames: lists current AKS node pools
- CreateNodePoolAsync: creates nodepool, returns np_succeeded handle
- DeleteNodePoolAsync: deletes nodepool, returns np_gone handle
- UpgradeNodePoolAsync: upgrades nodepool, returns np_succeeded handle
- UpdateClusterAsync: toggles tag for non-destructive update
- WaitForOperation: polls nodepool/cluster until terminal state
- ResolveNodePoolVersions: auto-detects initial/target versions
- Retry logic via vm_util.Retry (max 5, 30s interval)

Tested:
- Existing AKS test suite passing
- pyink + lint-diffs clean
@ashishsuneja ashishsuneja force-pushed the mgmt_plane_aks branch 2 times, most recently from 3b67e05 to d79ec41 Compare June 17, 2026 19:25
)
return initial, target

def WaitForOperation(self, op_handle: str) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so AKS doesn't have proper ids & therefore you have to split/format the op handle in a weird way huh? Understood this is needed as it's a top level method signature used from the benchmark. Add some examples in function comments & to unit tests.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added WaitForOperation docstring examples showing the 3 handle formats (np_succeeded:, np_gone:, cluster_succeeded) and 7 unit tests covering happy paths and error paths for all 3 kinds (testWaitForOperationNpSucceeded, testWaitForOperationNpSucceededFailed, testWaitForOperationNpSucceededNotFound, testWaitForOperationNpGone, testWaitForOperationClusterSucceeded, testWaitForOperationClusterFailed, testWaitForOperationInvalidHandle

f'nodepool {name} state={status}'
)

@vm_util.Retry(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of duplicate code here. Can you create a shared function these call & call it from the others?
Some ideas:

  • function signature would take a list of args & return like some object with either "not found", "succeeded", "failed" or "other error".
  • Then the caller can decide whether to eg raise or silently return on not found.
  • shared function can have the vm_util.Retry wrapper around it

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — extracted _WaitForProvisioningState(cmd, resource_desc, not_found_is_success, retryable_exception_type) which wraps vm_util.Retry and handles Succeeded/Failed/NotFound/in-progress uniformly. WaitForOperation now just selects the right cmd and args for each handle kind and delegates to it — the three inner functions (WaitNpSucceeded, WaitNpGone, WaitClusterSucceeded) are gone.

return 'cluster_succeeded'
except (ValueError, KeyError, json.JSONDecodeError) as e:
logging.info('[AKS] UpdateClusterAsync: pool parse error: %s', e)
# Fallback: tag update

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a fallback? When does a fallback happen instead of above code?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed along with the scale branch (Comment 4) — there's no fallback anymore since UpdateClusterAsync now only does the tag update.

azure.AZURE_PATH,
'aks',
'nodepool',
'scale',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is a nodepool scale operation an "update" ?

IMO just stick with the tag & never attempt a scale. Looks weird & potentially produces non-deterministic results.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed the scale branch entirely. UpdateClusterAsync now only does the tag update, which is always cluster-scoped and always a real change (unix timestamp value).

)
logging.info('Successfully applied BlobFuse PVC.')

def CreateNodePool(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remind me again why we need async vs sync versions of these?
Also looking at #6751 which is titled "sync operations wrap async operations" - it doesn't actually seem to do that & only modifies kubernetes_management_benchmark.

After you've convinced me we need sync & async versions, just wrap them in this PR. it should be fewer lines of code & therefore simpler & easier to review.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folded #6751 into this PR as suggested. kubernetes_cluster.py sync wrappers (CreateNodePool, DeleteNodePool, UpgradeNodePool, UpdateCluster) now call *Async + WaitForOperation instead of raising NotImplementedError — so providers only need to implement the async variants. PR #6751 is closed.

if node_version:
# _GetNodeFlags may have added self.cluster_version; replace or append.
if '--kubernetes-version' in node_flags:
node_flags[node_flags.index('--kubernetes-version') + 1] = node_version

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO if it is in node_flags, remove it first. Then either way second step add it back in.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — both CreateNodePool and CreateNodePoolAsync now use remove-then-append: strip any existing --kubernetes-version that _GetNodeFlags may have added from self.cluster_version, then always append the caller-supplied version.

node_flags[node_flags.index('--kubernetes-version') + 1] = node_version
else:
node_flags += ['--kubernetes-version', node_version]
cmd = [

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cmd list of args looks the same as the list of args in _CreateNodepool above. It should share code with it. This might be the same as the wrapping comment, but generally too much duplication here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — extracted _BuildNodePoolAddCmd(nodepool_config, node_flags, no_wait=False) shared by both CreateNodePool and CreateNodePoolAsync. The only difference between them is no_wait=True for the async variant.

Ashish Suneja added 3 commits July 1, 2026 15:52
Comment 4: Remove nodepool scale from UpdateClusterAsync -- stick with
tag update only (always cluster-scoped, always a real change).

Comment 3: Fallback comment removed along with the scale branch.

Comment 7: Extract _BuildNodePoolAddCmd to share the az aks nodepool
add argument list between CreateNodePool and CreateNodePoolAsync.

Comment 6: Replace replace-or-append --kubernetes-version pattern with
remove-then-append in both nodepool create methods.

Comment 2: Extract _WaitForProvisioningState to share the retry/poll
pattern across np_succeeded, np_gone, and cluster_succeeded waits.

Comment 1: Add WaitForOperation docstring examples and 7 unit tests
covering np_succeeded/np_gone/cluster_succeeded happy + error paths.
CreateNodePool, DeleteNodePool, UpgradeNodePool, UpdateCluster now
call their *Async counterpart + WaitForOperation instead of raising
NotImplementedError. Providers only need to implement *Async methods.

Also adds try/except in _PostCreate for Python 3.14 pickling issue
with event poller — logs warning and continues without polling rather
than crashing PKB.

Tested:
- 4/4 sync wrapper tests passing
- pyink + lint-diffs clean
Ashish Suneja added 2 commits July 1, 2026 20:36
…pers

Same fix as container_service_mock.py (PR GoogleCloudPlatform#6746) -- GetNodePoolNames is
now abstract on KubernetesCluster (added by the sync-wraps-async cherry-
pick from GoogleCloudPlatform#6751) so all TestKubernetesCluster subclasses need a stub.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants