azure_kubernetes_service: add management plane operations#6749
azure_kubernetes_service: add management plane operations#6749ashishsuneja wants to merge 6 commits into
Conversation
- GetNodePoolNames: lists current AKS node pools - CreateNodePoolAsync: creates nodepool, returns np_succeeded handle - DeleteNodePoolAsync: deletes nodepool, returns np_gone handle - UpgradeNodePoolAsync: upgrades nodepool, returns np_succeeded handle - UpdateClusterAsync: toggles tag for non-destructive update - WaitForOperation: polls nodepool/cluster until terminal state - ResolveNodePoolVersions: auto-detects initial/target versions - Retry logic via vm_util.Retry (max 5, 30s interval) Tested: - Existing AKS test suite passing - pyink + lint-diffs clean
3b67e05 to
d79ec41
Compare
| ) | ||
| return initial, target | ||
|
|
||
| def WaitForOperation(self, op_handle: str) -> None: |
There was a problem hiding this comment.
so AKS doesn't have proper ids & therefore you have to split/format the op handle in a weird way huh? Understood this is needed as it's a top level method signature used from the benchmark. Add some examples in function comments & to unit tests.
There was a problem hiding this comment.
Added WaitForOperation docstring examples showing the 3 handle formats (np_succeeded:, np_gone:, cluster_succeeded) and 7 unit tests covering happy paths and error paths for all 3 kinds (testWaitForOperationNpSucceeded, testWaitForOperationNpSucceededFailed, testWaitForOperationNpSucceededNotFound, testWaitForOperationNpGone, testWaitForOperationClusterSucceeded, testWaitForOperationClusterFailed, testWaitForOperationInvalidHandle
| f'nodepool {name} state={status}' | ||
| ) | ||
|
|
||
| @vm_util.Retry( |
There was a problem hiding this comment.
lots of duplicate code here. Can you create a shared function these call & call it from the others?
Some ideas:
- function signature would take a list of args & return like some object with either "not found", "succeeded", "failed" or "other error".
- Then the caller can decide whether to eg raise or silently return on not found.
- shared function can have the vm_util.Retry wrapper around it
There was a problem hiding this comment.
Done — extracted _WaitForProvisioningState(cmd, resource_desc, not_found_is_success, retryable_exception_type) which wraps vm_util.Retry and handles Succeeded/Failed/NotFound/in-progress uniformly. WaitForOperation now just selects the right cmd and args for each handle kind and delegates to it — the three inner functions (WaitNpSucceeded, WaitNpGone, WaitClusterSucceeded) are gone.
| return 'cluster_succeeded' | ||
| except (ValueError, KeyError, json.JSONDecodeError) as e: | ||
| logging.info('[AKS] UpdateClusterAsync: pool parse error: %s', e) | ||
| # Fallback: tag update |
There was a problem hiding this comment.
why do we need a fallback? When does a fallback happen instead of above code?
There was a problem hiding this comment.
Removed along with the scale branch (Comment 4) — there's no fallback anymore since UpdateClusterAsync now only does the tag update.
| azure.AZURE_PATH, | ||
| 'aks', | ||
| 'nodepool', | ||
| 'scale', |
There was a problem hiding this comment.
how is a nodepool scale operation an "update" ?
IMO just stick with the tag & never attempt a scale. Looks weird & potentially produces non-deterministic results.
There was a problem hiding this comment.
Done — removed the scale branch entirely. UpdateClusterAsync now only does the tag update, which is always cluster-scoped and always a real change (unix timestamp value).
| ) | ||
| logging.info('Successfully applied BlobFuse PVC.') | ||
|
|
||
| def CreateNodePool( |
There was a problem hiding this comment.
remind me again why we need async vs sync versions of these?
Also looking at #6751 which is titled "sync operations wrap async operations" - it doesn't actually seem to do that & only modifies kubernetes_management_benchmark.
After you've convinced me we need sync & async versions, just wrap them in this PR. it should be fewer lines of code & therefore simpler & easier to review.
There was a problem hiding this comment.
| if node_version: | ||
| # _GetNodeFlags may have added self.cluster_version; replace or append. | ||
| if '--kubernetes-version' in node_flags: | ||
| node_flags[node_flags.index('--kubernetes-version') + 1] = node_version |
There was a problem hiding this comment.
IMO if it is in node_flags, remove it first. Then either way second step add it back in.
There was a problem hiding this comment.
Done — both CreateNodePool and CreateNodePoolAsync now use remove-then-append: strip any existing --kubernetes-version that _GetNodeFlags may have added from self.cluster_version, then always append the caller-supplied version.
| node_flags[node_flags.index('--kubernetes-version') + 1] = node_version | ||
| else: | ||
| node_flags += ['--kubernetes-version', node_version] | ||
| cmd = [ |
There was a problem hiding this comment.
this cmd list of args looks the same as the list of args in _CreateNodepool above. It should share code with it. This might be the same as the wrapping comment, but generally too much duplication here.
There was a problem hiding this comment.
Done — extracted _BuildNodePoolAddCmd(nodepool_config, node_flags, no_wait=False) shared by both CreateNodePool and CreateNodePoolAsync. The only difference between them is no_wait=True for the async variant.
Comment 4: Remove nodepool scale from UpdateClusterAsync -- stick with tag update only (always cluster-scoped, always a real change). Comment 3: Fallback comment removed along with the scale branch. Comment 7: Extract _BuildNodePoolAddCmd to share the az aks nodepool add argument list between CreateNodePool and CreateNodePoolAsync. Comment 6: Replace replace-or-append --kubernetes-version pattern with remove-then-append in both nodepool create methods. Comment 2: Extract _WaitForProvisioningState to share the retry/poll pattern across np_succeeded, np_gone, and cluster_succeeded waits. Comment 1: Add WaitForOperation docstring examples and 7 unit tests covering np_succeeded/np_gone/cluster_succeeded happy + error paths.
CreateNodePool, DeleteNodePool, UpgradeNodePool, UpdateCluster now call their *Async counterpart + WaitForOperation instead of raising NotImplementedError. Providers only need to implement *Async methods. Also adds try/except in _PostCreate for Python 3.14 pickling issue with event poller — logs warning and continues without polling rather than crashing PKB. Tested: - 4/4 sync wrapper tests passing - pyink + lint-diffs clean
…pers Same fix as container_service_mock.py (PR GoogleCloudPlatform#6746) -- GetNodePoolNames is now abstract on KubernetesCluster (added by the sync-wraps-async cherry- pick from GoogleCloudPlatform#6751) so all TestKubernetesCluster subclasses need a stub.
Summary
Adds AKS-specific implementations of management plane async methods.
Main changes
UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
ResolveNodePoolVersions
How tested