
[Test] Make the retrieval of flexible instance types more resilient to prevent test failures. #7448

Merged
gmarciani merged 2 commits into aws:develop from gmarciani:wip/mgiacomo/3160/fix-test-dfsm-0612-1 on Jun 12, 2026


@gmarciani gmarciani commented Jun 12, 2026


Description of changes

Make the retrieval of flexible instance types more resilient to prevent test failures.

In particular:

  1. Retry the retrieval to be robust against transient failures (networking glitches or throttling).
  2. In case of persistent failure, emit a warning log and fall back to the original instance type.
  3. Sort the list of equivalent instance types so that multiple calls to the function always return the same result.
  4. Add a log line in get_similar_instance_types to facilitate troubleshooting.
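
The four points above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `get_flexible_instance_types`, the `fetch_similar` callable (a stand-in for whatever lookup `get_similar_instance_types` performs), and the retry parameters are all assumptions made for the example.

```python
import logging
import time
from typing import Callable, List

logger = logging.getLogger(__name__)


def get_flexible_instance_types(
    instance_type: str,
    fetch_similar: Callable[[str], List[str]],  # hypothetical lookup, injected for the sketch
    max_attempts: int = 3,                      # assumed value, not taken from the PR
    wait_seconds: float = 1.0,                  # assumed value, not taken from the PR
) -> List[str]:
    """Return a deterministic list of instance types equivalent to instance_type."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Deduplicate and sort so repeated calls return the same list.
            result = sorted(set(fetch_similar(instance_type)))
            logger.info("Similar instance types for %s: %s", instance_type, result)
            return result
        except Exception as e:  # broad on purpose: throttling, networking, etc.
            logger.warning(
                "Attempt %d/%d to retrieve similar instance types for %s failed: %s",
                attempt, max_attempts, instance_type, e,
            )
            if attempt < max_attempts:
                time.sleep(wait_seconds)
    # Every attempt failed: warn and fall back to the original instance type.
    logger.warning("Falling back to the original instance type %s", instance_type)
    return [instance_type]
```

Injecting the lookup as a callable keeps the sketch free of AWS calls, so the retry and fallback paths can be exercised with a fake that raises on demand.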

The definitive solution to further reduce the risk of failures is to cache the result, which is work in progress in #7402.

Tests

The test below SUCCEEDED with a cluster config using the expected list of flexible instance types. Also verified that the test retrieves the same list of flexible instance types on every call.

test-suites:
  update:
    test_update.py::test_dynamic_file_systems_update:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - ubuntu2404
        regions:
        - eu-west-2
        schedulers:
        - slurm

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani requested review from a team as code owners June 12, 2026 17:04
@gmarciani gmarciani added the skip-changelog-update and Test labels Jun 12, 2026
@gmarciani gmarciani changed the title [Test] Add log line in get_similar_instance_types to facilitate troubleshooting [Test] Make the retrieval of flexible instance types more resilient to prevent test failures. Jun 12, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/fix-test-dfsm-0612-1 branch from 5c4b053 to 13bd11c Compare June 12, 2026 18:39
@gmarciani gmarciani merged commit caeaeb2 into aws:develop Jun 12, 2026
19 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3160/fix-test-dfsm-0612-1 branch June 12, 2026 20:22