Speed up and stabilize integration tests (parallelism + deploy retry) #2451

Draft
GarrettBeatty wants to merge 8 commits into dev from gcbeatty/fix-custom-authorizer-deploy-retry

GarrettBeatty commented Jun 26, 2026

Summary

This started as a fix for a flaky CI failure in TestCustomAuthorizerApp.IntegrationTests and grew into a broader effort to speed up and stabilize the integration-test phase, which dominated CI wall-clock time because it ran everything serially. It also fixes a few flaky unit tests surfaced along the way.

Reliability

  • Retry custom-authorizer deployment on transient IAM role propagation failures. The fixture's CloudFormation deploy intermittently rolled back with "The role defined for the function cannot be assumed by Lambda", a transient IAM eventual-consistency race. DeploymentScript.ps1 now retries the deploy up to 3 times, deleting the rolled-back stack between attempts (a ROLLBACK_COMPLETE stack can't be re-created), and surfaces CloudFormation failure events.

Speed

  • Run the integration-test projects in parallel. run-integ-tests now runs each *.IntegrationTests.csproj concurrently (run-integ-tests-parallel.ps1); each project deploys its own isolated stack, so they share no state.
  • Stack-scoped Lambda lookup. LambdaHelper.FilterByCloudFormationStackAsync uses CloudFormation ListStackResources instead of scanning every Lambda in the account and reading each function's tags — O(stack) instead of O(account), and no shared-account throttling.
  • In-project test parallelism. TestServerlessApp and TestCustomAuthorizerApp share their single deployed-stack fixture across the assembly via IAssemblyFixture instead of one serial [Collection], so the test classes run in parallel (stack still deploys once).
  • Durable suite parallelism. Enabled parallel execution for the durable integ suite (was maxParallelThreads=1).
  • Publish durable test functions once, in a single MSBuild pass. A generated traversal project (Restore;Publish, BuildInParallel) builds the shared dependency projects once and publishes every function to its own bin/publish; tests then only zip the output — replacing per-test cold publishing.

Making durable parallelism safe (rate limits & races)

Enabling parallelism in the durable suite surfaced a series of shared-resource contention issues, fixed in layers:

  • IAM throttling → replaced per-test IAM roles with a single shared execution role (created at most once per account, reused across runs); dispose no longer deletes roles.
  • Lambda control-plane throttling → share the AWS clients statically so adaptive retry coordinates backoff across deployments, and cap concurrent control-plane calls (CreateFunction/DeleteFunction/GetFunctionConfiguration) with a suite-wide gate.
  • Shared-file races → idempotent dotnet tool install across the parallel deploy scripts, and zip each function package to a unique temp path (a function used by more than one test was being zipped to a shared path concurrently).

Developer experience

  • Live integ-test output. The parallel runner streams each project's output line-by-line (prefixed with the project name) instead of buffering until completion; failed projects get a clean reprinted block.

Flaky unit-test fixes (unrelated to the integ work, surfaced in CI)

  • Durable suspend tests (InvokeOperationTests et al.): replaced fixed Task.Delay waits before asserting suspension with a deterministic await on the termination signal (TerminationManager.TerminationTask), bounded by a timeout — the fixed delays raced under CI thread-pool pressure.
  • FileDescriptorLogStream test: the test helper trimmed trailing null bytes from captured output, which flaked ~1/256 of the time when a log header's timestamp ended in 0x00 (16-byte header read as 15). Now captures exactly the bytes written.

Testing

  • Affected projects build clean; PowerShell scripts parse.
  • Verified against AWS where the local environment allowed: custom-authorizer deploys its stack once and all 20 tests pass under the parallel IAssemblyFixture setup; the shared-role + single-pass publish path works (51/51 functions publish in one MSBuild pass); the previously-flaky unit tests pass and are now deterministic.
  • End-to-end timing and throttling behavior of the durable suite is still being validated on real CI (each contention fix above was driven by a CI run).

…ation failure

The TestCustomAuthorizerApp integration test stack deploys many Lambda
functions that reference IAM roles created in the same stack. CloudFormation
occasionally calls Lambda CreateFunction before the role's trust policy has
propagated through IAM, producing "The role defined for the function cannot
be assumed by Lambda" and rolling the whole stack back, which fails all 20
tests in the project.

Wrap the deploy in a retry loop (3 attempts). Between attempts, delete the
rolled-back stack (a ROLLBACK_COMPLETE stack cannot be re-created) and pause
briefly to let IAM settle. Surface CloudFormation failed-resource events on
each failure for easier debugging.
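
The retry flow described above can be sketched roughly as follows. This is an illustration in Python, not the PowerShell in DeploymentScript.ps1; the callables `deploy`, `delete_stack`, and `report_failures` are hypothetical stand-ins for the real deploy, stack delete, and event-dump steps, and the 15-second pause is an assumed value (the commit only says "briefly").

```python
import time
from typing import Callable


def deploy_with_retry(
    deploy: Callable[[], bool],            # hypothetical: returns True when the stack deploys
    delete_stack: Callable[[], None],      # hypothetical: removes the rolled-back stack
    report_failures: Callable[[], None],   # hypothetical: prints failed-resource events
    max_attempts: int = 3,
    settle_seconds: float = 15.0,          # assumed pause; not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the deploy up to max_attempts times, cleaning up between attempts."""
    for attempt in range(1, max_attempts + 1):
        if deploy():
            return True
        # Surface the failure on every attempt, including the last one.
        report_failures()
        if attempt < max_attempts:
            # A ROLLBACK_COMPLETE stack cannot be re-created, so delete it first,
            # then give IAM a moment to finish propagating the role.
            delete_stack()
            sleep(settle_seconds)
    return False
```

Injecting `sleep` keeps the sketch easy to exercise without real waiting; a caller would normally leave it at the default.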
GarrettBeatty added the Release Not Needed label Jun 26, 2026

The integration-test phase ran everything serially and dominated CI wall-clock.
Four independent changes cut that down:

- run-integ-tests now runs each *.IntegrationTests.csproj concurrently
  (buildtools/run-integ-tests-parallel.ps1). Each project deploys its own
  isolated CloudFormation stack, so they share no state. Replaces the serial
  MSBuild item-batched Exec.

- LambdaHelper.FilterByCloudFormationStackAsync now lists the stack's resources
  via CloudFormation ListStackResources instead of scanning every Lambda in the
  account and reading each function's tags one at a time. O(stack size) instead
  of O(account size), and no longer throttles in a shared test account.

- TestServerlessApp and TestCustomAuthorizerApp integ tests share their single
  deployed-stack fixture across the assembly via IAssemblyFixture (the
  Xunit.Extensions.AssemblyFixture package) instead of one serial
  [Collection]. The stack still deploys once, but the test classes now run in
  parallel.

- The durable execution integ suite (45 independent tests, each deploying its
  own uniquely-named function) no longer forces maxParallelThreads=1; its build
  helper already guards concurrent publishes with a per-directory file lock.
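
To illustrate the stack-scoped lookup from the second item, here is a minimal Python sketch of the filtering step only. It assumes the caller has already fetched the stack's resource summaries from CloudFormation ListStackResources (each with `ResourceType` and `PhysicalResourceId` fields); the function name `lambda_function_names` is hypothetical and is not the PR's LambdaHelper code.

```python
from typing import Iterable, List, Mapping

LAMBDA_RESOURCE_TYPE = "AWS::Lambda::Function"


def lambda_function_names(stack_resources: Iterable[Mapping[str, str]]) -> List[str]:
    """Return the function names among one stack's resource summaries.

    The work is proportional to the size of the stack, not to the number of
    Lambda functions in the account, and no per-function tag reads are needed.
    """
    return [
        resource["PhysicalResourceId"]
        for resource in stack_resources
        if resource.get("ResourceType") == LAMBDA_RESOURCE_TYPE
        and resource.get("PhysicalResourceId")
    ]
```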

Verified end-to-end against AWS: TestCustomAuthorizerApp deploys its stack once
and all 20 tests pass under the parallel AssemblyFixture setup.

GarrettBeatty changed the title from "Retry TestCustomAuthorizerApp deployment on transient IAM role propagation failure" to "Speed up and stabilize integration tests (parallelism + deploy retry)" on Jun 27, 2026
GarrettBeatty reopened this Jun 27, 2026

…letion

The parallel runner captured each project's output with Out-String and only
printed it after the project finished, so nothing appeared during the long
integration-test run. Stream each line to the host as it arrives, prefixed with
the project name so the interleaved parallel logs stay attributable. Failed
projects still get their full output reprinted as one clean block at the end.
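
A rough Python sketch of the streaming behavior described above, for illustration only (the real runner is run-integ-tests-parallel.ps1). The function name `run_streaming` and the `[name]` prefix format are assumptions; the point is that each line is emitted as it arrives while a copy is kept so a failed project can be reprinted as one block.

```python
import subprocess
import sys
from typing import Callable, List, Sequence, Tuple


def run_streaming(
    name: str,
    command: Sequence[str],
    emit: Callable[[str], None] = print,
) -> Tuple[int, List[str]]:
    """Run command, emitting each output line immediately with a project prefix."""
    captured: List[str] = []
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so ordering is preserved
        text=True,
        bufsize=1,                 # line-buffered reads on our side of the pipe
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        line = line.rstrip("\n")
        captured.append(line)      # kept for the end-of-run reprint on failure
        emit(f"[{name}] {line}")
    return proc.wait(), captured
```

Running several projects at once would mean calling this from one thread per project; the prefix is what keeps the interleaved lines attributable.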
…fixed delay

InvokeOperationTests.InvokeAsync_FreshExecution_CheckpointsStartAndSuspends
failed intermittently on net10.0 (e.g. CI run on PR #2451). The suspend-path
tests kicked off an operation, slept a fixed 10-50ms, then asserted
tm.IsTerminated. Under CI thread-pool pressure the suspend signal didn't always
fire within that window, so the assert raced and failed.

TerminationManager already exposes TerminationTask, a Task that completes
exactly when Terminate() fires. Replace the fixed delays with a shared
tm.WaitForTerminationAsync() helper that awaits that task (bounded by a 10s
timeout so a genuine non-suspension still fails fast at the assert). Applied to
all 13 suspend-gated sites across 5 test files.
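
The pattern is language-independent, so here is a minimal Python asyncio sketch of it under stated assumptions: the `TerminationManager` class below is a hypothetical stand-in, not the library's type, and `wait_for_termination` only mirrors the idea of the WaitForTerminationAsync helper (await the signal, bounded by a timeout, and leave the failure to the test's own assert).

```python
import asyncio


class TerminationManager:
    """Hypothetical stand-in exposing a termination signal that can be awaited."""

    def __init__(self) -> None:
        self._event = asyncio.Event()

    @property
    def is_terminated(self) -> bool:
        return self._event.is_set()

    def terminate(self) -> None:
        self._event.set()

    async def termination_task(self) -> None:
        await self._event.wait()


async def wait_for_termination(tm: TerminationManager, timeout: float = 10.0) -> None:
    """Wait until terminate() fires, or until the timeout elapses.

    A timeout is swallowed on purpose: the caller then asserts on
    tm.is_terminated, so a genuine non-suspension fails at that assert.
    """
    try:
        await asyncio.wait_for(tm.termination_task(), timeout)
    except asyncio.TimeoutError:
        pass
```

Compared with a fixed sleep, the wait returns as soon as the signal fires and does not depend on how loaded the scheduler is.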

Verified: full suite passes on net8.0 and net10.0, and the previously-flaky
test passed 25/25 consecutive runs on net10.0. Also faster — tests resume the
instant suspension fires instead of always sleeping.

Running the durable integ suite in parallel (maxParallelThreads=4) surfaced two
contention problems that this addresses.

IAM 'Rate exceeded': each test created and deleted its own IAM role, so several
deployments hammered IAM's (global, single-bucket, low-rate) mutating APIs at
once. Replace per-test roles with a single shared execution role
(durable-integ-shared-execution-role) created at most once per account and
reused across tests and runs, gated so concurrent deployments don't race. It
carries the union of permissions every scenario needs (invoke durable-integ-*
functions + send durable-execution callbacks); no test depends on a role
lacking a permission, so one role is safe. Dispose no longer deletes roles.
Clients also use adaptive retry as a backstop.
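
The "created at most once, gated" behavior can be sketched in Python as a lock-protected get-or-create. This is only an illustration: the `SharedRole` class and the `find_role` and `create_role` callables are hypothetical, standing in for the real IAM lookup and creation calls.

```python
import threading
from typing import Callable, Optional


class SharedRole:
    """Resolve one shared role ARN, creating the role at most once per process."""

    def __init__(
        self,
        find_role: Callable[[], Optional[str]],  # hypothetical: ARN if the role exists
        create_role: Callable[[], str],          # hypothetical: creates it, returns ARN
    ) -> None:
        self._find = find_role
        self._create = create_role
        self._lock = threading.Lock()
        self._arn: Optional[str] = None

    def get_or_create(self) -> str:
        # The lock is the gate: concurrent deployments queue here, so only the
        # first caller ever reaches the mutating (rate-limited) create call.
        with self._lock:
            if self._arn is None:
                self._arn = self._find() or self._create()
            return self._arn
```

Because the role already exists on later runs, `find_role` succeeds and no mutating IAM call is made at all.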

Build thrash/timeouts: each test published its function separately and wiped
obj/bin first, so the shared source projects (Amazon.Lambda.DurableExecution
etc.) were rebuilt per-test, and concurrent publishes thrashed MSBuild into
'dotnet timed out'. Publish all functions once, up front, in a single MSBuild
pass via a generated traversal project (Restore;Publish, BuildInParallel) that
builds the shared projects once and publishes each function to its own
bin/publish; tests then only zip that output. Verified: 51/51 functions publish
in one ~16s pass with 0 errors, and the suite no longer throttles IAM.

MaxSizeProducesOneLogFrame intermittently failed with 'Expected: 16, Actual: 15'
on the header length. The header ends with an 8-byte big-endian microsecond
timestamp; roughly 1 in 256 timestamps ends in a 0x00 byte. TestFileStream's
Write captured bytes via TrimTrailingNullBytes(buffer).Take(count), which
stripped that legitimate trailing zero, yielding a 15-byte header.

Capture exactly buffer[offset, offset + count) instead — that is precisely what
the production code wrote, and it no longer depends on the timestamp's value.
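
A small Python reproduction of the failure mode, as a sketch. The two capture functions are hypothetical analogues of the old and new test-helper behavior, and the first 8 header bytes are placeholders because the real header layout before the timestamp is not described here.

```python
def capture_trimmed(buffer: bytes, offset: int, count: int) -> bytes:
    """Old behavior as described: trim trailing zero bytes, then take count bytes."""
    return buffer.rstrip(b"\x00")[:count]


def capture_exact(buffer: bytes, offset: int, count: int) -> bytes:
    """New behavior: keep exactly the half-open range [offset, offset + count)."""
    return bytes(buffer[offset:offset + count])


# A 16-byte header: 8 placeholder bytes, then an 8-byte big-endian timestamp
# whose low byte happens to be 0x00 (true for roughly 1 in 256 timestamps).
timestamp = 0x00061A2B3C4D5E00
header = bytes(range(1, 9)) + timestamp.to_bytes(8, "big")

print(len(capture_trimmed(header, 0, len(header))))  # 15: the real zero was stripped
print(len(capture_exact(header, 0, len(header))))    # 16: matches what was written
```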
After the shared-role fix removed IAM throttling, the throttling moved to
Lambda's account-wide control-plane APIs: with maxParallelThreads=4, the
combination of CreateFunction + DeleteFunction + WaitForFunctionActive polling
GetFunctionConfiguration exceeded Lambda's limits, surfacing as 'Rate exceeded'
and adaptive retry's 'capacity could not be obtained'.

This addresses two compounding causes:

- Each deployment built its own AWS clients, so adaptive retry's per-client
  rate limiter couldn't coordinate across the parallel deployments — N clients
  each assumed they had capacity and fired at once. Make the Lambda and IAM
  clients static/shared so adaptive retry actually paces the whole suite.

- Cap concurrent Lambda control-plane calls (create/delete/get-configuration)
  with a suite-wide semaphore (limit 2) via a RunControlPlaneAsync helper, so
  the 4 parallel test threads don't collectively exceed Lambda's control-plane
  rate. Data-plane calls (Invoke, durable-execution reads) are not gated. Also
  slow the WaitForFunctionActive poll from 2s to 3s to cut its call rate.
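
The suite-wide gate can be pictured with a short Python asyncio sketch. It is not the PR's RunControlPlaneAsync helper; the `ControlPlaneGate` class is a hypothetical analogue showing how one shared semaphore with a limit of 2 bounds concurrent control-plane calls regardless of how many tests are running.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ControlPlaneGate:
    """Cap how many gated calls may be in flight at once across the whole suite."""

    def __init__(self, limit: int = 2) -> None:
        self._semaphore = asyncio.Semaphore(limit)

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        # Callers beyond the limit wait here until a slot frees up.
        async with self._semaphore:
            return await call()
```

Only create, delete, and get-configuration style calls would go through `run`; data-plane calls would bypass the gate, matching the split described above.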
The CI run no longer throttles IAM or Lambda control-plane (those fixes held),
but parallelism surfaced two shared-file races:

- 'Cannot create .../dotnet/tools/.store/amazon.lambda.tools/6.0.6 because a
  file or directory with the same name already exists': the three
  *.IntegrationTests projects run DeploymentScript.ps1 in parallel and each ran
  'dotnet tool install -g Amazon.Lambda.Tools', colliding on the global tool
  store. Make the install idempotent: skip if already installed, and tolerate
  the concurrent-install race (already-installed/already-exists treated as
  success) with a short retry.

- 'function.zip ... being used by another process' (ApproverFunction): a test
  function that is the external function for more than one test was zipped to a
  shared bin/function.zip by multiple parallel tests at once. Zip to a unique
  temp path per call instead; the read-only published output is still shared.
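
For the second race, the fix amounts to giving every packaging call its own output file. A minimal Python sketch, assuming a hypothetical helper name `zip_to_unique_path` and a `function-` filename prefix chosen for illustration:

```python
import os
import tempfile
import zipfile
from pathlib import Path


def zip_to_unique_path(publish_dir: Path) -> Path:
    """Zip a published output directory into a fresh temp file and return its path.

    The publish directory is only read, so it can be shared by parallel tests;
    the archive path is unique per call, so no two callers write the same file.
    """
    fd, name = tempfile.mkstemp(prefix="function-", suffix=".zip")
    os.close(fd)  # ZipFile reopens the path itself
    zip_path = Path(name)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(publish_dir.rglob("*")):
            if file.is_file():
                archive.write(file, file.relative_to(publish_dir).as_posix())
    return zip_path
```

The caller owns the returned file and would delete it once the upload finishes.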