Speed up and stabilize integration tests (parallelism + deploy retry)#2451
Draft
GarrettBeatty wants to merge 8 commits into
…ation failure

The TestCustomAuthorizerApp integration test stack deploys many Lambda functions that reference IAM roles created in the same stack. CloudFormation occasionally calls Lambda CreateFunction before the role's trust policy has propagated through IAM, producing "The role defined for the function cannot be assumed by Lambda" and rolling the whole stack back, which fails all 20 tests in the project.

Wrap the deploy in a retry loop (3 attempts). Between attempts, delete the rolled-back stack (a ROLLBACK_COMPLETE stack cannot be re-created) and pause briefly to let IAM settle. Surface CloudFormation failed-resource events on each failure for easier debugging.
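The retry-with-cleanup pattern described in this commit can be sketched as follows. This is a language-neutral illustration in Python, not the code in DeploymentScript.ps1; `deploy_with_retry`, `deploy`, and `delete_stack` are hypothetical names, and a `RuntimeError` stands in for a failed deployment.

```python
import time
from typing import Callable


def deploy_with_retry(
    deploy: Callable[[], None],
    delete_stack: Callable[[], None],
    attempts: int = 3,
    pause_seconds: float = 0.0,
) -> int:
    """Run deploy() up to `attempts` times; return the attempt that succeeded."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            deploy()
            return attempt
        except RuntimeError as error:  # stand-in for a failed deployment
            print(f"attempt {attempt} failed: {error}")
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            # A rolled-back stack has to be removed before it can be created again.
            delete_stack()
            # Give eventually consistent state (here: IAM) time to settle.
            time.sleep(pause_seconds)
    raise AssertionError("unreachable")
```

The cleanup step runs only between attempts, so a final failure leaves the rolled-back stack in place for inspection.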
The integration-test phase ran everything serially and dominated CI wall-clock. Four independent changes cut that down:

- run-integ-tests now runs each *.IntegrationTests.csproj concurrently (buildtools/run-integ-tests-parallel.ps1). Each project deploys its own isolated CloudFormation stack, so they share no state. Replaces the serial MSBuild item-batched Exec.
- LambdaHelper.FilterByCloudFormationStackAsync now lists the stack's resources via CloudFormation ListStackResources instead of scanning every Lambda in the account and reading each function's tags one at a time. O(stack size) instead of O(account size), and no longer throttles in a shared test account.
- TestServerlessApp and TestCustomAuthorizerApp integ tests share their single deployed-stack fixture across the assembly via IAssemblyFixture (the Xunit.Extensions.AssemblyFixture package) instead of one serial [Collection]. The stack still deploys once, but the test classes now run in parallel.
- The durable execution integ suite (45 independent tests, each deploying its own uniquely-named function) no longer forces maxParallelThreads=1; its build helper already guards concurrent publishes with a per-directory file lock.

Verified end-to-end against AWS: TestCustomAuthorizerApp deploys its stack once and all 20 tests pass under the parallel AssemblyFixture setup.
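To illustrate the second change, here is a minimal sketch of picking a stack's Lambda functions out of a ListStackResources result. It is written in Python for brevity and is not the C# in LambdaHelper; `lambda_names_in_stack` is a hypothetical name. The dictionary keys mirror the `ResourceType` and `PhysicalResourceId` fields of CloudFormation's stack resource summaries.

```python
from typing import Iterable


def lambda_names_in_stack(stack_resources: Iterable[dict]) -> set[str]:
    """Return the physical names of the Lambda functions in one stack.

    `stack_resources` is assumed to be the already-paginated list of
    resource summaries for a single stack, so the work is proportional
    to the stack size, not to the number of functions in the account.
    """
    return {
        resource["PhysicalResourceId"]
        for resource in stack_resources
        if resource.get("ResourceType") == "AWS::Lambda::Function"
        and resource.get("PhysicalResourceId")  # skip resources not created yet
    }
```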
…letion

The parallel runner captured each project's output with Out-String and only printed it after the project finished, so nothing appeared during the long integration-test run. Stream each line to the host as it arrives, prefixed with the project name so the interleaved parallel logs stay attributable. Failed projects still get their full output reprinted as one clean block at the end.
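The streaming behaviour can be sketched like this, in Python rather than the PowerShell of the actual runner. `run_projects` and `emit` are hypothetical names, and each project's output is modelled as a plain iterable of lines instead of a child process.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_projects(
    projects: dict[str, Iterable[str]],
    emit: Callable[[str], None] = print,
) -> dict[str, list[str]]:
    """Stream every line as it arrives and keep a per-project copy."""
    emit_lock = threading.Lock()  # keeps whole lines from interleaving
    captured: dict[str, list[str]] = {name: [] for name in projects}

    def pump(name: str, lines: Iterable[str]) -> None:
        for line in lines:
            captured[name].append(line)  # kept for an end-of-run reprint
            with emit_lock:
                emit(f"[{name}] {line}")  # prefix keeps the log attributable

    with ThreadPoolExecutor(max_workers=max(1, len(projects))) as pool:
        futures = [pool.submit(pump, name, lines) for name, lines in projects.items()]
        for future in futures:
            future.result()  # re-raise any worker exception
    return captured
```

The returned per-project copy is what allows a failed project's output to be reprinted as one uninterrupted block after the run.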
…fixed delay

InvokeOperationTests.InvokeAsync_FreshExecution_CheckpointsStartAndSuspends failed intermittently on net10.0 (e.g. CI run on PR #2451). The suspend-path tests kicked off an operation, slept a fixed 10-50ms, then asserted tm.IsTerminated. Under CI thread-pool pressure the suspend signal didn't always fire within that window, so the assert raced and failed.

TerminationManager already exposes TerminationTask, a Task that completes exactly when Terminate() fires. Replace the fixed delays with a shared tm.WaitForTerminationAsync() helper that awaits that task (bounded by a 10s timeout so a genuine non-suspension still fails fast at the assert). Applied to all 13 suspend-gated sites across 5 test files.

Verified: full suite passes on net8.0 and net10.0, and the previously-flaky test passed 25/25 consecutive runs on net10.0. Also faster: tests resume the instant suspension fires instead of always sleeping.
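The difference between sleeping for a fixed interval and awaiting the signal itself can be sketched with asyncio. This is a Python analogue, not the C# helper: `TerminationSignal` is a hypothetical stand-in for TerminationManager, and the timings in `demo` are arbitrary.

```python
import asyncio


class TerminationSignal:
    """Hypothetical stand-in for an object that signals termination once."""

    def __init__(self) -> None:
        self._event = asyncio.Event()

    def terminate(self) -> None:
        self._event.set()

    @property
    def is_terminated(self) -> bool:
        return self._event.is_set()

    async def wait_for_termination(self, timeout: float = 10.0) -> None:
        """Return as soon as terminate() fires, or after `timeout` seconds."""
        try:
            await asyncio.wait_for(self._event.wait(), timeout)
        except asyncio.TimeoutError:
            pass  # the caller's assert on is_terminated reports the failure


async def demo() -> bool:
    signal = TerminationSignal()
    # Simulate the operation suspending a little later than a short sleep would allow.
    asyncio.get_running_loop().call_later(0.05, signal.terminate)
    await signal.wait_for_termination(timeout=2.0)
    return signal.is_terminated
```

The wait ends the moment the signal fires, so the timeout only bounds the failure case and does not slow the passing case.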
Running the durable integ suite in parallel (maxParallelThreads=4) surfaced two contention problems that this addresses.

IAM 'Rate exceeded': each test created and deleted its own IAM role, so several deployments hammered IAM's (global, single-bucket, low-rate) mutating APIs at once. Replace per-test roles with a single shared execution role (durable-integ-shared-execution-role) created at most once per account and reused across tests and runs, gated so concurrent deployments don't race. It carries the union of permissions every scenario needs (invoke durable-integ-* functions + send durable-execution callbacks); no test depends on a role lacking a permission, so one role is safe. Dispose no longer deletes roles. Clients also use adaptive retry as a backstop.

Build thrash/timeouts: each test published its function separately and wiped obj/bin first, so the shared source projects (Amazon.Lambda.DurableExecution etc.) were rebuilt per-test, and concurrent publishes thrashed MSBuild into 'dotnet timed out'. Publish all functions once, up front, in a single MSBuild pass via a generated traversal project (Restore;Publish, BuildInParallel) that builds the shared projects once and publishes each function to its own bin/publish; tests then only zip that output.

Verified: 51/51 functions publish in one ~16s pass with 0 errors, and the suite no longer throttles IAM.
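The "created at most once, gated against races" idea for the shared role can be sketched as below. This is an in-process Python illustration only; `SharedRole` and `create_role` are hypothetical names, and the real change also has to cope with a role that already exists from an earlier run, which this sketch does not model.

```python
import threading
from typing import Callable, Optional


class SharedRole:
    """Create an expensive shared resource once and hand it to every caller."""

    def __init__(self, create_role: Callable[[], str]) -> None:
        self._create_role = create_role
        self._lock = threading.Lock()
        self._arn: Optional[str] = None

    def get_arn(self) -> str:
        # The lock serialises the first callers so only one of them creates
        # the role; later callers reuse the cached value.
        with self._lock:
            if self._arn is None:
                self._arn = self._create_role()
            return self._arn
```

With this shape, teardown has nothing to delete, which matches the commit's note that Dispose no longer deletes roles.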
MaxSizeProducesOneLogFrame intermittently failed with 'Expected: 16, Actual: 15' on the header length. The header ends with an 8-byte big-endian microsecond timestamp; roughly 1 in 256 timestamps ends in a 0x00 byte. TestFileStream's Write captured bytes via TrimTrailingNullBytes(buffer).Take(count), which stripped that legitimate trailing zero, yielding a 15-byte header. Capture exactly buffer[offset, offset + count) instead — that is precisely what the production code wrote, and it no longer depends on the timestamp's value.
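The failure mode is easy to reproduce in isolation. The sketch below is in Python, not the test helper's C#; `capture_trimmed` and `capture_exact` are hypothetical names, and the first 8 header bytes are arbitrary placeholders (only the trailing 8-byte big-endian timestamp comes from the description above).

```python
def capture_trimmed(buffer: bytes, offset: int, count: int) -> bytes:
    """Lossy capture resembling the old helper: trim trailing zeros, then take `count`."""
    return buffer.rstrip(b"\x00")[:count]


def capture_exact(buffer: bytes, offset: int, count: int) -> bytes:
    """Capture exactly the bytes the writer asked to write."""
    return bytes(buffer[offset:offset + count])


# A timestamp whose least significant byte happens to be 0x00.
timestamp = 0x0102030405060700
header = b"\xA5" * 8 + timestamp.to_bytes(8, "big")  # 16-byte header
write_buffer = header + b"\x00" * 16                 # writer's buffer, zero padded

print(len(capture_trimmed(write_buffer, 0, 16)))  # 15: the legitimate zero was stripped
print(len(capture_exact(write_buffer, 0, 16)))    # 16
```

Only one of the 256 possible final byte values is zero, which is consistent with the roughly 1-in-256 failure rate.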
After the shared-role fix removed IAM throttling, the throttling moved to Lambda's account-wide control-plane APIs: with maxParallelThreads=4, the combination of CreateFunction + DeleteFunction + WaitForFunctionActive polling GetFunctionConfiguration exceeded Lambda's limits, surfacing as 'Rate exceeded' and adaptive retry's 'capacity could not be obtained'. Two compounding causes are addressed:

- Each deployment built its own AWS clients, so adaptive retry's per-client rate limiter couldn't coordinate across the parallel deployments: N clients each assumed they had capacity and fired at once. Make the Lambda and IAM clients static/shared so adaptive retry actually paces the whole suite.
- Cap concurrent Lambda control-plane calls (create/delete/get-configuration) with a suite-wide semaphore (limit 2) via a RunControlPlaneAsync helper, so the 4 parallel test threads don't collectively exceed Lambda's control-plane rate. Data-plane calls (Invoke, durable-execution reads) are not gated.

Also slow the WaitForFunctionActive poll from 2s to 3s to cut its call rate.
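A suite-wide gate of this kind can be sketched with an asyncio semaphore. This is a Python analogue of the idea behind RunControlPlaneAsync, not its C# implementation; `ControlPlaneGate`, `fake_create_function`, and the `peak` counter (added only so the cap is observable) are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ControlPlaneGate:
    """Allow at most `limit` gated calls to be in flight at once."""

    def __init__(self, limit: int = 2) -> None:
        self._semaphore = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for demonstration

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        async with self._semaphore:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await call()
            finally:
                self.in_flight -= 1


async def demo() -> tuple[list[str], int]:
    gate = ControlPlaneGate(limit=2)

    async def fake_create_function() -> str:
        await asyncio.sleep(0.01)  # stands in for a control-plane round trip
        return "ok"

    results = await asyncio.gather(*(gate.run(fake_create_function) for _ in range(8)))
    return list(results), gate.peak
```

Calls that should not be throttled, such as data-plane invocations, would simply bypass `gate.run`.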
The CI run no longer throttles IAM or Lambda control-plane (those fixes held), but parallelism surfaced two shared-file races:

- 'Cannot create .../dotnet/tools/.store/amazon.lambda.tools/6.0.6 because a file or directory with the same name already exists': the three *.IntegrationTests projects run DeploymentScript.ps1 in parallel and each ran 'dotnet tool install -g Amazon.Lambda.Tools', colliding on the global tool store. Make the install idempotent: skip if already installed, and tolerate the concurrent-install race (already-installed/already-exists treated as success) with a short retry.
- 'function.zip ... being used by another process' (ApproverFunction): a test function that is the external function for more than one test was zipped to a shared bin/function.zip by multiple parallel tests at once. Zip to a unique temp path per call instead; the read-only published output is still shared.
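The second fix, zipping to a unique path per call while sharing the read-only input, can be sketched as follows. This is Python rather than the C# test helper; `zip_published_output` and the `function-` prefix are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_published_output(publish_dir: str) -> str:
    """Zip `publish_dir` to a fresh temp file and return that file's path.

    Each call gets its own output path, so parallel callers that package
    the same published directory never write to the same archive.
    """
    fd, zip_path = tempfile.mkstemp(prefix="function-", suffix=".zip")
    os.close(fd)  # only the unique name is needed; ZipFile reopens it
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(publish_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the publish directory.
                archive.write(full_path, os.path.relpath(full_path, publish_dir))
    return zip_path
```

The caller owns the returned file and is responsible for deleting it once the upload is done.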
Summary
Started as a fix for a flaky CI failure in TestCustomAuthorizerApp.IntegrationTests, then grew into a broader effort to speed up and stabilize the integration-test phase (which dominated CI wall-clock by running everything serially), plus fixes for a few flaky unit tests surfaced along the way.

Reliability
- DeploymentScript.ps1 now retries the deploy (deleting the rolled-back stack between attempts, since ROLLBACK_COMPLETE can't be re-created) and surfaces CloudFormation failure events.

Speed
- run-integ-tests now runs each *.IntegrationTests.csproj concurrently (run-integ-tests-parallel.ps1); each project deploys its own isolated stack, so they share no state.
- LambdaHelper.FilterByCloudFormationStackAsync uses CloudFormation ListStackResources instead of scanning every Lambda in the account and reading each function's tags: O(stack) instead of O(account), and no shared-account throttling.
- TestServerlessApp and TestCustomAuthorizerApp share their single deployed-stack fixture across the assembly via IAssemblyFixture instead of one serial [Collection], so the test classes run in parallel (stack still deploys once).
- … maxParallelThreads=1).
- … Restore;Publish, BuildInParallel) builds the shared dependency projects once and publishes every function to its own bin/publish; tests then only zip the output, replacing per-test cold publishing.

Making durable parallelism safe (rate limits & races)
Enabling parallelism in the durable suite surfaced a series of shared-resource contention issues, fixed in layers:
- … CreateFunction/DeleteFunction/GetFunctionConfiguration) with a suite-wide gate.
- … dotnet tool install across the parallel deploy scripts, and zip each function package to a unique temp path (a function used by more than one test was being zipped to a shared path concurrently).

Developer experience
Flaky unit-test fixes (unrelated to the integ work, surfaced in CI)
- … InvokeOperationTests et al.): replaced fixed Task.Delay waits before asserting suspension with a deterministic await on the termination signal (TerminationManager.TerminationTask), bounded by a timeout; the fixed delays raced under CI thread-pool pressure.
- FileDescriptorLogStream test: the test helper trimmed trailing null bytes from captured output, which flaked ~1/256 of the time when a log header's timestamp ended in 0x00 (16-byte header read as 15). Now captures exactly the bytes written.

Testing
- … IAssemblyFixture setup; the shared-role + single-pass publish path works (51/51 functions publish in one MSBuild pass); the previously-flaky unit tests pass and are now deterministic.