Fix BigQuery Storage Write API stream count for batch writes #38777

Closed
SPUERSAIYAN wants to merge 4 commits into apache:master from SPUERSAIYAN:master

@SPUERSAIYAN

What changed

Fixes #38770

This fixes handling of num_storage_api_streams / num_streams for BigQuery Storage Write API batch writes.

Previously, the Java schema transform only applied num_streams inside the unbounded pipeline branch, so bounded batch pipelines using the Storage Write API ignored the configured fixed stream count. This change applies withNumStorageWriteApiStreams(...) whenever num_streams > 0, while keeping triggering frequency and auto-sharding behavior limited to unbounded pipelines.
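The branching described above can be modeled with a small, self-contained sketch. This is not the code from the PR: the class `StreamCountSketch`, the method `settingsFor`, and the string labels are hypothetical stand-ins, and only `withNumStorageWriteApiStreams` and the option names are taken from the description. It assumes the fixed stream count is applied first and the unbounded-only settings afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the decision logic described in the PR text.
// All names here are hypothetical; no Beam classes are used.
final class StreamCountSketch {

  /** Returns labels for the settings that would be applied, in order. */
  static List<String> settingsFor(boolean unbounded, int numStreams, boolean autoSharding) {
    List<String> applied = new ArrayList<>();

    // After the fix: a positive fixed stream count applies to bounded
    // and unbounded input alike.
    if (numStreams > 0) {
      applied.add("withNumStorageWriteApiStreams(" + numStreams + ")");
    }

    // Triggering frequency and auto-sharding stay limited to unbounded input.
    if (unbounded) {
      applied.add("triggeringFrequency");
      if (autoSharding) {
        applied.add("autoSharding");
      }
    }
    return applied;
  }

  public static void main(String[] args) {
    System.out.println("bounded:   " + settingsFor(false, 4, true));
    System.out.println("unbounded: " + settingsFor(true, 4, false));
  }
}
```

In this model a bounded pipeline with `num_streams = 4` ends up with only the fixed stream count, even if `auto_sharding` is set, which matches the behavior the description claims for batch writes.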

Details

  • Applies fixed Storage Write API stream counts to bounded and unbounded schema transform writes.
  • Keeps auto_sharding as an unbounded-only option.
  • Removes outdated documentation that described num_storage_api_streams / num_streams as streaming-only.
  • Updates Java, Python, and website docs to reflect batch support.
  • Adds a bounded pipeline translation test that verifies fixed stream counts produce the expected batch redistribute step.

Validation

  • git diff --check
  • ./gradlew.bat --no-daemon --console=plain :sdks:java:io:google-cloud-platform:spotlessCheck

CHANGES.md

Not updated because this is a targeted behavior/documentation fix rather than a broad user-facing feature addition.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where batch pipelines using the BigQuery Storage Write API were unable to utilize fixed stream counts. By modifying the schema transform logic, the changes ensure that configured stream counts are respected in batch contexts, while maintaining existing auto-sharding behavior exclusively for streaming pipelines. The update also cleans up documentation and includes a validation test to ensure consistent behavior across pipeline types.

Highlights

  • BigQuery Storage Write API Support: Enabled the use of fixed stream counts for batch pipelines using the BigQuery Storage Write API, allowing for better control over parallelism.
  • Logic Refinement: Restricted auto-sharding behavior to unbounded pipelines while ensuring fixed stream counts are correctly applied to both bounded and unbounded writes.
  • Documentation Updates: Updated Java, Python, and website documentation to clarify that fixed stream counts are supported for both batch and streaming pipelines.
  • Testing: Added a new bounded pipeline translation test to verify that fixed stream counts correctly trigger the expected redistribute step.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the BigQuery Storage Write API implementation and documentation to support configuring the number of write streams for both batch and streaming pipelines, rather than limiting it to streaming pipelines. It modifies the schema transform provider to apply the stream count configuration to both bounded and unbounded writes and adds a corresponding unit test. The review feedback highlights two issues in the schema transform provider: a lack of validation for negative stream counts and an issue where explicitly setting the stream count to zero is ignored, preventing users from overriding pipeline options. A code suggestion is provided to resolve these issues.
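The two concerns raised in the review can be illustrated with a short sketch. The code suggestion itself is not reproduced in this thread, so this is only an assumed shape of such a check: `StreamCountValidation` and `resolve` are hypothetical names, and treating `null` as "not configured" is an assumption made for the example.

```java
// Hypothetical helper illustrating the review's two points.
// This is not the suggestion posted on the PR.
final class StreamCountValidation {

  /**
   * Resolves the stream count that should be applied.
   * Returns null when nothing was configured, so a caller could fall back
   * to another source such as pipeline options.
   */
  static Integer resolve(Integer numStreams) {
    if (numStreams == null) {
      return null; // not configured at all
    }
    if (numStreams < 0) {
      // Point 1: negative values are rejected instead of silently ignored.
      throw new IllegalArgumentException(
          "num_streams must be non-negative, got " + numStreams);
    }
    // Point 2: an explicit 0 is returned as-is instead of being treated
    // as "unset", so it can still act as an override.
    return numStreams;
  }

  public static void main(String[] args) {
    System.out.println("unset      -> " + resolve(null));
    System.out.println("explicit 0 -> " + resolve(0));
    System.out.println("explicit 3 -> " + resolve(3));
  }
}
```

The design choice here is to distinguish "unset" from "zero" with a nullable `Integer`, whereas a plain `num_streams > 0` check cannot tell the two apart.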

@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

Assigning reviewers:

R: @tvalentyn for label python.
R: @Abacn for label java.
R: @Abacn for label website.

Note: If you would like to opt out of this review, comment "assign to next reviewer".

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@stankiewicz
Contributor

I see there is also #38776, which looks a bit simpler.

@SPUERSAIYAN
Author

"I see there is also #38776, which looks a bit simpler."

Thanks for pointing this out. I understand that #38776 is another PR with a smaller change to this issue.

If #38776 already fixes the problem, I'm happy to close this PR to avoid duplicate work.

@stankiewicz stankiewicz closed this Jun 3, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Fix num_storage_api_streams for batch

2 participants