Guard compute_mfu against a zero denominator (world_size == 0)#6174

Open
CharlesCNorton wants to merge 1 commit into huggingface:main from CharlesCNorton:fix-compute-mfu-zero-division


CharlesCNorton commented Jun 25, 2026

What does this PR do?

compute_mfu (in trl/trainer/utils.py) divides by peak_flops_per_device * world_size with no guard, so a zero denominator (world_size == 0, or a zero peak_flops_per_device) raises ZeroDivisionError: float division by zero:

from trl.trainer.utils import compute_mfu

compute_mfu(flops_per_token=1000, tokens_per_second=5000.0, world_size=0)
# ZeroDivisionError: float division by zero

compute_mfu is a metric helper for the trainer's logging path, so a degenerate input should not be able to abort training. This PR makes it return 0.0 when the denominator is zero (there is no compute capacity to utilize, so MFU is 0) instead of raising; behavior for normal inputs is unchanged.
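
For readers who want to see the shape of the guard, here is a minimal sketch. It is not the code in trl/trainer/utils.py: the function name `compute_mfu_sketch`, the default value for `peak_flops_per_device`, and the exact formula (achieved FLOPs per second divided by aggregate peak FLOPs) are assumptions made for illustration, based only on the parameters named in this description.

```python
def compute_mfu_sketch(
    flops_per_token: float,
    tokens_per_second: float,
    world_size: int = 1,
    peak_flops_per_device: float = 1e15,  # hypothetical placeholder, not a real device spec
) -> float:
    """Illustrative stand-in for a guarded MFU helper (assumed formula)."""
    # Aggregate peak throughput across all devices.
    denominator = peak_flops_per_device * world_size
    # Guard: with no compute capacity there is nothing to utilize, so report 0.0
    # instead of letting the division raise ZeroDivisionError.
    if denominator == 0:
        return 0.0
    return flops_per_token * tokens_per_second / denominator


# Degenerate input returns 0.0 rather than raising.
zero_case = compute_mfu_sketch(flops_per_token=1000, tokens_per_second=5000.0, world_size=0)

# Normal input takes the ordinary division path.
normal_case = compute_mfu_sketch(
    flops_per_token=1000,
    tokens_per_second=5000.0,
    world_size=2,
    peak_flops_per_device=1e7,
)
```

Checking the product rather than `world_size` alone covers both degenerate cases mentioned above (a zero `world_size` and a zero `peak_flops_per_device`) with a single branch.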

A regression test is added to TestComputeMfu in tests/test_utils.py.
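
A regression test along these lines could look like the sketch below. It is an approximation, not the test added in tests/test_utils.py: it exercises a local stand-in (`_guarded_mfu`, a hypothetical name with an assumed formula and placeholder default) so that it runs without trl installed, and the class name `TestComputeMfuSketch` is likewise illustrative.

```python
import unittest


def _guarded_mfu(flops_per_token, tokens_per_second, world_size=1, peak_flops_per_device=1e15):
    # Hypothetical stand-in for the guarded helper; the formula is an assumption.
    denominator = peak_flops_per_device * world_size
    if denominator == 0:
        return 0.0
    return flops_per_token * tokens_per_second / denominator


class TestComputeMfuSketch(unittest.TestCase):
    def test_zero_world_size_returns_zero(self):
        # The input from the reproduction above must no longer raise.
        result = _guarded_mfu(flops_per_token=1000, tokens_per_second=5000.0, world_size=0)
        self.assertEqual(result, 0.0)

    def test_zero_peak_flops_returns_zero(self):
        # The other way the denominator can be zero.
        result = _guarded_mfu(1000, 5000.0, world_size=4, peak_flops_per_device=0.0)
        self.assertEqual(result, 0.0)


def run_sketch_tests():
    # Run the cases programmatically and return the unittest result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestComputeMfuSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

In the real test suite the stand-in would be replaced by the import shown in the reproduction, `from trl.trainer.utils import compute_mfu`.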

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.


Note

Low Risk
Tiny change to a metric helper with no effect on normal training inputs; only avoids crashes on edge-case logging values.

Overview
compute_mfu no longer divides when peak_flops_per_device * world_size is zero (e.g. world_size == 0 or zero peak FLOPs). It returns 0.0 instead of raising ZeroDivisionError, so degenerate inputs on the trainer MFU logging path cannot crash training. Valid inputs behave the same as before.

A regression test test_zero_world_size_returns_zero was added in TestComputeMfu.

Reviewed by Cursor Bugbot for commit 3839e9a. Bugbot is set up for automated code reviews on this repo.

compute_mfu divided by peak_flops_per_device * world_size with no guard, raising ZeroDivisionError when world_size (or peak_flops_per_device) is 0. Return 0.0 in that case so a metric on the logging path cannot abort training.