Guard compute_mfu against a zero denominator (world_size == 0) by CharlesCNorton · Pull Request #6174 · huggingface/trl

CharlesCNorton · 2026-06-25T04:11:36Z

What does this PR do?

compute_mfu (in trl/trainer/utils.py) divides by peak_flops_per_device * world_size with no guard, so a zero denominator (world_size == 0, or a zero peak_flops_per_device) raises ZeroDivisionError: float division by zero:

from trl.trainer.utils import compute_mfu

compute_mfu(flops_per_token=1000, tokens_per_second=5000.0, world_size=0)
# ZeroDivisionError: float division by zero

compute_mfu is a metric helper for the trainer's logging path, so a degenerate input should not be able to abort training. This returns 0.0 when the denominator is zero (no compute capacity to utilize, so MFU is 0) instead of raising; normal inputs are unchanged.
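The guard described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the code in trl/trainer/utils.py: the name compute_mfu_sketch, the explicit peak_flops_per_device parameter, and the numerator (flops_per_token * tokens_per_second) are assumptions made so the example runs on its own.

```python
def compute_mfu_sketch(flops_per_token, tokens_per_second, peak_flops_per_device, world_size=1):
    """Hypothetical stand-in for an MFU helper with a zero-denominator guard."""
    # Total peak compute available across all devices.
    denominator = peak_flops_per_device * world_size
    if denominator == 0:
        # No compute capacity to utilize, so report 0.0 instead of raising.
        return 0.0
    # Assumed numerator: FLOPs actually achieved per second.
    achieved_flops_per_second = flops_per_token * tokens_per_second
    return achieved_flops_per_second / denominator


if __name__ == "__main__":
    # Degenerate input: returns 0.0 rather than raising ZeroDivisionError.
    print(compute_mfu_sketch(1000, 5000.0, peak_flops_per_device=1e12, world_size=0))
    # Normal input: takes the usual division path.
    print(compute_mfu_sketch(1000, 5000.0, peak_flops_per_device=1e7, world_size=1))
```

Returning 0.0 rather than NaN or None keeps the logged value a plain float, which matches the PR's stated goal that a metric on the logging path should not abort training.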

A regression test is added to TestComputeMfu in tests/test_utils.py.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.

Note

Low Risk
Tiny change to a metric helper with no effect on normal training inputs; only avoids crashes on edge-case logging values.

Overview
compute_mfu no longer divides when peak_flops_per_device * world_size is zero (e.g. world_size == 0 or zero peak FLOPs). It returns 0.0 instead of raising ZeroDivisionError, so degenerate inputs on the trainer MFU logging path cannot crash training. Valid inputs behave the same as before.

A regression test test_zero_world_size_returns_zero was added in TestComputeMfu.
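The body of the added test is not shown in this description. A regression test along these lines could look like the sketch below. The helpers unguarded_mfu and guarded_mfu, and their peak_flops_per_device parameter, are hypothetical stand-ins so the block runs without trl installed; only the test method name echoes the one mentioned above.

```python
import unittest


def unguarded_mfu(flops_per_token, tokens_per_second, peak_flops_per_device, world_size):
    # Hypothetical pre-fix behavior: divides with no check on the denominator.
    return flops_per_token * tokens_per_second / (peak_flops_per_device * world_size)


def guarded_mfu(flops_per_token, tokens_per_second, peak_flops_per_device, world_size):
    # Hypothetical post-fix behavior: a zero denominator yields 0.0.
    denominator = peak_flops_per_device * world_size
    if denominator == 0:
        return 0.0
    return flops_per_token * tokens_per_second / denominator


class TestComputeMfuSketch(unittest.TestCase):
    def test_unguarded_raises_on_zero_world_size(self):
        # Documents the failure mode the PR describes.
        with self.assertRaises(ZeroDivisionError):
            unguarded_mfu(1000, 5000.0, 1e12, 0)

    def test_zero_world_size_returns_zero(self):
        self.assertEqual(guarded_mfu(1000, 5000.0, 1e12, 0), 0.0)

    def test_normal_inputs_unchanged(self):
        # Both versions should agree when the denominator is nonzero.
        args = (1000, 5000.0, 1e12, 8)
        self.assertEqual(guarded_mfu(*args), unguarded_mfu(*args))


if __name__ == "__main__":
    unittest.main()
```

The third case covers the claim that valid inputs behave the same as before, which a zero-only test would not catch.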

Reviewed by Cursor Bugbot for commit 3839e9a. Bugbot is set up for automated code reviews on this repo.

compute_mfu divided by peak_flops_per_device * world_size with no guard, raising ZeroDivisionError when world_size (or peak_flops_per_device) is 0. Return 0.0 in that case so a metric on the logging path cannot abort training.

CharlesCNorton wants to merge 1 commit into huggingface:main from CharlesCNorton:fix-compute-mfu-zero-division
