feat(`grpo_trainer.py`): STARE — Surprisal-guided Token-Level Advantage Reweighting by smellslikeml · Pull Request #6167 · huggingface/trl