Train your own R1 reasoning model with Unsloth
(unsloth.ai)
from yogthos@lemmy.ml to technology@lemmy.ml on 08 Feb 2025 03:01
https://lemmy.ml/post/25750435
Unsloth adds reasoning support through Group Relative Policy Optimization (GRPO). With GRPO, users can turn standard models into reasoning models locally with as little as 7GB of VRAM. Previously, GRPO was only supported with full fine-tuning, but it now works with QLoRA and LoRA.
Unlike Proximal Policy Optimization (PPO), GRPO optimizes responses efficiently without requiring a separate value function. Use cases include building customized models with reward functions or generating reasoning traces for existing input-output data.
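The "no value function" point is the core of GRPO: instead of training a critic to estimate baselines (as PPO does), it samples a group of completions per prompt, scores each with a reward function, and uses each reward's deviation from the group mean as the advantage. A minimal sketch of that advantage computation, assuming the commonly published formulation (the group size, reward values, and normalization epsilon here are illustrative, not Unsloth's internals):

```python
# Sketch of GRPO's group-relative advantage: for one prompt, score a
# group of sampled completions and normalize each reward against the
# group's own statistics. No learned value function is required.

def group_relative_advantages(rewards):
    """Advantages for one group of completions: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    eps = 1e-8  # guard against division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four completions for one prompt, scored by a
# simple rule-based reward (e.g. +1 if the final answer is correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct completions get positive advantage, incorrect ones negative;
# the policy is then updated to make high-advantage completions more likely.
```

This is why GRPO pairs naturally with the custom-reward use case mentioned above: any scoring rule (format checks, answer correctness, length penalties) can drive training without a critic network.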