nov 25

RL: absolute scoring or tagging

you are in an RL research engineer interview and the interviewer asks: "what would you have the judge produce in RL to determine whether the new response is better than the old one: absolute scoring or tagging (improved / not improved)?" you don't know the right answer, but scoring feels more deterministic, so you figure that should be it. here's why it's not.

when large language models are trained, there is no shared convention for what the numbers actually mean. one dataset can say 7/10 is average while another says 5/10 is average and 7/10 is a good result. the model ends up deeply confused about numerical scoring, so a judge has no external anchor for its scores. it's based on "vibes". that makes the scoring inconsistent and meaningless.

LLMs also tend to drift upwards in scores. why? during pre-training, LLMs are trained on human text, and humans are mostly polite and give positive, encouraging evaluations. 8/10 appears way more often in text than 2/10, and 0/10 is almost nonexistent. RL feedback also tends to push towards encouragement rather than penalty, so the model learns:

harsh score → bad rater feedback
constructive, positive score → good rater feedback

that's exactly how reward hacking works.

"improved", however, always means the same thing. it doesn't matter how much better or worse the model has gotten. it creates a direction gradient instead of a magnitude gradient, and direction is much more deterministic for both models and humans. imagine you updated weights based on numeric rewards:

reward = +1 → slight improvement
reward = +5 → big improvement

your gradient step size is now tied to a score scale that is unstable. on the other hand:

improved → learn from this example
not improved → ignore this example

so the model kinda learns "go more in this direction" (not how far, just which way), which is a much more stable landscape to improve in.
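a toy sketch of the step-size argument above. this isn't any real RL library: the judges, scores, and learning rate are all made up. the point is that two numeric judges can agree a response improved yet produce different-sized updates, while binary tagging always produces the same fixed step.

```python
def numeric_update(theta, score, lr=0.1):
    # step size is tied to whatever scale the judge happens to use
    return theta + lr * score

def binary_update(theta, improved, lr=0.1):
    # step size is fixed: learn from the example or ignore it
    return theta + lr * (1.0 if improved else 0.0)

theta = 0.0

# two hypothetical judges rate the SAME response pair as "a bit better":
# judge A is calibrated so 5/10 is average, judge B so 7/10 is average
step_a = numeric_update(theta, 7) - theta   # 0.7
step_b = numeric_update(theta, 9) - theta   # 0.9 -> same verdict, different magnitude

# binary tagging: both judges just say "improved"
step_bin_a = binary_update(theta, True) - theta   # 0.1
step_bin_b = binary_update(theta, True) - theta   # 0.1 -> identical, stable step
```

same judgment, but the numeric path bakes the judge's arbitrary scale into the gradient; the binary path only keeps the direction.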