you are in an RL research engineer interview and the interviewer asks: "if you were using an LLM judge in RL to decide whether a new response is better than the old one, what would you have it output: an absolute score, or a tag (improved / not improved)?"
you don't know the right answer, but scoring feels more deterministic, so you go with that.
here is why it's not:
when large language models are trained, there is no shared convention for what the numbers actually mean. one dataset treats 7/10 as average, while another treats 5/10 as average and 7/10 as a good result. by the end of training, the model has no consistent internal sense of numeric scores, so a judge has no external anchor for scoring. it's based on "vibes". that makes the scores inconsistent and close to meaningless.
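a toy sketch of the anchor problem (both rubric functions are hypothetical, invented purely for illustration): the same underlying response quality maps to different numbers depending on which rating convention a judge happened to absorb.

```python
# hypothetical rubrics: two rating conventions a judge might have absorbed
# from different corpora. numbers are illustrative, not from any real dataset.

def rubric_a(quality: float) -> int:
    # convention where 5/10 is average: linear map over the full scale
    return round(10 * quality)

def rubric_b(quality: float) -> int:
    # convention where 7/10 is average: scores squeezed into the top half
    return round(4 + 6 * quality)

q = 0.5  # one "average" response
print(rubric_a(q), rubric_b(q))  # same response, 5/10 under one rubric, 7/10 under the other
```

same input, two different "objective" scores, so a raw score carries no stable meaning across judges.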
LLMs also tend to drift upwards in scores. why? during pre-training they see human text, and humans are mostly polite and give positive, encouraging evaluations. 8/10 appears far more often in text than 2/10, and 0/10 is almost nonexistent. RLHF fine-tuning also pushes toward agreeable outputs rather than harsh ones, so the model learns that:
harsh score → bad rater feedback
constructive, positive score → good rater feedback
and that inflated-score bias is exactly what a policy will learn to exploit: reward hacking.
"improved", however, always means the same thing, whether both responses are weak or both are strong. a comparative judge gives you a direction gradient instead of a magnitude gradient, and direction is much easier to judge consistently, for both models and humans.
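here is a sketch of how the two judge output formats turn into a training signal (the judge call itself is assumed and mocked, and the threshold is a made-up guess): a comparative judge emits one of two tags that parse directly into a decision, while an absolute judge emits a number that only becomes a decision once you pick a threshold you have no way to calibrate.

```python
# hypothetical parsers for the two judge formats; the judge LLM itself is mocked.

def parse_comparative(judge_output: str) -> bool:
    # comparative judge: one of two tags, no scale to interpret.
    # True means "learn from this example".
    return judge_output.strip().lower() == "improved"

def parse_absolute(judge_output: str, threshold: float = 7.0) -> bool:
    # absolute judge: needs a threshold -- but is 7/10 "good"?
    # that depends entirely on which rubric the judge vibes with.
    return float(judge_output) >= threshold

print(parse_comparative("improved"))       # True
print(parse_comparative("not improved"))   # False
print(parse_absolute("7"))                 # True, but only under our guessed scale
```

the comparative parser has no free parameter; the absolute one smuggles the whole calibration problem into `threshold`.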
imagine you updated weights based on numeric rewards:
reward = +1 → slight improvement
reward = +5 → big improvement
your gradient step size is now tied to a score scale that is unstable. on the other hand:
improved → learn from this example
not improved → ignore this example
and so it learns "go more in this direction" (not how far, just which way), which is a much more stable landscape to improve in.
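the contrast above can be shown with a toy REINFORCE-style update on a single weight (all numbers here are invented for illustration; a real update would act on full parameter tensors):

```python
# toy single-weight updates contrasting the two reward signals.
# lr, grad and reward values are made up for illustration.

def numeric_update(w: float, grad: float, reward: float, lr: float = 0.1) -> float:
    # step size scales with the judge's score -- inherits its unstable scale
    return w + lr * reward * grad

def binary_update(w: float, grad: float, improved: bool, lr: float = 0.1) -> float:
    # fixed-size step in the improving direction; "not improved" examples are dropped
    return w + lr * grad if improved else w

w = 0.0
print(numeric_update(w, grad=1.0, reward=5.0))    # 0.5 -- a 5x bigger step just because the judge felt generous
print(binary_update(w, grad=1.0, improved=True))  # 0.1 -- same step for every improved example
```

with the binary signal, the judge only steers the direction of the update; how far to move stays under your control via the learning rate.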