oct 7

reinforcement fine tuning

openai just released their new agent builder! beyond the "it will destroy 17 yc startups" takes, it seems like a very cool product for general public use, and one thing sticks out the most. no, it isn't the evals or the chat UI. it's RFT: reinforcement fine tuning.

traditionally, models are trained by saying "hey, this is the input, this is the output." RFT goes further: it lets the model explore, make mistakes, get scored, and improve through trial and error.

how does it do that? we start with a pre-trained model like gpt-5. it already understands our language, but we want it to act a certain way, use a certain tool in our workflow, and so on. you give the model some input prompts and let it produce several possible answers (called samples).

prompt: "summarize this support ticket and classify if it is urgent."

model outputs:

- a: "user reports login issue, needs password reset." → eh, maybe incomplete
- b: "user cannot access account; system error 503; high urgency." → much better reasoning
- c: "account issue, recommend escalation." → okayish but needs more explanation

now, another model (or a human grader) scores each output on how well it meets your criteria. this score is the reward signal. rewards can come from human feedback, a custom grader script, or automated metrics (factuality, latency, bla bla bla).

the last step is the key step. the model's parameters are adjusted so that it increases the probability of outputs that got higher rewards. this is often done with a method like PPO (proximal policy optimization), a popular reinforcement learning algorithm that:

- encourages good actions (high reward)
- penalizes bad or unstable ones (low reward)

conceptually:

new_weights = old_weights + learning_rate * ∇(reward)

the model literally learns which behaviors earn high scores and repeats them more often. pretty cool, right?
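the whole loop (sample → grade → nudge weights toward high-reward outputs) can be sketched as a toy REINFORCE-style update. everything here is hypothetical: the grader is just a hand-written dict of rewards, and the "policy" is three logits over the a/b/c summaries above instead of billions of transformer weights. real RFT uses PPO on a full model, but the mechanics are the same trial-and-error idea:

```python
import math
import random

random.seed(0)

# toy "policy": a softmax distribution over three candidate outputs,
# mirroring the a/b/c summaries above
candidates = ["a", "b", "c"]
logits = [0.0, 0.0, 0.0]

# hypothetical grader: scores each output on how well it meets our
# criteria. this is the reward signal.
rewards = {"a": 0.3, "b": 1.0, "c": 0.5}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

lr = 0.5
for step in range(300):
    probs = softmax(logits)
    i = sample(probs)                 # model produces a sample
    reward = rewards[candidates[i]]   # grader scores it
    # baseline = expected reward under the current policy,
    # so we only reinforce outputs that beat the average
    baseline = sum(p * rewards[c] for p, c in zip(probs, candidates))
    advantage = reward - baseline
    # REINFORCE update: new_weights = old_weights + lr * ∇(expected reward)
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

final = softmax(logits)
print({c: round(p, 3) for c, p in zip(candidates, final)})
# probability mass shifts toward "b", the highest-reward output
```

run it and the distribution concentrates on "b": the policy learned, purely from reward signals, which behavior earns high scores. PPO adds a clipping term on top of this so a single update can't swing the policy too far, which is what keeps large-model training stable.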