nov 20

RL

RL, or reinforcement learning, is the study of agents and how they learn from trial and error. the primary goal is to learn an optimal policy that maximizes reward in each state. what even is a policy? a policy is a function that takes the current state and outputs the action the agent should take.

how does the policy magically know what to do? to answer this, we need to know what a Q function is. a Q function tells us how good it is to take an action A in a state S: Q[state][action] = number. but how do we know the future rewards? we don't. we mostly try to predict them using neural networks.

how do we do that? and what's a neural network? a neural network is a giant mathematical function with weights. weights that can be tweaked during training until the action the agent takes at state S is the one with the highest Q value. how does it do that? it starts off as a dumb network. not knowing what to do. much like humans. its Q values are nonsensical. it doesn't know what's right and what's wrong.

how does it get better? bellman's equation. just wait.

loss = (prediction − target)²

in an ideal world we want a function that tells us the future rewards of taking an action A at state S:

Q(s, a) = r + γ · max_a' Q(s', a')

where:
r = immediate reward
s' = next state
γ = discount factor (usually 0.99)
max_a' Q(s', a') = best future value from the next state

pretty intuitive right? it's kinda like recursion but in reverse. Q-value now = reward now + Q-value later.

but how does this help train a neural network? say we are at a state S and have an action A. the network will spit out a number for Q(S, A). then we compute the bellman target, r + γ · max_a' Q(s', a'): take the current reward (decided by the environment) plus the discount times what the neural network thinks the maximum Q over the next state's actions is. now we can compute the loss, which is the target minus the network's current prediction, squared.
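the target-and-loss step above can be sketched in a few lines. this is a toy, not a real network: the Q-values here are a made-up lookup table standing in for the network's predictions, just to show the arithmetic.

```python
# a minimal sketch of the bellman target and squared loss.
# q is a made-up table mapping (state, action) -> value,
# standing in for a neural network's predictions.
gamma = 0.99  # discount factor

q = {
    ("s", "a"): 0.5,    # what the "network" currently predicts for Q(s, a)
    ("s'", "a1"): 1.0,  # predicted Q-values from the next state s'
    ("s'", "a2"): 3.0,
}

def bellman_target(reward, next_state, actions):
    # target = r + gamma * max_a' Q(s', a')
    return reward + gamma * max(q[(next_state, a)] for a in actions)

prediction = q[("s", "a")]
target = bellman_target(reward=1.0, next_state="s'", actions=["a1", "a2"])
loss = (prediction - target) ** 2  # the squared error we want to shrink

print(target)  # 1.0 + 0.99 * 3.0 = 3.97
print(loss)    # (0.5 - 3.97)^2 = 12.0409
```

note the loss only goes to zero when the network's prediction agrees with reward-now plus discounted best-value-later, which is exactly the bellman consistency we're after.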
our aim is to reduce the loss at every step. that's it. how do we do that? adjust the weights in the neural network. who does that? for now, think of it as a black box called the optimizer, which adjusts the weights (we will see more of that in day 2).

so here is my thought: that's how we should think about building software for consumers as well. see what the highest reward for them is. backpropagate and change our direction to give the highest reward to the consumer. so if i were a social media company, the reward for me is the action done most often in my app. what's that action? what precedes that action? can i do stuff to make it easier to get to that action?

that's all for today. we have a long way to go.
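one peek inside the optimizer black box before we stop: at its simplest it is just gradient descent, nudging each weight in the direction that shrinks the loss. everything in this sketch (the one-weight "network", the input feature, the learning rate) is made up for illustration.

```python
# a toy optimizer: gradient descent on a single weight.
# the "network" is just prediction = w * x, and the target
# is treated as a fixed bellman target.
w = 0.1        # a single weight, starting off dumb
lr = 0.01      # learning rate: how big each nudge is
x = 2.0        # some made-up input feature of (state, action)
target = 3.97  # a bellman target, held constant here

for _ in range(200):
    prediction = w * x                    # the "network" forward pass
    loss = (prediction - target) ** 2
    grad = 2 * (prediction - target) * x  # d(loss)/dw by the chain rule
    w -= lr * grad                        # the optimizer step

print(round(w * x, 2))  # 3.97: the prediction has converged to the target
```

a real optimizer does this same nudge for millions of weights at once, with the gradients computed by backpropagation instead of by hand.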