if you think about it, chatgpt is such an incredibly fascinating tool. it can magically answer any and every question that you and i have.
but how does chatgpt answer all the questions in the big round world? is it just remembering the answers? what really happens when they say they trained a new model? what's training?
wait? what's a model?
in simple terms, a language model is a program that looks at the text you give it, figures out which parts matter, and then guesses the next piece of text based on patterns it learned.
and how did it learn patterns? training.
but what's training? training basically consists of 2 parts:
1. letting the model predict the next token for a set of tokens and computing a loss
2. propagating that loss back to every cog in the system that helped predict that token
sounds p simple right?
let's look at it in a little more detail.
you have a token, "cat", which is represented by a vector of dimension d.
dimension d is a very important number in training. increasing the dimension increases the cost roughly quadratically, but it also lets the model learn much more nuanced patterns such as world knowledge, code relationships etc.
we will see in a bit why the cost goes up quadratically.
for now, an embedding is a vector representation of a token in dimension d. it is the sum of the token's normal embedding and its positional embedding:
x_t = token_embedding[tok_id] + positional_embedding[t]
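here's a minimal numpy sketch of that lookup, with made-up toy sizes (real models use d in the thousands, and both tables are learned during training, not random):

```python
import numpy as np

np.random.seed(0)
vocab_size, seq_len, d = 10, 6, 8          # toy sizes for illustration only

# hypothetical toy tables; in a real model these are learned weights
token_embedding = np.random.randn(vocab_size, d)
positional_embedding = np.random.randn(seq_len, d)

def embed(tok_id, t):
    # x_t = token_embedding[tok_id] + positional_embedding[t]
    return token_embedding[tok_id] + positional_embedding[t]

x = embed(tok_id=3, t=0)
print(x.shape)   # (8,): one vector of dimension d per token
```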
this also acts as the input to the first transformer block. let's see what happens in each of the transformer blocks.
just after entering, the vector goes through a layer normalization step which basically just stabilizes it. but what do we mean by "stabilize"?
it rescales every token vector to have mean 0 and variance 1, so the values end up spread evenly around 0. what this does is keep any value from being crazy high (leading to gradient explosion (i love this term)) or crazy low (vanishing gradients).
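a bare-bones sketch of that normalization (real layer norm also has a learned scale and shift, which i'm leaving out here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # rescale the vector to mean ~0 and variance ~1
    return (x - x.mean()) / np.sqrt(x.var() + eps)

h = layer_norm(np.array([100.0, 0.01, -3.0, 7.0]))
print(h.mean(), h.var())  # roughly 0 and roughly 1
```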
now we have 3 weight matrices:
W_Q: what am i looking for? Q[i] = h[i] x W_Q, where h[i] is the output of layer norm
W_K: what do i contain? K[i] = h[i] x W_K, similarly
W_V: what information do i send forward? V[i] = h[i] x W_V
these definitions purely describe what multiplying a token vector with that specific weight matrix gives you.
a similarity score between 2 tokens is calculated by taking the dot product of Q[i] with K[j] and then dividing it by the square root of d_k, where d_k = dimensions / number of attention heads.
attention heads are a fairly complicated topic by themselves, let's look at that some other day. for now just assume there are 4 of them, doing the process i describe below in parallel. each attention head is like a different "expert lens" that looks at the sequence in its own specialized way.
why do we divide by sqrt(d_k) though?
so we can make sure the scores don't blow up and saturate the softmax, which is the next step in the process. softmax basically converts the scores to probabilities.
its formula is actually exponential, not complicated: raise e to each score, then divide by the sum so everything adds up to 1. the result tells you how much token i focuses on token j.
if we had the set of tokens "a cat sat on a mat", then for the token "mat" the highest softmax probability might go to the token "sat", the second highest to "cat", the third highest to "on", and so on.
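the softmax step can be sketched like this (toy scores, not from a real model; subtracting the max before exponentiating is a standard numerical-stability trick):

```python
import numpy as np

def softmax(scores):
    # exponentiate each score (minus the max, for numerical stability), then normalize
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities that sum to 1; biggest score gets biggest probability
```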
now we have weights[i, j], which is the softmax of score[i, j]. now we compute the attention output: for a specific i, we sum over all j the product weights[i, j] x V[j] (the same V[j] we get from h[j] x W_V).
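putting the whole head together, here's a toy numpy sketch of one attention head (random stand-in weights, sizes invented for illustration, and no causal masking):

```python
import numpy as np

np.random.seed(0)
seq_len, d, d_k = 4, 8, 8              # toy sizes; with a single head, d_k = d

h = np.random.randn(seq_len, d)        # layer-norm output, one row per token
W_Q = np.random.randn(d, d_k)
W_K = np.random.randn(d, d_k)
W_V = np.random.randn(d, d_k)

Q, K, V = h @ W_Q, h @ W_K, h @ W_V
scores = Q @ K.T / np.sqrt(d_k)        # score[i, j] = Q[i] . K[j] / sqrt(d_k)

# softmax each row so weights[i] sums to 1
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

y = weights @ V                        # y[i] = sum over j of weights[i, j] * V[j]
print(y.shape)                         # (4, 8): one output vector per token
```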
Imagine a meeting.
Q[i] = What you want to learn
K[j] = What each person knows
V[j] = What each person actually says
softmax(weights) = How much attention you give to each speaker
you don't add up "what people want" or "what knowledge they hold."
you add up what they say (V) weighted by how much you listen (weights).
that is EXACTLY what attention does. (this analogy is by chatgpt btw, and i found it really easy to understand.)
now we concatenate all such results from all our attention heads into a single vector, [y[i]^1, y[i]^2, y[i]^3, y[i]^4] (in our case we have assumed 4 heads).
now finally we multiply this with W_O, a learned matrix which kinda unifies all the attention heads and, during backpropagation (story for another day), effectively decides how valuable each head is.
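here's how the concatenation and mixing look in toy numpy (the per-head outputs are random stand-ins and the sizes are invented):

```python
import numpy as np

np.random.seed(0)
seq_len, num_heads, d_k = 4, 4, 2
d = num_heads * d_k                              # 8

# stand-ins for the per-head outputs y[i]^1 .. y[i]^4, each of width d_k
head_outputs = [np.random.randn(seq_len, d_k) for _ in range(num_heads)]

concat = np.concatenate(head_outputs, axis=-1)   # back to (seq_len, d)
W_O = np.random.randn(d, d)                      # learned mixing matrix
z = concat @ W_O
print(z.shape)                                   # (4, 8)
```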
now we have z[i] after multiplying with W_O. then we finally add the original x[i] back (x_t = token_embedding[tok_id] + positional_embedding[t]). this residual connection fights vanishing and exploding gradients: if you keep multiplying small numbers they vanish, and if you keep multiplying large numbers they explode, so the addition gives gradients a direct path back to the input.
again we layer normalize the output of this layer and feed it to a feed-forward block with a GELU non-linearity, which lets us extract more features from our output. it first multiplies by a larger matrix, so the dimensions are expanded and we have more space to learn new features such as "oh, is this a noun? is this part of a verb?" (how? in short, the expanded space has room for these patterns to separate out), then applies GELU, and finally a second matrix downsizes the result back to dimension d, with those features now much more prominent.
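a minimal sketch of that expand-then-shrink block (random stand-in weights; the 4x expansion is the conventional choice, and this gelu is the usual tanh approximation):

```python
import numpy as np

np.random.seed(0)
d = 8
W_up = np.random.randn(d, 4 * d)     # expand: more room for features
W_down = np.random.randn(4 * d, d)   # project back down to d

def gelu(x):
    # tanh approximation of GELU, the standard transformer non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(h):
    return gelu(h @ W_up) @ W_down

out = mlp(np.random.randn(d))
print(out.shape)                     # (8,): same dimension in, same dimension out
```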
now we have the final h[i] from this one layer/transformer block. we get the same thing from each of the other layers (say 200 of them in a big model). then we get the logits (a vector of dimension vocab size) using h[i] x W (the unembedding matrix), and then we take the softmax to create probabilities.
and loss is just -log(probability(c)), where c is the index of the true next token, so it's a single positive number that shrinks as the model gets more confident in the right answer. the gradient at the logits is the probabilities minus a one-hot on the true token, so it looks something like [0.5, 0.3, -0.8]: every wrong token gets pushed down and the true (3rd) token gets pushed higher for the current set of tokens. that's how models improve.
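here's that loss and its gradient on a toy 3-token vocabulary (the numbers are made up):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.array([1.0, 0.5, 3.0])   # toy vocab of 3 tokens
probs = softmax(logits)
c = 2                                # index of the true next token
loss = -np.log(probs[c])             # a single positive number

# gradient of the loss w.r.t. the logits: probabilities minus one-hot of the true token
grad = probs.copy()
grad[c] -= 1.0
print(loss, grad)                    # the true token's entry is negative: it gets pushed up
```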
the exact backprop math, let's see in the next conversation.
really appreciate anyone reading this until here, say hi in the comments btw and ask me what you found confusing!