if you are reading this, you will learn:
1. why image models suck at rendering text
2. why colors get mixed up between objects in image models
3. if text models have gotten so good, why haven't image models
4. what training looks like for image models
let's jump right into it:
what's the fundamental difference between text and images?
text is 1 dimensional while images are 2 dimensional. so in text, the last 3 tokens can help you determine what the next token will be. but in an image, what do the "last 3 tokens" even mean?
text is kinda like an array of numbers that follows a pattern. if you know the previous 3 numbers, you can figure out the pattern and predict the next one.
images are like graphs. if you had to find the minimum time for an orange to rot, you'd need to know how long each of its connected neighbors takes to rot first.
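that rotting-orange setup is the classic grid bfs problem: an orange's rot time depends on all of its neighbors, so you expand from every rotten orange at once instead of scanning in one direction. a quick toy sketch:

```python
from collections import deque

def minutes_to_rot(grid):
    """0 = empty, 1 = fresh orange, 2 = rotten orange.
    each minute, rot spreads to the 4 neighboring cells.
    returns the minutes until everything rots, or -1 if impossible."""
    rows, cols = len(grid), len(grid[0])
    q = deque()
    fresh = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 2:
                q.append((r, c, 0))   # all rotten oranges start the bfs
            elif grid[r][c] == 1:
                fresh += 1
    minutes = 0
    while q:
        r, c, t = q.popleft()
        minutes = max(minutes, t)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 1:
                grid[nr][nc] = 2
                fresh -= 1
                q.append((nr, nc, t + 1))
    return minutes if fresh == 0 else -1

print(minutes_to_rot([[2, 1, 1], [1, 1, 0], [0, 1, 1]]))  # 4
```

no cell's answer can be computed from "the previous 3 cells" alone — it depends on the whole neighborhood. that's the 2d-ness that breaks plain left-to-right prediction.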
that's roughly the intuition diffusion is built on. instead of predicting pixels one after the other, diffusion refines every pixel of the image together, a little at a time, so each patch can stay consistent with the patches around it. this results in a much better image than predicting pixels one by one.
image pixels are very closely related to their neighboring pixels. you need neighboring consistency + global consistency.
wait that adds absolutely no value to your understanding. idk why i said that. hmm maybe im confused about something?
if image pixels are so closely related to their neighbors, why can't we do autoregression like in text from multiple directions at once and estimate the pixels in between?
can't we start in corners and move towards the center? why doesn't that work?
idk, im waiting for your answer.
so we understood the concept of diffusion, but how do image models actually train?
let's say you had an image. you add a small amount of noise to it (gaussian noise cus i want to sound fancy). now you have an image that is slightly noisy.
what happens if you keep doing that over and over and over again?
you would end up with pure noise.
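you can watch that happen numerically. a toy sketch with a 1-d "image" and a made-up per-step noise fraction:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=1000)  # stand-in for image pixels
x = x0.copy()

beta = 0.05  # made-up per-step noise fraction
for _ in range(500):
    eps = rng.standard_normal(x.shape)
    # mix in a little gaussian noise, shrinking the signal slightly
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

# after 500 steps the surviving signal fraction is sqrt(0.95)**500,
# about 3e-6: correlation with the original image is basically gone
corr = np.corrcoef(x0, x)[0, 1]
print(abs(corr))
```

each step keeps the total variance roughly constant but swaps a bit of signal for noise, which is why the end state is indistinguishable from pure noise.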
can you reverse it?
um, sure? if i know the noise that was added, i can just subtract it and get the previous state back. do it n times and i have the image back.
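that subtraction intuition actually checks out. a toy numpy check (the "image" is a random vector, and alpha_bar, the surviving signal fraction, is a made-up number):

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.uniform(-1.0, 1.0, size=64)  # stand-in for a clean image
alpha_bar = 0.3                        # made-up surviving-signal fraction

# forward step: mix the image with gaussian noise that we know
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# reverse: since we know eps, invert the formula and get x0 back
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)
print(np.allclose(x0_hat, x0))  # True
```

the catch is that at generation time nobody hands you eps — predicting it is exactly the model's job.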
that's exactly how training works.
we try to estimate the noise that was added, and adjust the parameters based on the loss between the actual noise (which we know, since we added it ourselves in the forward pass) and the noise the model predicted.
that's the simple math behind it.
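that loss fits in a few lines of numpy. everything here is a stand-in: a random vector for the image, a made-up alpha_bar noise level, and a dummy "model" that just guesses zeros — the point is only where the mse sits:

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.uniform(-1.0, 1.0, size=64)  # stand-in for a clean image
alpha_bar = 0.3                        # made-up cumulative noise level

# forward pass: we pick the noise ourselves, so the target is known
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# the model sees x_t (plus the timestep and the prompt) and predicts
# eps. this dummy "model" just guesses zeros.
eps_pred = np.zeros_like(eps)

# training loss: mean squared error between true and predicted noise
loss = float(np.mean((eps - eps_pred) ** 2))
print(loss)
```

in a real model eps_pred comes from a big network (typically a u-net or transformer), and the gradient of this loss is what nudges its parameters.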
how do we denoise the image? during training, we have a prompt that describes the clean image, so we feed that same prompt in alongside the noised image and backpropagate the loss on each pass. at generation time, the model starts from pure noise and removes a little of its predicted noise at every step, guided by the prompt.
but how does the image understand the text?
🥁🥁🥁🥁 drumrolls 🥁🥁🥁🥁🥁
crossssss attentionnnnnnn
before cross attention, what does self attention mean?
self attention basically means how much a specific token is related to every other token in the same modality (image patches attending to image patches, or text tokens to text tokens)
damn thats the most jargony line i have written here
so self attention for the image would say: hey, the prompt said to draw a human face, so the patch attending to the left eye notices the right eye looks similar and its pixels need to match.
cross attention, however, defines what the object should be, i.e. that a face should contain 2 eyes.
think of it like a painting.
when you are drawing a scenery and you want to put that half sun between the mountains, you first
identify where the mountains are, and then you say: oh, i also need to put the sun there. this is cross attention, the text determining what the object is supposed to be and where.
self attention would then determine how the sun is shaped, and how the pixels inside the sun relate to the pixels outside it (and to each other).
cross attention puts semantic guidance from the prompt into the model as a sort of help in denoising the image. it does not draw pixels.
pixels are created by self attention. which has no understanding of letters inside, it is just trying to match the pixels with one another.
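the two flavors reduce to the same scaled dot-product attention; the only thing that changes is where the keys and values come from. a minimal numpy sketch — token counts and dimensions are arbitrary, and i'm skipping the learned projection matrices real models use:

```python
import numpy as np

def attention(Q, K, V):
    """scaled dot-product attention, shared by both flavors."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over the keys
    return w @ V

rng = np.random.default_rng(3)
d = 8
img_tokens = rng.standard_normal((16, d))  # 16 image patches
txt_tokens = rng.standard_normal((4, d))   # 4 prompt tokens

# self attention: image patches attend to other image patches
self_out = attention(img_tokens, img_tokens, img_tokens)

# cross attention: image patches query the prompt tokens, pulling
# the text's meaning into every patch
cross_out = attention(img_tokens, txt_tokens, txt_tokens)

print(self_out.shape, cross_out.shape)  # (16, 8) (16, 8)
```

either way, each image patch comes out as a weighted mix of whatever it attended to — its fellow patches in self attention, the prompt in cross attention.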
that's exactly why text rendering inside image models sucks, even in nano banana pro. the model doesn't inherently understand text, it's just creating one pixel from the neighboring pixels.
text requires finesse: the right dot in the right place, the right structure and the right connections. honestly, that's too much for a guy who doesn't understand text at all.
that's roughly the only shortcoming left, though. if you told an earlier model that was just using plain diffusion to draw a red dog and a green cat, it would prolly do a shabby job
here's why
when the model starts from complete noise, how do you even determine which region attends to what? instead of the left half attending to the dog and the right half to the cat, diffusion would have them attend equally.
wait im confused. why doesn't the text help in that? i mean the prompt
oh, there is also one more shortcoming. you said a red dog and a green cat, but the text embeddings would place cat and dog close together (both animals) and green and red close together (both colors)
im talking about the representation of embeddings here.
so you can kinda imagine the dog coming out looking like the cat, and the green looking like red, because the model can't really make out the difference.
that's all in the past though.
we have, as always, found solutions to such problems. one such solution is to first globally denoise the image.
give it structure: here goes the cat, here goes the dog, that's it. don't draw it yet.
a second (local) denoising pass would then actually generate the objects.
another fix is to use a better text embedding, one that doesn't just club similar nouns like animals or colors together.
i have only read about these 2. happy to learn more.
huh that's it. for today.
you can just learn things. with loopholes in understanding but thanks chatgpt i guess :)