Vision Transformers (ViTs) By: ML@B Edu Team
Motivation
● Transformers work well for text → what happens if we use them on images?
● Transformers have some nice properties that could be useful for computer vision
  ○ ex. scalability, global receptive fields
Recall: Transformer Architecture
Start with a text string:
1. → text tokens
2. → text embedding vectors (via embedding dictionary)
3. → text + position embedding vectors
4. → stack of transformer layers (self-attention + normalization + residual connections + MLP blocks)
5. → CLS token
6. → attach classification head and do prediction, etc.
Commonly trained with a self-supervised objective (ex. next-token prediction)
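To make the pipeline concrete, here is a minimal PyTorch sketch of steps 1-6 for text classification. Every name and number here (vocab size, dimensions, hyperparameters) is a placeholder assumption rather than the recipe of any specific model:

```python
import torch
import torch.nn as nn

class TextTransformerClassifier(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=8,
                 n_layers=6, max_len=512, n_classes=2):
        super().__init__()
        # Step 2: embedding dictionary maps token ids to vectors
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Step 3: learned position embeddings (one extra slot for CLS)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)
        # Step 5: a learnable CLS token prepended to every sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Step 4: stack of transformer layers
        # (self-attention + LayerNorm + residuals + MLP in each layer)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Step 6: classification head reads off the CLS position
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, token_ids):           # (batch, seq_len) integer ids
        b, t = token_ids.shape
        x = self.tok_emb(token_ids)         # (batch, seq_len, d_model)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1)      # prepend CLS: (batch, t+1, d_model)
        pos = torch.arange(t + 1, device=token_ids.device)
        x = x + self.pos_emb(pos)           # add position embeddings
        x = self.encoder(x)                 # run the transformer stack
        return self.head(x[:, 0])           # predict from the CLS output

# Step 1 (tokenization) is assumed to happen upstream; fake it with random ids:
ids = torch.randint(0, 30522, (4, 16))      # batch of 4 toy "sentences"
logits = TextTransformerClassifier()(ids)   # (4, n_classes)
```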
Problem!
This whole pipeline starts with a text string:
1. → text tokens
2. → text embedding vectors (via embedding dictionary)
3. → text + position embedding vectors
4. → stack of transformer layers (self-attention + normalization + residual connections + MLP blocks)
5. → CLS token
6. → attach classification head and do prediction, etc.
Commonly trained with a self-supervised objective (ex. next-token prediction)
Images aren't text: there is no discrete vocabulary to look up in steps 1-2, and treating every pixel as a token makes the sequence enormous (see the rough cost estimate below).
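As a rough back-of-the-envelope illustration (the numbers below are assumptions for the sake of the example, not from the slides): each self-attention layer forms an n × n attention matrix, so cost grows quadratically with sequence length, and one-token-per-pixel sequences get expensive fast at ordinary image resolutions.

```python
# Rough cost of one-token-per-pixel sequences (illustrative numbers only).
text_len = 512                    # a typical text sequence length
img_len = 224 * 224               # one token per pixel of a 224x224 image

# Self-attention compute/memory scale with n^2 (the n x n attention matrix).
print(img_len)                    # 50176 tokens
print((img_len / text_len) ** 2)  # ~9604x the attention cost of the text case
```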
Naive Solution (imageGPT)
Paper: “Generative Pretraining from Pixels”
● Pixels are kinda discrete: just treat each color value like a separate word in your vocabulary!
  ○ Each pixel is commonly represented by a 24-bit value (integers in the range [0, 255] for each of the 3 color channels)
  ○ Vocab size of 2^24 = 16,777,216!
● Who needs that many colors anyway?
  ○ Use a 9-bit representation (integers in the range [0, 7] for each of the 3 color channels)
  ○ Vocab size of 8^3 = 512
● Read pixels in raster order (row by row, left to right) to get the input sequence
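Here is a minimal sketch of that tokenization, assuming the straight per-channel 3-bit quantization described above (the actual paper builds its 512-color palette by k-means clustering of RGB values, but the vocabulary-size math works out the same):

```python
import numpy as np

def image_to_tokens(img):
    """Quantize an (H, W, 3) uint8 image into a 512-word vocabulary and
    flatten it in raster order (row by row, left to right)."""
    # 3 bits per channel: map [0, 255] -> [0, 7]
    q = img.astype(np.int64) >> 5              # (H, W, 3), values in [0, 7]
    # Pack the 3 channels into a single token id in [0, 511]
    tokens = q[..., 0] * 64 + q[..., 1] * 8 + q[..., 2]
    # Row-major flatten = raster-order sequence of length H * W
    return tokens.reshape(-1)

img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)  # toy 32x32 image
seq = image_to_tokens(img)
print(seq.shape, seq.min(), seq.max())          # (1024,) with ids in [0, 511]
```

This also shows why imageGPT sticks to small resolutions: even a 32×32 image is already a 1,024-token sequence.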