Pre-training LM
Pretraining decoders
It's natural to pretrain decoders as language models and then use them as generators, finetuning their $p_\theta(w_t \mid w_{1:t-1})$!
This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)
$h_1, \dots, h_T = \mathrm{Decoder}(w_1, \dots, w_T)$
$w_t \sim A h_{t-1} + b$
Where $A$, $b$ were pretrained in the language model!
[Note how the linear layer has been pretrained.]
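A minimal PyTorch sketch of this idea, assuming a toy decoder-only Transformer: the class name DecoderLM, the hyperparameters, and the checkpoint path "pretrained_lm.pt" are illustrative assumptions, not the setup from the slides. The point it shows is that the linear output layer (the $A$, $b$ above) is part of the pretrained language model and is simply kept and finetuned when the decoder is reused as a generator.

```python
# Sketch (not from the slides): a decoder-only LM whose output layer A, b
# (lm_head) is pretrained with the language model and reused during finetuning.
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # An encoder stack with a causal mask behaves as a decoder-only LM
        # (self-attention only, no cross-attention).
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        # Linear output layer: logits_t = A h_{t-1} + b; A, b are learned
        # during LM pretraining and kept for downstream generation.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, T) of token ids
        T = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(self.embed(tokens), mask=causal_mask)  # h_1, ..., h_T
        return self.lm_head(h)  # logits for the next token at every position

# Finetuning as a generator (e.g. summarization: context = document):
# load the pretrained weights, including lm_head, and keep training with
# the same next-token loss on (context, target) sequences.
model = DecoderLM()
# model.load_state_dict(torch.load("pretrained_lm.pt"))  # hypothetical checkpoint
tokens = torch.randint(0, 32000, (2, 16))  # toy batch of token ids
logits = model(tokens[:, :-1])             # predict w_2, ..., w_T
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
loss.backward()  # finetune all parameters, A and b included
```

Because the objective and the output vocabulary are unchanged from pretraining, no new layers are needed for tasks like dialogue or summarization; only the training data switches to (context, target) sequences.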
Outline
1. Pre-training LM
2. GPT
3. BERT
4. T5
Pre-training LM
Generative Pretrained Transformer (GPT)
2018's GPT was a big success in pretraining a decoder!
Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training. 2018.