Pretraining Language Models with three architectures

Pretraining for three types of architectures: the neural architecture influences the type of pretraining, and the natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words (see the attention-mask sketch below).

Encoders
• Get bidirectional context – can condition on the future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?
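A minimal sketch (not from the slides) of the decoder/encoder distinction in the list above: the only difference here is whether the attention mask hides future positions. It assumes PyTorch ≥ 2.0 for scaled_dot_product_attention; the tensor sizes and variable names are illustrative.

```python
import torch

T, d = 5, 8                       # toy sequence length and hidden size
q = k = v = torch.randn(1, T, d)  # toy query/key/value for one sequence

# Decoder (causal) mask: position t may only attend to positions <= t,
# so the model can never condition on future words.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Encoder (bidirectional) mask: every position attends to every position,
# including future ones.
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

dec_out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)
enc_out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=bidirectional_mask)
print(dec_out.shape, enc_out.shape)  # both (1, 5, 8); only the attention pattern differs
```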
Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model p(w_t | w_{1:t-1}).

We can finetune them by training a classifier on the last word's hidden state:

    h_1, ..., h_T = Decoder(w_1, ..., w_T)
    y ~ A h_T + b

where A and b are randomly initialized and specified by the downstream task. Gradients backpropagate through the whole network.

[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
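To make the recipe concrete, here is a minimal PyTorch sketch (not from the slides). ToyDecoder is just a stand-in for a real pretrained decoder, and DecoderClassifier adds the randomly initialized linear layer (A, b) on the last hidden state; all class names, sizes, and the 2-class task are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Stand-in for a pretrained decoder: maps token ids to hidden states h_1..h_T."""
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)  # left-to-right, like an LM

    def forward(self, input_ids):              # (batch, T)
        h, _ = self.rnn(self.embed(input_ids))  # (batch, T, hidden_size)
        return h

class DecoderClassifier(nn.Module):
    def __init__(self, pretrained_decoder, hidden_size, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder       # the pretrained part
        # The linear layer (A, b) is randomly initialized: it was NOT pretrained
        # and must be learned from scratch for the downstream task.
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids):
        h = self.decoder(input_ids)             # h_1, ..., h_T = Decoder(w_1, ..., w_T)
        return self.classifier(h[:, -1, :])     # y = A h_T + b

# Finetuning step: gradients backpropagate through the whole network,
# i.e. both the new classifier (A, b) and the pretrained decoder are updated.
model = DecoderClassifier(ToyDecoder(), hidden_size=64, num_classes=2)
input_ids = torch.randint(0, 1000, (8, 12))     # a batch of 8 sequences of length 12
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(input_ids), labels)
loss.backward()
```

In practice the pretrained decoder would be a Transformer language model rather than the toy GRU above, but the finetuning pattern is the same: read off the last position's hidden state and train a fresh linear head on top of it.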