[Figure: Self-attention calculation — the Query is compared with each Key via a similarity function F(Q,K), the scores are normalized with SoftMax, and the resulting weights are used to sum the Values.]
Self-Attention • Calculation process: • Step 1: calculate the similarity between the query and each key to get the weights • Step 2: normalize the weights with SoftMax • Step 3: sum the weighted values to get the hidden state
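To make these three steps concrete, here is a minimal NumPy sketch. It assumes the dot product as the similarity function F(Q, K); the slide leaves the exact choice of F open, and all variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Step 2: normalize the scores so they sum to 1
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Three-step attention as described on the slide.

    query:  (d,)      a single query vector
    keys:   (n, d)    one key per source position
    values: (n, d_v)  one value per source position
    """
    # Step 1: similarity F(Q, K) between the query and each key
    # (dot product is one common choice for F)
    scores = keys @ query            # shape (n,)
    # Step 2: SoftMax turns the scores into normalized weights
    weights = softmax(scores)        # shape (n,)
    # Step 3: weighted sum of the values gives the hidden state
    return weights @ values          # shape (d_v,)

# toy example: 4 keys/values of dimension 3
rng = np.random.default_rng(0)
q = rng.normal(size=3)
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
print(attention(q, K, V))
```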
Outline: 1. Self-attention 2. Transformer 3. Pre-training LM
Transformer • As of last week: recurrent models for (most) NLP! • Circa 2016, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM (for example, the source sentence in a translation). • Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate (decode) it. • Use attention to allow flexible access to memory.
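As a rough illustration of that 2016-style recipe, the PyTorch sketch below wires together a bidirectional-LSTM encoder, a single attention-equipped LSTM decoder step, and a projection-based dot-product score. The dimensions, class names, and scoring function are illustrative assumptions, not the lecture's exact model.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encode the source sentence with a bidirectional LSTM."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_ids):
        # outputs: (batch, src_len, 2 * hidden_dim) — the "memory" attention will read
        outputs, _ = self.lstm(self.embed(src_ids))
        return outputs

class AttentionDecoderStep(nn.Module):
    """One decoding step: attend over the encoder memory, then update an LSTM cell."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + 2 * hidden_dim, hidden_dim)
        self.attn_proj = nn.Linear(hidden_dim, 2 * hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, state, memory):
        h, c = state
        # attention: score each encoder position against the current decoder state
        scores = torch.bmm(memory, self.attn_proj(h).unsqueeze(-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                   # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        # feed the previous token embedding plus the context into the LSTM cell
        h, c = self.cell(torch.cat([self.embed(prev_token), context], dim=-1), (h, c))
        return self.out(h), (h, c)

# toy usage with random token ids
vocab = 100
enc, dec = BiLSTMEncoder(vocab), AttentionDecoderStep(vocab)
src = torch.randint(0, vocab, (2, 7))                             # batch of 2, length 7
memory = enc(src)
state = (torch.zeros(2, 64), torch.zeros(2, 64))
logits, state = dec(torch.zeros(2, dtype=torch.long), state, memory)
print(logits.shape)   # (2, vocab)
```

At each decoding step the attention weights let the decoder read any encoder position directly, which is the "flexible access to memory" the slide refers to.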
Transformer • Today: same goals, different building blocks • Last week, we learned about sequence-to-sequence problems and encoder-decoder models