Transformer | Today: Same goals, different building blocks
• Last week, we learned about sequence-to-sequence problems and encoder-decoder models.
• Today, we're not trying to motivate entirely new ways of looking at problems.
• Instead, we're trying to find the best building blocks to plug into our models and enable broad progress.
[Timeline figure: 2014-2017, recurrence and lots of trial and error; 2021, ??????]
Transformer | Issues with recurrent models: Linear interaction distance
• RNNs are unrolled "left-to-right".
• This encodes linear locality: a useful heuristic!
• Nearby words often affect each other's meanings ("tasty pizza").
Transformer | Issues with recurrent models: Linear interaction distance
• RNNs are unrolled "left-to-right".
• This encodes linear locality: a useful heuristic!
• Nearby words often affect each other's meanings ("tasty pizza").
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact.
[Figure: "The chef who ... was"; "chef" and "was" are O(sequence length) steps apart.]
Transformer | Issues with recurrent models: Linear interaction distance
• O(sequence length) steps for distant word pairs to interact means:
  • Hard to learn long-distance dependencies (because gradient problems!)
[Figure: "The chef who ... was"]
Transformer | Issues with recurrent models: Linear interaction distance
• O(sequence length) steps for distant word pairs to interact means:
  • Hard to learn long-distance dependencies (because gradient problems!)
  • Linear order of words is "baked in"; we already know linear order isn't the right way to think about sentences.
[Figure: "The chef who ... was"; info about "chef" has gone through O(sequence length) many layers!]
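As a concrete illustration of the gradient problem (a sketch under assumed settings, not material from the slides): if we backpropagate from the last hidden state of a vanilla RNN, the gradient with respect to early input positions typically shrinks, because the signal has to travel back through O(sequence length) tanh/matrix-multiply steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sketch (assumed sizes): measure how strongly the final hidden state
# of a vanilla RNN still "feels" each input position, via gradient norms.
seq_len, d = 40, 32
cell = nn.RNNCell(d, d)

x = torch.randn(seq_len, d, requires_grad=True)
h = torch.zeros(1, d)
for t in range(seq_len):
    h = cell(x[t].unsqueeze(0), h)

# Gradient of the last hidden state with respect to every input position.
h.sum().backward()
grad_norms = x.grad.norm(dim=1)

# Typically the earliest positions receive far smaller gradients than the latest ones:
# the learning signal had to pass back through O(sequence length) recurrent steps.
print("grad norm at t=0:        ", grad_norms[0].item())
print("grad norm at t=seq_len-1:", grad_norms[-1].item())
```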