Research context: General attention versus self-attention

[Figure: two diagrams comparing a general attention layer (inputs x0, x1, x2; outputs y0, y1, y2; queries supplied from outside the layer) with a self-attention layer, in which the queries q0..q2, keys k0..k2, and values v0..v2 are all computed from the same inputs x0, x1, x2.]
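To make the distinction concrete, here is a minimal PyTorch sketch of self-attention, where queries, keys, and values are all projections of the same input set x; general attention differs only in that the queries are supplied externally (e.g. by a decoder). This is an illustrative sketch with a single head and placeholder names (SelfAttention, q_proj, k_proj, v_proj), not code from the slide.

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # In self-attention, q, k, and v all come from the SAME inputs x.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (N, dim), a set of input vectors
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.T / math.sqrt(x.shape[-1]), dim=-1)  # (N, N) weights
        return attn @ v                         # y: (N, dim), one output per input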
Research context: The Transformer encoder block

Transformer Encoder Block:
- Inputs: a set of vectors x. Outputs: a set of vectors y.
- Self-attention is the only interaction between vectors.
- Layer norm and MLP operate independently per vector.
- Highly scalable, highly parallelizable, but high memory usage.

[Figure: left, the Transformer encoder mapping input vectors z0,0..z2,2 to context vectors c0,0..c2,2; right, one encoder block (stacked x N): positional encoding, multi-head self-attention with a residual connection and layer norm, then an MLP with a residual connection and layer norm, mapping x0..x3 to y0..y3.]

Vaswani et al., "Attention is all you need", NeurIPS 2017
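A short PyTorch sketch of one encoder block matching the bullet points above; the post-norm layout follows Vaswani et al. 2017, and the dimensions are placeholder assumptions.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(),
                                 nn.Linear(mlp_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, dim), a set of N vectors
        # Multi-head self-attention is the only interaction between the N vectors.
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                   # residual connection + layer norm
        # Layer norm and the MLP operate on each vector independently.
        x = self.norm2(x + self.mlp(x))
        return x                                # y: (B, N, dim)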
Research context: The Transformer decoder block

Transformer Decoder Block:
- Inputs: a set of vectors x and a set of context vectors c. Outputs: a set of vectors y.
- Masked self-attention only interacts with past inputs.
- The multi-head attention block is NOT self-attention: it attends over the encoder outputs.
- Highly scalable, highly parallelizable, but high memory usage.

[Figure: left, the Transformer decoder consuming context vectors c0,0..c2,2 and the shifted caption "[START] person wearing hat" to produce "person wearing hat [END]"; right, one decoder block (stacked x N): positional encoding, masked multi-head self-attention, multi-head attention with queries q from the decoder and keys k / values v from the context vectors c, followed by layer norms, an MLP, and a final FC layer, mapping x0..x3 to y0..y3.]

Vaswani et al., "Attention is all you need", NeurIPS 2017
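A matching PyTorch sketch of one decoder block: the causal mask implements "attend only to past inputs", and the second attention is cross-attention over the context vectors c. The post-norm layout and sizes are assumptions, not details given on the slide.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, x, c):          # x: (B, T, dim) target tokens, c: (B, N, dim) encoder outputs
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        # Masked self-attention: position t only attends to positions <= t.
        a, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + a)
        # NOT self-attention: queries come from the decoder, keys/values from c.
        a, _ = self.cross_attn(x, c, c)
        x = self.norm2(x + a)
        x = self.norm3(x + self.mlp(x))
        return x                      # y: (B, T, dim)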
Research context: Example: CNN with Self-Attention

[Figure: an input image (cat photo, free to use under the Pixabay License) is passed through a CNN to give features of size C x H x W; three 1x1 convolutions produce queries, keys, and values of size C' x H x W; a transpose and softmax over the query-key products give attention weights of size (H x W) x (H x W), which are applied to the values and projected by a final 1x1 convolution back to C x H x W.]
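The same diagram as a PyTorch sketch (assumption: the attended values are mapped back to C channels by the final 1x1 convolution and added to the input as a residual, as in standard self-attention blocks for CNNs; C and C' are placeholders).

import torch
import torch.nn as nn

class CNNSelfAttention(nn.Module):
    def __init__(self, C, C_prime):
        super().__init__()
        self.to_q = nn.Conv2d(C, C_prime, kernel_size=1)   # Queries: C' x H x W
        self.to_k = nn.Conv2d(C, C_prime, kernel_size=1)   # Keys:    C' x H x W
        self.to_v = nn.Conv2d(C, C_prime, kernel_size=1)   # Values:  C' x H x W
        self.proj = nn.Conv2d(C_prime, C, kernel_size=1)   # back to C x H x W

    def forward(self, feats):                   # feats: (B, C, H, W) CNN features
        B, C, H, W = feats.shape
        q = self.to_q(feats).flatten(2)         # (B, C', H*W)
        k = self.to_k(feats).flatten(2)
        v = self.to_v(feats).flatten(2)
        # Attention weights between every pair of the H*W spatial positions.
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)      # (B, H*W, H*W)
        out = (v @ attn.transpose(1, 2)).reshape(B, -1, H, W)    # (B, C', H, W)
        return feats + self.proj(out)           # residual back to C x H x W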
Research context: Example: Image Captioning using transformers

Extract spatial features from a pretrained CNN (features: H x W x D).

[Figure: the grid of CNN features z0,0..z2,2 is fed to a Transformer encoder; the resulting context vectors c0,0..c2,2 condition a Transformer decoder, which takes "[START] person wearing hat" as input (y0..y3) and outputs "person wearing hat [END]" (y1..y4).]
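A rough end-to-end sketch of this captioning pipeline, assuming a torchvision ResNet-50 as the pretrained CNN, PyTorch's built-in Transformer layers, and made-up sizes; positional encodings are omitted for brevity, and the actual model details are not given on the slide.

import torch
import torch.nn as nn
import torchvision

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        cnn = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # spatial features, D = 2048
        self.project = nn.Linear(2048, dim)                         # D -> model dimension
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, images, tokens):          # tokens begin with [START]
        feats = self.backbone(images)                               # (B, 2048, H, W)
        z = self.project(feats.flatten(2).transpose(1, 2))          # (B, H*W, dim) grid vectors z_{i,j}
        c = self.encoder(z)                                         # context vectors c_{i,j}
        T = tokens.shape[1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), 1)
        y = self.decoder(self.embed(tokens), c, tgt_mask=mask)
        return self.to_vocab(y)                 # per-position scores over the vocabulary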