Many Different Choices …
• Longformer (https://arxiv.org/abs/2004.05150): (b) sliding window attention, (c) dilated sliding window, (d) global + sliding window
• Big Bird (https://arxiv.org/abs/2007.14062): (a) random attention, (b) window attention, (c) global attention, (d) BIGBIRD
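The attention patterns listed above can each be written as a boolean mask over (query, key) positions. The following is a minimal sketch, not the Longformer or Big Bird implementation: the function names, the one-sided window size `w`, and the per-query random count `r` are assumptions made for illustration.

```python
import random

def window_mask(n, w, dilation=1):
    """Sliding window: query i attends to keys within w steps of i.
    With dilation > 1, only every dilation-th position is kept (dilated window)."""
    return [[abs(i - j) % dilation == 0 and abs(i - j) // dilation <= w
             for j in range(n)] for i in range(n)]

def global_mask(n, global_idx):
    """Global attention: the chosen tokens attend to every position,
    and every position attends to them."""
    g = set(global_idx)
    return [[i in g or j in g for j in range(n)] for i in range(n)]

def random_mask(n, r, seed=0):
    """Random attention: each query attends to r randomly chosen keys."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), r):
            mask[i][j] = True
    return mask

def union(*masks):
    """Combine patterns: a pair is kept if any of the masks keeps it."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]

# Global + sliding window, in the spirit of Longformer panel (d).
longformer_like = union(window_mask(8, 1), global_mask(8, [0]))
# Random + window + global, in the spirit of Big Bird panel (d).
bigbird_like = union(random_mask(8, 2), window_mask(8, 1), global_mask(8, [0]))
```

Entries that are `False` in a mask are the attention weights that never need to be computed.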
Can we only focus on Critical Parts?
[Figure: query-key attention matrix with large values and small values]
• Small values: directly set to 0
• Smaller influence on results
How to quickly estimate the portion with small attention weights?
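The claim that small attention weights have little influence on the result can be checked on a toy example. This is a minimal sketch with made-up scores, scalar values, and a hypothetical `threshold` of 0.01. It computes every weight first, so it saves no work by itself; it only illustrates why estimating the small entries in advance is attractive.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    """Weighted sum of (scalar) values."""
    return sum(w * v for w, v in zip(weights, values))

def prune(weights, threshold):
    """Directly set weights below the threshold to 0."""
    return [w if w >= threshold else 0.0 for w in weights]

# Illustrative numbers only: two large scores, two small ones.
scores = [4.0, 3.5, -2.0, -3.0]
values = [1.0, 2.0, 3.0, 4.0]

weights = softmax(scores)
full = attend(weights, values)                   # uses all four weights
pruned = attend(prune(weights, 0.01), values)    # small weights set to 0

print(abs(full - pruned))  # the gap stays small
```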
Clustering
• Reformer: https://openreview.net/forum?id=rkgNKkHtvB
• Routing Transformer: https://arxiv.org/abs/2003.05997
Step 1: Cluster the queries and keys based on similarity (approximate & fast)
[Figure: each query and key is labeled with a cluster ID from 1 to 4]
Clustering
Step 2:
• If a query and a key belong to the same cluster, calculate the attention weight
• If they are not in the same cluster, set the attention weight to 0
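The two clustering steps can be sketched as follows. This is a rough illustration under stated assumptions, not the Reformer or Routing Transformer method: `hash_cluster` is a hypothetical stand-in that assigns cluster IDs from the signs of a few random projections, and normalizing the softmax over in-cluster keys only is also an assumption.

```python
import math
import random

def hash_cluster(vectors, n_planes=2, seed=0):
    """Step 1 (approximate & fast): one bit per random hyperplane,
    so similar vectors tend to receive the same cluster ID."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    ids = []
    for v in vectors:
        bits = 0
        for p in planes:
            side = sum(a * b for a, b in zip(v, p)) >= 0
            bits = (bits << 1) | int(side)
        ids.append(bits)
    return ids

def clustered_attention(queries, keys, values, q_ids, k_ids):
    """Step 2: compute attention weights only for (query, key) pairs in the
    same cluster; all other weights are treated as 0 and never computed."""
    dim_v = len(values[0])
    outputs = []
    for q, qc in zip(queries, q_ids):
        idx = [j for j, kc in enumerate(k_ids) if kc == qc]
        if not idx:  # no key shares this query's cluster
            outputs.append([0.0] * dim_v)
            continue
        scores = [sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        outputs.append([sum(e / z * values[j][d] for e, j in zip(exps, idx))
                        for d in range(dim_v)])
    return outputs

# Queries and keys must be hashed with the same seed so they share hyperplanes.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [3.0], [5.0]]
out = clustered_attention(queries, keys, values,
                          hash_cluster(queries), hash_cluster(keys))
```

With cluster IDs in hand, the cost per query depends on the size of its cluster rather than on the full sequence length.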