GLU Variants Improve Transformer

Author: wyli

Date: 2024-04-03

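A Transformer is built by alternating multi-head attention layers and feed-forward networks (FFN). The FFN takes a vector $x$ as input and applies two linear transformations to it, with a ReLU activation between them, as shown in Eq. (1):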
$$
\begin{aligned}
\mathrm{FFN}(x,W_1,W_2,b_1,b_2)=\max(0,xW_1+b_1)W_2+b_2
\end{aligned}\tag{1}
$$
The bias-free version is
$$
\begin{aligned}
\mathrm{FFN}_{\mathrm{ReLU}}(x,W_1,W_2)=\max(0,xW_1)W_2
\end{aligned}\tag{2}
$$
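
As a rough illustration of Eq. (1) and Eq. (2), the sketch below evaluates the FFN for a single input vector using plain Python lists. The helper `vec_mat` and the toy weights are made up for this example; a real model would use a tensor library and learned weights.

```python
def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n x m), giving a length-m list."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def ffn_relu(x, W1, W2, b1=None, b2=None):
    """max(0, x W1 + b1) W2 + b2; leaving the biases as None gives Eq. (2)."""
    h = vec_mat(x, W1)
    if b1 is not None:
        h = [hi + bi for hi, bi in zip(h, b1)]
    h = [max(0.0, hi) for hi in h]          # ReLU between the two projections
    y = vec_mat(h, W2)
    if b2 is not None:
        y = [yi + bi for yi, bi in zip(y, b2)]
    return y

# Toy sizes (assumed): input dimension 2, hidden dimension 3, output dimension 1.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, 0.0]]
W2 = [[1.0], [2.0], [3.0]]

print(ffn_relu(x, W1, W2))                                # bias-free, Eq. (2)
print(ffn_relu(x, W1, W2, b1=[0.0, 0.0, 1.5], b2=[1.0]))  # with biases, Eq. (1)
```

In the bias-free call the third hidden unit is negative before the ReLU and is zeroed out, so it contributes nothing to the output.
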
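Later work replaced ReLU with other activation functions, such as GELU$(x)=x\Phi(x)$ or Swish$_{\beta}(x)=x\sigma(\beta x)$, where $\Phi$ is the standard normal cumulative distribution function and $\sigma$ is the sigmoid function.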
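
The activation functions above can be written out for scalars with only the standard library. This is a minimal sketch: `gelu` uses the exact erf form of $\Phi$ rather than the tanh approximation that some libraries use, and the function names are chosen here for illustration.

```python
import math

def relu(z: float) -> float:
    return max(0.0, z)

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gelu(z: float) -> float:
    # GELU(z) = z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z: float, beta: float = 1.0) -> float:
    # Swish_beta(z) = z * sigmoid(beta * z)
    return z * sigmoid(beta * z)

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), gelu(z), swish(z))
```

Unlike ReLU, both GELU and Swish are smooth and take small negative values for negative inputs.
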

GLU and its variants

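The Gated Linear Unit (GLU) is a neural network layer defined as the element-wise product (written $\otimes$ here; it is not a Kronecker product) of two linear transformations of the input, one of which is passed through a sigmoid. If neither transformation is activated, the layer is called bilinear.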
$$
\begin{aligned}
\mathrm{GLU}(x,W,V,b,c)&=\sigma(xW+b)\otimes(xV+c) \\
\mathrm{Bilinear}(x,W,V,b,c)&=(xW+b)\otimes(xV+c)
\end{aligned}\tag{4}
$$
Replacing the sigmoid with other activation functions gives the following GLU variants:
$$
\begin{aligned}
\mathrm{ReGLU}(x,W,V,b,c)&=\max(0,xW+b)\otimes(xV+c) \\
\mathrm{GEGLU}(x,W,V,b,c)&=\mathrm{GELU}(xW+b)\otimes(xV+c) \\
\mathrm{SwiGLU}(x,W,V,b,c,\beta)&=\mathrm{Swish}_{\beta}(xW+b)\otimes(xV+c)
\end{aligned}\tag{5}
$$
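
The layers in Eq. (4) and Eq. (5) differ only in the activation applied to the gate projection, which the sketch below makes explicit. It reads $\otimes$ as an element-wise product, so the output has the same length as each projection. The names `GATE_ACTIVATIONS`, `affine` and `gated_unit`, and the toy weights, are assumptions for illustration only.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gelu(z):
    # GELU(z) = z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

# Gate activation for each variant in Eq. (4) and Eq. (5).
GATE_ACTIVATIONS = {
    "GLU": sigmoid,
    "Bilinear": lambda z: z,
    "ReGLU": lambda z: max(0.0, z),
    "GEGLU": gelu,
    "SwiGLU": swish,  # beta is left at 1 in this sketch
}

def affine(x, W, b):
    """Row vector x (length n) times W (n x m), plus bias b (length m)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) + b[j]
            for j in range(len(b))]

def gated_unit(x, W, V, b, c, kind="GLU"):
    """act(xW + b) multiplied element-wise with (xV + c)."""
    act = GATE_ACTIVATIONS[kind]
    gate = affine(x, W, b)
    value = affine(x, V, c)
    return [act(g) * v for g, v in zip(gate, value)]

# Toy weights (assumed): W is the identity, so the gate pre-activation is x itself.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]
c = [1.0, 1.0]

for kind in GATE_ACTIVATIONS:
    print(kind, gated_unit(x, W, V, b, c, kind))
```
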
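GLU was originally proposed to address the fact that RNNs cannot be computed in parallel. Applying GLU and its variants to the Transformer FFN, again without bias terms, gives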
$$
\begin{aligned}
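\mathrm{FFN}_{\mathrm{GLU}}(x,W,V,W_2)&=(\sigma(xW)\otimes xV)W_2 \\
\mathrm{FFN}_{\mathrm{Bilinear}}(x,W,V,W_2)&=(xW\otimes xV)W_2 \\
\mathrm{FFN}_{\mathrm{ReGLU}}(x,W,V,W_2)&=(\max(0,xW)\otimes xV)W_2 \\
\mathrm{FFN}_{\mathrm{GEGLU}}(x,W,V,W_2)&=(\mathrm{GELU}(xW)\otimes xV)W_2 \\
\mathrm{FFN}_{\mathrm{SwiGLU}}(x,W,V,W_2)&=(\mathrm{Swish}_1(xW)\otimes xV)W_2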
\end{aligned}\tag{6}
$$
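
A minimal sketch of the gated FFN layers in Eq. (6), again in plain Python for a single input vector. The helper names (`vec_mat`, `ffn_gated`, and the per-variant wrappers) and the toy weights are hypothetical; the gate and value projections are multiplied element-wise and then projected by `W2`.

```python
import math

def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n x m), giving a length-m list."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish1(z):
    # Swish with beta = 1, as used by FFN_SwiGLU.
    return z * sigmoid(z)

def ffn_gated(x, W, V, W2, act):
    """(act(xW) * xV) W2 with an element-wise product and no biases."""
    gate = vec_mat(x, W)
    value = vec_mat(x, V)
    hidden = [act(g) * v for g, v in zip(gate, value)]
    return vec_mat(hidden, W2)

def ffn_glu(x, W, V, W2):
    return ffn_gated(x, W, V, W2, sigmoid)

def ffn_bilinear(x, W, V, W2):
    return ffn_gated(x, W, V, W2, lambda z: z)

def ffn_reglu(x, W, V, W2):
    return ffn_gated(x, W, V, W2, lambda z: max(0.0, z))

def ffn_swiglu(x, W, V, W2):
    return ffn_gated(x, W, V, W2, swish1)

# Toy weights (assumed): three matrices instead of the two used by the plain FFN.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, -1.0]]
W2 = [[1.0], [1.0]]

print(ffn_bilinear(x, W, V, W2))
print(ffn_swiglu(x, W, V, W2))
```
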
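Compared with the original FFN, which has two weight matrices, these layers have three. To keep the parameter count and the amount of computation unchanged, the hidden dimension $d_{ff}$ is usually reduced to $\frac{2}{3}$ of its original value.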
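
The factor $\frac{2}{3}$ can be checked by counting weights: a bias-free FFN has $2\,d_{model}\,d_{ff}$ parameters, while a gated FFN with hidden size $d_{hidden}$ has $3\,d_{model}\,d_{hidden}$, and the two match when $d_{hidden}=\frac{2}{3}d_{ff}$. The sizes below are illustrative assumptions, not values taken from this post.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # W1 is d_model x d_ff and W2 is d_ff x d_model (no biases).
    return 2 * d_model * d_ff

def gated_ffn_params(d_model: int, d_hidden: int) -> int:
    # W and V are d_model x d_hidden, W2 is d_hidden x d_model (no biases).
    return 3 * d_model * d_hidden

d_model, d_ff = 768, 3072      # assumed example sizes
d_hidden = 2 * d_ff // 3       # hidden size shrunk to two thirds

print(d_hidden)
print(ffn_params(d_model, d_ff), gated_ffn_params(d_model, d_hidden))
```

When $d_{ff}$ is not divisible by 3 the match is only approximate, since the hidden size has to be rounded to an integer.
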

How to cite

Please cite as:

li,wanye. "GLU变体可提升Transformer". wyli'Blog (Apr 2024). https://www.robotech.ink/index.php/archives/387.html

Or cite with BibTeX:

@online{eaiStar-387, title={GLU变体可提升Transformer}, author={li,wanye}, year={2024}, month={Apr}, url="https://www.robotech.ink/index.php/archives/387.html" }
