Eli Bendersky's
Sparsely-gated Mixture Of Experts (MoE)
In transformer models, the attention block is typically followed by a feed forward layer (FF), which is a simple fully-connected
NN with a hidden layer and nonlinearity. Here's the code for such a block that
uses ReLU:

def feed_forward_relu(x, W1, W2):
    """Feed-forward layer with ReLU activation.

    Args:
        x: Input …