Sparsely-gated Mixture Of Experts (MoE)

In transformer models, the attention block is typically followed by a feed-forward (FF) layer, which is a simple fully connected neural network with a hidden layer and a nonlinearity. Here's the code for such a block that uses ReLU:

    def feed_forward_relu(x, W1, W2):
        """Feed-forward layer with ReLU activation.

        Args:
            x: Input …
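
For reference, here is a minimal, self-contained sketch of what such a feed-forward block might compute. It is not the original listing. The function name `feed_forward_relu_sketch`, the use of NumPy, the weight shapes `(D, DH)` and `(DH, D)`, and the omission of bias terms are all assumptions made for illustration.

```python
import numpy as np


def feed_forward_relu_sketch(x, W1, W2):
    """Hypothetical two-layer feed-forward block with ReLU (no biases).

    Assumed shapes (not taken from the original code):
      x:  (B, N, D)  batch, sequence length, model dimension
      W1: (D, DH)    projection into the hidden dimension
      W2: (DH, D)    projection back to the model dimension
    Returns an array of shape (B, N, D).
    """
    hidden = np.maximum(0, x @ W1)  # ReLU applied element-wise
    return hidden @ W2


# Small usage example with random weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))
W1 = rng.standard_normal((8, 32))
W2 = rng.standard_normal((32, 8))
y = feed_forward_relu_sketch(x, W1, W2)
print(y.shape)  # (2, 4, 8)
```

The `@` operator broadcasts over the leading batch and sequence dimensions, so every token is passed through the same two matrices independently.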